Parsing syntaxes to RDF Triples

Introduction

The typical sequence of operations to parse is to create a parser object, set various handlers and options, start the parsing, send some syntax content to the parser object, finish the parsing and destroy the parser object. Several parts of this process are optional, including actually using the triple results; parsing without using them is useful as a syntax check.

Create the Parser object

The parser can be created directly from a known name such as rdfxml for the W3C Recommendation RDF/XML syntax:

    raptor_parser* rdf_parser;
    rdf_parser = raptor_new_parser(world, "rdfxml");

or the name can be discovered from a description as discussed in Querying Functionality.

The parser can also be created by identifying the syntax by a URI, specifying the syntax by a MIME Type, providing an identifier for the content such as a filename or URI string, or giving some initial content bytes that can be used to guess. Using the raptor_new_parser_for_content() function, all of these can be given as optional parameters, using NULL or 0 for undefined parameters. The constructor will then use as much of this information as possible.

    raptor_parser* rdf_parser;

Create a parser that reads the MIME Type for RDF/XML, application/rdf+xml:

    rdf_parser = raptor_new_parser_for_content(world, NULL, "application/rdf+xml", NULL, 0, NULL);

Create a parser that can read a syntax identified by the URI for Turtle, http://www.dajobe.org/2004/01/turtle/, which has no registered MIME Type at this date:

    syntax_uri = raptor_new_uri(world, "http://www.dajobe.org/2004/01/turtle/");
    rdf_parser = raptor_new_parser_for_content(world, syntax_uri, NULL, NULL, 0, NULL);

Create a parser that recognises the identifier foo.rss:

    rdf_parser = raptor_new_parser_for_content(world, NULL, NULL, NULL, 0, "foo.rss");

Create a parser that recognises the content in buffer:

    rdf_parser = raptor_new_parser_for_content(world, NULL, NULL, buffer, len, NULL);

Any of the constructor calls can return NULL if no matching parser could be found, or if the construction failed in another way.

Parser options

There are several options that can be set on parsers. The exact list of options can be found at run time via the Querying Functionality or in the API reference for raptor_option. Options are integer enumerations of the raptor_option enum and have typed values that are either booleans, integers or strings. The function that sets options for parsers is raptor_parser_set_option(), used as follows:

    /* Set a boolean or integer valued option to value 1 */
    raptor_parser_set_option(rdf_parser, option, NULL, 1);

    /* Set a string valued option to value "abc" */
    raptor_parser_set_option(rdf_parser, option, "abc", -1);

There is a corresponding function for reading the values of parser options, raptor_parser_get_option(), which takes the option enumeration parameter and returns the boolean / integer or string value into the appropriate pointer argument.

    /* Get a boolean or integer option value */
    int int_var;
    raptor_parser_get_option(rdf_parser, option, NULL, &int_var);

    /* Get a string option value */
    char* string_var;
    raptor_parser_get_option(rdf_parser, option, &string_var, NULL);

Set RDF statement callback handler

The main reason to parse a syntax is to get RDF triples returned. These are returned by a user-defined handler function which is called with a user data pointer and a raptor statement, which includes the triple terms plus the optional named graph term. The handler is set with raptor_parser_set_statement_handler() as follows:

    void statement_handler(void* user_data, const raptor_statement* statement)
    {
      /* do something with the statement */
    }

    raptor_parser_set_statement_handler(rdf_parser, user_data, statement_handler);

Setting a statement handler function is optional, since parsing without returning statements is a valid use, such as when parsing in order to validate a syntax.

Set parsing log message handlers

Any time before parsing is called, a log handler can be set on the world object via the raptor_world_set_log_handler() method to report errors and warnings from parsing. The method takes a user data argument plus a handler callback of type raptor_log_handler with a signature that looks like this:

    void message_handler(void *user_data, raptor_log_message* message)
    {
      /* do something with the message */
    }

The handler gets the user data pointer as well as a raptor_log_message pointer that includes the log level, a raptor_locator with the associated location information, and the log message itself. The locator structure contains full information on the details of where in the file or URI the message occurred.

Set the identifier creator handler

Identifiers are created in some parsers by generating them automatically or via hints given in a syntax. Raptor can customise this process using a user-supplied identifier handler function. For example, in RDF/XML, generated blank node identifiers and those specified with rdf:nodeID are passed through this process. Setting a handler allows the identifier generation mechanism to be fully replaced. A lighter alternative is to use raptor_world_set_generate_bnodeid_parameters() to adjust the default algorithm for generated identifiers.

The handler is set on the world object as follows:

    raptor_generate_bnodeid_handler bnodeid_handler;

    raptor_world_set_generate_bnodeid_handler(world, user_data, bnodeid_handler);

The bnodeid_handler takes the following signature:

    unsigned char* generate_id_handler(void* user_data, unsigned char* user_id)
    {
      /* return a new generated ID based on user_id (optional) */
    }

where user_id is an optional user-supplied identifier, such as the value of an rdf:nodeID in RDF/XML.
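A handler with the signature above might look like the following sketch. The bnode_state structure and its prefix are hypothetical application data. Two assumptions are made about memory ownership and should be checked against the API reference: that a supplied user_id arrives heap-allocated and can be returned unchanged, and that the returned string is allocated with malloc() and released later by the library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical application state passed as user_data */
typedef struct {
  const char* prefix;   /* text placed before the number, e.g. "app" */
  unsigned long next;   /* next number to hand out */
} bnode_state;

/* Same signature as the generate_id_handler shown above */
unsigned char*
generate_id_handler(void* user_data, unsigned char* user_id)
{
  bnode_state* state = (bnode_state*)user_data;
  size_t len;
  char* id;

  assert(state != NULL && state->prefix != NULL);

  /* Assumption: an ID supplied by the syntax (such as an rdf:nodeID
   * value) may be returned as-is, passing ownership straight back. */
  if(user_id)
    return user_id;

  /* 20 digits cover a 64-bit unsigned long; +1 for the terminating NUL */
  len = strlen(state->prefix) + 21;
  id = (char*)malloc(len);
  if(!id)
    return NULL;

  snprintf(id, len, "%s%lu", state->prefix, state->next++);
  return (unsigned char*)id;
}
```

With a state of { "app", 1 }, successive calls without a user_id yield app1, app2 and so on, while identifiers written in the document are kept unchanged.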

Set namespace declared handler

Raptor can report when namespace prefixes and URIs are declared during parsing of a syntax such as XML, RDF/XML or Turtle. A handler function can be set to receive these declarations using the namespace handler method.

    raptor_namespace_handler namespaces_handler;

    raptor_parser_set_namespace_handler(rdf_parser, user_data, namespaces_handler);

The namespaces_handler takes the following signature:

    void namespaces_handler(void* user_data, raptor_namespace *nspace)
    {
      /* do something with the namespace */
    }

This may be called multiple times with the same namespace, if the namespace is declared inside different XML sub-trees.

Set the parsing strictness

raptor_parser_set_option() with option RAPTOR_OPTION_STRICT allows setting of the parser strictness flag. The default is lax parsing, which accepts older or deprecated syntax forms but may generate a warning. Setting the option to non-0 (true) will cause parser errors to be generated in these cases.

Provide syntax content to parse

The operation of turning syntax into RDF triples has several alternatives, from functions that do most of the work starting from a URI to functions that allow passing in data buffers.

Parsing and MIME Types

The MIME type of the retrieved content is not used to choose a parser unless the parser is of type guess. The guess parser will send an Accept: header for all known parser syntax MIME types (if a URI request is made) and, based on the response, including the identifiers used, pick the appropriate parser to execute. See raptor_world_guess_parser_name() for a full discussion of the inputs to the guessing.

Parse the content from a URI (raptor_parser_parse_uri())

The URI is resolved and the content read from it and passed to the parser:

    raptor_parser_parse_uri(rdf_parser, uri, base_uri);

The base_uri is optional (can be NULL) and will default to the uri.

Parse the content of a URI using an existing WWW connection (raptor_parser_parse_uri_with_connection())

The URI is resolved using an existing WWW connection (for example a libcurl CURL handle) to allow for any existing WWW configuration to be reused. See raptor_new_www_with_connection() for full details of how this works. The content is then read from the result of resolving the URI:

    raptor_parser_parse_uri_with_connection(rdf_parser, uri, base_uri, connection);

The base_uri is optional (can be NULL) and will default to the uri.

Parse the content of a C FILE* (raptor_parser_parse_file_stream())

Parsing can read from a C STDIO file handle:

    stream = fopen(filename, "rb");
    raptor_parser_parse_file_stream(rdf_parser, stream, filename, base_uri);
    fclose(stream);

This function can take an optional filename which is used in locator error messages. The base_uri may be required by some parsers and if NULL will cause the parsing to fail. This requirement can be checked by looking at the flags in the parser description using raptor_world_get_parser_description().

Parse the content of a file URI (raptor_parser_parse_file())

Parsing can read from a URI known to be a file: URI:

    raptor_parser_parse_file(rdf_parser, file_uri, base_uri);

This function requires that the file_uri is a file URI, that is, raptor_uri_uri_string_is_file_uri(raptor_uri_as_string(file_uri)) must be true. The base_uri may be required by some parsers and if NULL will cause the parsing to fail.

Parse chunks of syntax content provided by the application (raptor_parser_parse_start() and raptor_parser_parse_chunk())

    raptor_parser_parse_start(rdf_parser, base_uri);

    while(/* not finished getting content */) {
      unsigned char *buffer;
      size_t buffer_len;

      /* ... obtain some syntax content in buffer of size buffer_len bytes ... */

      raptor_parser_parse_chunk(rdf_parser, buffer, buffer_len, 0);
    }

    raptor_parser_parse_chunk(rdf_parser, NULL, 0, 1); /* no data and is_end = 1 */

The base_uri argument to raptor_parser_parse_start() may be required by some parsers and if NULL will cause the parsing to fail.

On the last raptor_parser_parse_chunk() call, or after the loop is ended, the is_end parameter must be set to non-0. Content can be passed with the final call. If no content is present at the end (such as in some kind of end of file situation), then a 0-length buffer_len or NULL buffer can be used.

The minimal case is an entire parse in one chunk as follows:

    raptor_parser_parse_start(rdf_parser, base_uri);
    raptor_parser_parse_chunk(rdf_parser, buffer, buffer_len, 1); /* is_end = 1 */
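The "not finished getting content" condition in the loop above depends on where the bytes come from. As one possibility, the sketch below reads a C FILE* in fixed-size blocks and makes sure that exactly one call carries is_end = 1. To keep it independent of any particular parser, the bytes go to a callback; the chunk_sink type, feed_stream_in_chunks(), and the tally_sink stand-in are hypothetical names invented for this example, not part of Raptor.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sink type: takes the same buffer, length and is_end
 * arguments as raptor_parser_parse_chunk(), plus a context pointer.
 * Returns non-0 on failure. */
typedef int (*chunk_sink)(void* ctx, const unsigned char* buffer,
                          size_t len, int is_end);

/* Read stream to end of file, passing each block to sink.
 * Exactly one call, the last, has is_end = 1 (possibly with 0 bytes).
 * Returns 0 on success, non-0 if reading or the sink failed. */
int
feed_stream_in_chunks(FILE* stream, chunk_sink sink, void* ctx)
{
  unsigned char buffer[4096];

  assert(stream != NULL && sink != NULL);

  for(;;) {
    size_t len = fread(buffer, 1, sizeof(buffer), stream);
    int is_end = (len < sizeof(buffer));  /* short read: end of file or error */

    if(ferror(stream))
      return 1;
    if(sink(ctx, buffer, len, is_end))
      return 1;
    if(is_end)
      return 0;
  }
}

/* Stand-in sink for trying the loop without a parser: it only tallies */
typedef struct {
  size_t bytes;  /* total bytes received */
  int calls;     /* number of sink calls */
  int ended;     /* number of calls with is_end set */
} sink_tally;

int
tally_sink(void* ctx, const unsigned char* buffer, size_t len, int is_end)
{
  sink_tally* tally = (sink_tally*)ctx;
  (void)buffer;
  tally->bytes += len;
  tally->calls++;
  if(is_end)
    tally->ended++;
  return 0;
}
```

With Raptor, the sink would be a thin wrapper that casts ctx to raptor_parser* and returns the result of raptor_parser_parse_chunk(), with raptor_parser_parse_start() called once before feeding begins.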

Restrict parser network access

Parsing can cause network requests to be performed, especially if a URI is given as an argument, such as with raptor_parser_parse_uri(). There may also be indirect requests, such as with the GRDDL parser, which retrieves URIs depending on the results of initial parse requests. Some of the requested URIs may be unwanted or may need to be filtered, and this can be done in three ways.

Filtering parser network requests with option RAPTOR_OPTION_NO_NET

The parser option RAPTOR_OPTION_NO_NET can be set with raptor_parser_set_option() and forbids all network requests. There is no customisation with this approach; for that, see the URI filter in the next section.

    rdf_parser = raptor_new_parser(world, "rdfxml");

    /* Disable internal network requests */
    raptor_parser_set_option(rdf_parser, RAPTOR_OPTION_NO_NET, NULL, 1);

Filtering parser network requests with raptor_www_set_uri_filter()

raptor_www_set_uri_filter() allows setting of a filtering function to operate on all URIs retrieved by a WWW connection. This connection can be used in parsing when operated by hand.

    void write_bytes_handler(raptor_www* www, void *user_data,
                             const void *ptr, size_t size, size_t nmemb)
    {
      raptor_parser* rdf_parser = (raptor_parser*)user_data;
      raptor_parser_parse_chunk(rdf_parser, (const unsigned char*)ptr, size*nmemb, 0);
    }

    int uri_filter(void* filter_user_data, raptor_uri* uri)
    {
      /* return non-0 to forbid the request */
    }

    int main(int argc, char *argv[])
    {
      ...

      rdf_parser = raptor_new_parser(world, "rdfxml");
      www = raptor_new_www(world);

      /* filter all URI requests */
      raptor_www_set_uri_filter(www, uri_filter, filter_user_data);

      /* make WWW write bytes to parser */
      raptor_www_set_write_bytes_handler(www, write_bytes_handler, rdf_parser);

      raptor_parser_parse_start(rdf_parser, uri);
      raptor_www_fetch(www, uri);

      /* tell the parser that we are done */
      raptor_parser_parse_chunk(rdf_parser, NULL, 0, 1);

      raptor_free_www(www);
      raptor_free_parser(rdf_parser);

      ...
    }

Filtering parser network requests with raptor_parser_set_uri_filter()

raptor_parser_set_uri_filter() allows setting of a filtering function to operate on all URIs that the parser sees. This operates on the internal raptor_www object used inside parsing to retrieve URIs, similar to that described in the previous section.

    int uri_filter(void* filter_user_data, raptor_uri* uri)
    {
      /* return non-0 to forbid the request */
    }

    rdf_parser = raptor_new_parser(world, "rdfxml");

    raptor_parser_set_uri_filter(rdf_parser, uri_filter, filter_user_data);

    /* parse content as normal */
    raptor_parser_parse_uri(rdf_parser, uri, base_uri);
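What goes inside uri_filter is up to the application. One simple policy is an allow-list of URI prefixes, sketched here on plain C strings so that it can be tried on its own. The uri_policy structure and uri_string_is_forbidden() are hypothetical names for this example; they follow the return convention stated above (non-0 forbids the request).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy data, suitable for passing as filter_user_data */
typedef struct {
  const char* const* allowed_prefixes;  /* NULL-terminated list */
} uri_policy;

/* Returns 0 if uri_string starts with one of the allowed prefixes,
 * non-0 (forbid) otherwise. */
int
uri_string_is_forbidden(const uri_policy* policy, const char* uri_string)
{
  const char* const* prefix;

  assert(policy != NULL && policy->allowed_prefixes != NULL);
  assert(uri_string != NULL);

  for(prefix = policy->allowed_prefixes; *prefix; prefix++) {
    if(strncmp(uri_string, *prefix, strlen(*prefix)) == 0)
      return 0;  /* matched an allowed prefix: permit the request */
  }
  return 1;  /* nothing matched: forbid the request */
}
```

Inside a uri_filter callback this could be called as uri_string_is_forbidden(filter_user_data, (const char*)raptor_uri_as_string(uri)). Ending each prefix with "/" avoids accidentally matching a longer host name that merely begins with an allowed one.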

Setting timeout for parser network requests with option RAPTOR_OPTION_WWW_TIMEOUT

If the option RAPTOR_OPTION_WWW_TIMEOUT is set to a number greater than 0, it is used as the timeout in seconds for retrieving URIs during parsing (primarily for GRDDL). This uses raptor_www_set_connection_timeout() internally.

    rdf_parser = raptor_new_parser(world, "grddl");

    /* set internal URI retrieval maximum time to 5 seconds */
    raptor_parser_set_option(rdf_parser, RAPTOR_OPTION_WWW_TIMEOUT, NULL, 5);

Querying parser static information

These methods return information about the constructed parser implementation, corresponding to the information available via raptor_world_get_parser_description() for all parsers.

raptor_parser_get_name() returns the parser syntax name. raptor_parser_get_description() returns more detailed description fields, including the long label and the mime_types for the parser with quality levels.

raptor_parser_get_accept_header() returns a string that would be sent in an HTTP request Accept: header for the syntaxes accepted by this parser only.

Querying parser run-time information

raptor_parser_get_locator() returns the raptor_locator for the current position in the input stream. The locator structure contains full information on the details of where in the file or URI the current parser has reached.

Aborting parsing

raptor_parser_parse_abort() allows the current parsing to be aborted, at which point no further triples will be passed to callbacks and the parser will attempt to return control to the application. This is most useful when called inside a handler function, which allows the application to decide to stop an active parse.

Destroy the parser

To tidy up, delete the parser object as follows:

    raptor_free_parser(rdf_parser);

Parsing example code

rdfprint.c: Parse an RDF/XML file and print the triples

Compile it like this:

    $ gcc -o rdfprint rdfprint.c `pkg-config raptor2 --cflags --libs`

and run it on an RDF file as:

    $ ./rdfprint raptor.rdf
    _:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://usefulinc.com/ns/doap#Project> .
    _:genid1 <http://usefulinc.com/ns/doap#name> "Raptor" .
    _:genid1 <http://usefulinc.com/ns/doap#homepage> <http://librdf.org/raptor/> .
    ...