diff options
Diffstat (limited to 'docs/raptor-tutorial-parsing.xml')
-rw-r--r-- | docs/raptor-tutorial-parsing.xml | 642 |
1 files changed, 642 insertions, 0 deletions
diff --git a/docs/raptor-tutorial-parsing.xml b/docs/raptor-tutorial-parsing.xml new file mode 100644 index 0000000..0112731 --- /dev/null +++ b/docs/raptor-tutorial-parsing.xml @@ -0,0 +1,642 @@ +<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" + "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd"> +<chapter id="tutorial-parsing" xmlns:xi="http://www.w3.org/2003/XInclude"> +<title>Parsing syntaxes to RDF Triples</title> + +<section id="tutorial-parsing-intro"> +<title>Introduction</title> + +<para> +The typical sequence of operations to parse is to create a parser +object, set various handlers and options, start the parsing, send +some syntax content to the parser object, finish the parsing and +destroy the parser object.</para> + +<para>Several parts of this process are optional, including actually +using the triple results, which is useful as a syntax checking +process. +</para> +</section> + +<section id="tutorial-parser-create"> +<title>Create the Parser object</title> + +<para>The parser can be created directly from a known name such as +<literal>rdfxml</literal> for the W3C Recommendation RDF/XML syntax: +<programlisting> + raptor_parser* rdf_parser; + + rdf_parser = raptor_new_parser(world, "rdfxml"); +</programlisting> +or the name can be discovered from an <emphasis>description</emphasis> +as discussed in <link linkend="tutorial-querying-functionality">Querying Functionality</link> +</para> + +<para>The parser can also be created by identifying the syntax by a +URI, specifying the syntax by a MIME Type, providng an identifier for +the content such as filename or URI string or giving some initial +content bytes that can be used to guess. +Using the +<link linkend="raptor-new-parser-for-content"><function>raptor_new_parser_for_content()</function></link> +function, all of these can be given as optional parameters, using NULL +or 0 for undefined parameters. The constructor will then use as much of +this information as possible. +</para> +<programlisting> + raptor_parser* rdf_parser; +</programlisting> + +<para>Create a parser that reads the MIME Type for RDF/XML +<literal>application/rdf+xml</literal> +<programlisting> + rdf_parser = raptor_new_parser_for_content(world, NULL, "application/rdf+xml", NULL, 0, NULL); +</programlisting> +</para> + +<para>Create a parser that can read a syntax identified by the URI +for Turtle <literal>http://www.dajobe.org/2004/01/turtle/</literal>, +which has no registered MIME Type at this date: +<programlisting> + syntax_uri = raptor_new_uri(world, "http://www.dajobe.org/2004/01/turtle/"); + + rdf_parser = raptor_new_parser_for_content(world, syntax_uri, NULL, NULL, 0, NULL); +</programlisting> +</para> + +<para>Create a parser that recognises the identifier <literal>foo.rss</literal>: +<programlisting> + rdf_parser = raptor_new_parser_for_content(world, NULL, NULL, NULL, 0, "foo.rss"); +</programlisting> +</para> + +<para>Create a parser that recognises the content in <emphasis>buffer</emphasis>: +<programlisting> + rdf_parser = raptor_new_parser_for_content(world, NULL, NULL, buffer, len, NULL); +</programlisting> +</para> + +<para>Any of the constructor calls can return NULL if no matching +parser could be found, or the construction failed in another way. +</para> + +</section> + + +<section id="tutorial-parser-features"> +<title>Parser options</title> + +<para>There are several +<emphasis> options</emphasis> that can be set on parsers. +The exact list of options can be found at run time via the +<link linkend="tutorial-querying-functionality">Querying Functionality</link> +or in the API reference for +<link linkend="raptor-option"><literal>raptor_option</literal></link>. +</para> + +<para>Options are integer enumerations of the + <link linkend="raptor-option"><type>raptor_option</type></link> enum and have +typed values that are either booleans, integers or strings. +The function that sets options for parsers is +<link linkend="raptor-parser-set-option">raptor_parser_set_option()</link> +used as follows: +<programlisting> + /* Set a boolean or integer valued option to value 1 */ + raptor_parser_set_option(rdf_parser, option, NULL, 1); + + /* Set a string valued option to value "abc" */ + raptor_parser_set_option(rdf_parser, option, "abc", -1); +</programlisting> +</para> + +<para> +There is a corresponding function for reading the values of parser +option +<link linkend="raptor-parser-get-option"><function>raptor_parser_get_option()</function></link> +which takes the option enumeration parameter and returns the boolean / +integer or string value correspondingly into the appropriate pointer +argument. +<programlisting> + /* Get a boolean or integer option value */ + int int_var; + raptor_parser_get_option(rdf_parser, option, NULL, &int_var); + + /* Get a string option value */ + char* string_var; + raptor_parser_get_option(rdf_parser, option, &string_var, NULL); +</programlisting> +</para> + +</section> + + +<section id="tutorial-parser-set-triple-handler"> +<title>Set RDF statement callback handler</title> + +<para>The main reason to parse a syntax is to get RDF triples +returned and these are return by a user-defined handler function +which is called with parameters of a user data pointer and a +raptor statement, which includes the triple terms plus the +optional named graph term. The handler is set with +<link linkend="raptor-parser-set-statement-handler"><function>raptor_parser_set_statement_handler()</function></link> +as follows: +<programlisting> + void + statement_handler(void* user_data, const raptor_statement* statement) + { + /* do something with the statement */ + } + + raptor_parser_set_statement_handler(rdf_parser, user_data, statements_handler); +</programlisting> +</para> + +<para>Setting a stateemnt handler function is optional since parsing +without returning statements is a valid use, such as when parsing in +order to validate a syntax. +</para> +</section> + + +<section id="tutorial-parser-set-error-warning-handlers"> +<title>Set parsing log message handlers</title> + +<para>Any time before parsing is called, a log handler can be set +on the world object via the +<link linkend="raptor-world-set-log-handler"><function>raptor_world_set_log_handler()</function></link> +method to report errors and warnings from parsing. +The method takes a user data argument plus a handler callback of type +<link linkend="raptor-log-handler"><type>raptor_log_handler</type></link> +with a signature that looks like this: +<programlisting> +void +message_handler(void *user_data, raptor_log_message* message) +{ + /* do something with the message */ +} +</programlisting> +The handler gets the user data pointer as well as a +<link linkend="raptor-log-message"><type>raptor_log_handler</type></link> +pointer that includes associated location information, such as the +log level, +<link linkend="raptor-locator"><type>raptor_locator</type></link>, +and the log message itself. The <emphasis>locator</emphasis> +structure contains full information on the details of where in the +file or URI the message occurred. +</para> + +</section> + + +<section id="tutorial-parser-set-id-handler"> +<title>Set the identifier creator handler</title> + +<para>Identifiers are created in some parsers by generating them +automatically or via hints given a syntax. Raptor can customise this +process using a user-supplied identifier handler function. +For example, in RDF/XML generated blank node identifiers and those +those specified <literal>rdf:nodeID</literal> are passed through this +process. Setting a handler allows the identifier generation mechanism to be +fully replaced. A lighter alternative is to use +<link linkend="raptor-world-set-generate-bnodeid-parameters"><function>raptor_world_set_generate_bnodeid_parameters()</function></link> +to adjust the default algorithm for generated identifiers. +</para> + +<para>It is used as follows +<programlisting> + raptor_generate_bnodeid_handler bnodeid_handler; + + raptor_world_set_generate_bnodeid_handler(rdf_parser, user_data, bnodeid_handler); +</programlisting> +</para> + +<para>The <emphasis>bnodeid_handler</emphasis> takes the following signature: +<programlisting> +unsigned char* +generate_id_handler(void* user_data, unsigned char* user_id) +{ + /* return a new generated ID based on user_id (optional) */ +} +</programlisting> +where <emphasis>user_id</emphasis> an optional user-supplied identifier, +such as the value of a <literal>rdf:nodeID</literal> in RDF/XML. +</para> + +</section> + + +<section id="tutorial-parser-set-namespace-handler"> +<title>Set namespace declared handler</title> + +<para>Raptor can report when namespace prefix/URIs are declared in +during parsing a syntax such as those in XML, RDF/XML or Turtle. +A handler function can be set to receive these declarations using +the namespace handler method. +<programlisting> + raptor_namespace_handler namespaces_handler; + + raptor_parser_set_namespace_handler(rdf_parser, user_data, namespaces_handler); +</programlisting> +</para> + +<para>The <emphasis>namespaces_handler</emphasis> takes the following signature: +<programlisting> +void +namespaces_handler(void* user_data, raptor_namespace *nspace) +{ + /* */ +} +</programlisting> +<note>This may be called multiple times with the same namespace, +if the namespace is declared inside different XML sub-trees. +</note> +</para> + +</section> + + +<section id="tutorial-parse-strictness"> +<title>Set the parsing strictness</title> +<para> +<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link> +with option +<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link> +allows setting of the parser strictness flag. The default is lax parsing, +accepting older or deprecated syntax forms but may generate a warning. Setting +to non-0 (true) will cause parser errors to be generated in these cases. +</para> +</section> + + +<section id="tutorial-parser-content"> +<title>Provide syntax content to parse</title> + +<para>The operation of turning syntax into RDF triples has several +alternatives from functions that do most of the work starting from a +URI to functions that allow passing in data buffers.</para> + +<note> +<title>Parsing and MIME Types</title> +The mime type of the retrieved content is not used to choose +a parser unless the parser is of type <literal>guess</literal>. +The guess parser will send an <literal>Accept:</literal> header +for all known parser syntax mime types (if a URI request is made) +and based on the response, including the identifiers used, +pick the appropriate parser to execute. See +<link linkend="raptor-world-guess-parser-name"><function>raptor_world_guess_parser_name()</function></link> +for a full discussion of the inputs to the guessing. +</note> + + +<section id="parse-from-uri"> +<title>Parse the content from a URI (<link linkend="raptor-parser-parse-uri"><function>raptor_parser_parse_uri()</function></link>)</title> + +<para>The URI is resolved and the content read from it and passed to +the parser: +<programlisting> + raptor_parser_parse_uri(rdf_parser, uri, base_uri); +</programlisting> +The <emphasis>base_uri</emphasis> is optional (can be +<literal>NULL</literal>) and will default to the +<emphasis>uri</emphasis>. +</para> +</section> + + +<section id="parse-from-www"> +<title>Parse the content of a URI using an existing WWW connection (<link linkend="raptor-parser-parse-uri-with-connection"><function>raptor_parser_parse_uri_with_connection()</function></link>)</title> + +<para>The URI is resolved using an existing WWW connection (for +example a libcurl CURL handle) to allow for any existing +WWW configuration to be reused. See +<link linkend="raptor-new-www-with-connection"><function>raptor_new_www_with_connection</function></link> +for full details of how this works. The content is then read from the +result of resolving the URI: +<programlisting> + raptor_parser_parse_uri_with_connection(rdf_parser, uri, base_uri, connection); +</programlisting> +The <emphasis>base_uri</emphasis> is optional (can be +<literal>NULL</literal>) and will default to the +<emphasis>uri</emphasis>. +</para> +</section> + + +<section id="parse-from-filehandle"> +<title>Parse the content of a C <literal>FILE*</literal> (<link linkend="raptor-parser-parse-file-stream"><function>raptor_parser_parse_file_stream()</function></link>)</title> + +<para>Parsing can read from a C STDIO file handle: +<programlisting> + stream = fopen(filename, "rb"); + raptor_parser_parse_file_stream(rdf_parser, stream, filename, base_uri); + fclose(stream); +</programlisting> +This function can use take an optional <emphasis>filename</emphasis> which +is used in locator error messages. +The <emphasis>base_uri</emphasis> may be required by some parsers +and if <literal>NULL</literal> will cause the parsing to fail. +This requirement can be checked by looking at the flags in +the parser description using +<link linkend="raptor-world-get-parser-description"><function>raptor_world_get_parser_description()</function></link>. +</para> +</section> + + +<section id="parse-from-file-uri"> +<title>Parse the content of a file URI (<link linkend="raptor-parser-parse-file"><function>raptor_parser_parse_file()</function></link>)</title> + +<para>Parsing can read from a URI known to be a <literal>file:</literal> URI: +<programlisting> + raptor_parser_parse_file(rdf_parser, file_uri, base_uri); +</programlisting> +This function requires that the <emphasis>file_uri</emphasis> is +a file URI, that is +<literal>raptor_uri_uri_string_is_file_uri( raptor_uri_as_string( file_uri) )</literal> +must be true. +The <emphasis>base_uri</emphasis> may be required by some parsers +and if <literal>NULL</literal> will cause the parsing to fail. +</para> +</section> + + +<section id="parse-from-chunks"> +<title>Parse chunks of syntax content provided by the application (<link linkend="raptor-parser-parse-start"><function>raptor_parser_parse_start()</function></link> and <link linkend="raptor-parser-parse-chunk"><function>raptor_parser_parse_chunk()</function></link>)</title> + +<para> +<programlisting> + raptor_parser_parse_start(rdf_parser, base_uri); + while(/* not finished getting content */) { + unsigned char *buffer; + size_t buffer_len; + + /* ... obtain some syntax content in buffer of size buffer_len bytes ... */ + + raptor_parser_parse_chunk(rdf_parser, buffer, buffer_len, 0); + } + raptor_parser_parse_chunk(rdf_parser, NULL, 0, 1); /* no data and is_end = 1 */ +</programlisting> +The <emphasis>base_uri</emphasis> argument to +<link linkend="raptor-parser-parse-start"><function>raptor_parser_parse_start()</function></link> +may be required by some parsers +and if <literal>NULL</literal> will cause the parsing to fail. +</para> + +<para>On the last +<link linkend="raptor-parser-parse-chunk"><function>raptor_parser_parse_chunk()</function></link> +call, or after the loop is ended, the <literal>is_end</literal> +parameter must be set to non-0. Content can be passed with the +final call. If no content is present at the end (such as in +some kind of <quote>end of file</quote> situation), then a 0-length +buffer_len or NULL buffer can be used.</para> + +<para>The minimal case is an entire parse in one chunk as follows:</para> +<programlisting> + raptor_parser_parse_start(rdf_parser, base_uri); + raptor_parser_parse_chunk(rdf_parser, buffer, buffer_len, 1); /* is_end = 1 */ +</programlisting> + +</section> + +</section> + + +<section id="restrict-parser-network-access"> +<title>Restrict parser network access</title> + +<para> +Parsing can cause network requests to be performed, especially +if a URI is given as an argument such as with +<link linkend="raptor-parser-parse-uri"><function>raptor_parser_parse_uri()</function></link> +however there may also be indirect requests such as with the +GRDDL parser that retrieves URIs depending on the results of +initial parse requests. The URIs requested may not be wanted +to be fetched or need to be filtered, and this can be done in +three ways. +</para> + +<section id="tutorial-filter-network-with-feature"> +<title>Filtering parser network requests with option <link linkend="RAPTOR-OPTION-NO-NET:CAPS"><literal>RAPTOR_OPTION_NO_NET</literal></link></title> +<para> +The parser option +<link linkend="RAPTOR-OPTION-NO-NET:CAPS"><literal>RAPTOR_OPTION_NO_NET</literal></link> +can be set with +<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link> +and forbids all network requests. There is no customisation with +this approach, for that see the URI filter in the next section. +</para> + +<programlisting> + rdf_parser = raptor_new_parser(world, "rdfxml"); + + /* Disable internal network requests */ + raptor_parser_set_option(rdf_parser, RAPTOR_OPTION_NO_NET, NULL, 1); +</programlisting> + +</section> + + +<section id="tutorial-filter-network-www-uri-filter"> +<title>Filtering parser network requests with <link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link></title> +<para> +The +<link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link> + +allows setting of a filtering function to operate on all URIs +retrieved by a WWW connection. This connection can be used in +parsing when operated by hand. +</para> + +<programlisting> +void write_bytes_handler(raptor_www* www, void *user_data, + const void *ptr, size_t size, size_t nmemb) { +{ + raptor_parser* rdf_parser = (raptor_parser*)user_data; + + raptor_parser_parse_chunk(rdf_parser, (unsigned char*)ptr, size*nmemb, 0); +} + +int uri_filter(void* filter_user_data, raptor_uri* uri) { + /* return non-0 to forbid the request */ +} + +int main(int argc, char *argv[]) { + ... + + rdf_parser = raptor_new_parser(world, "rdfxml"); + www = raptor_new_www(world); + + /* filter all URI requests */ + raptor_www_set_uri_filter(www, uri_filter, filter_user_data); + + /* make WWW write bytes to parser */ + raptor_www_set_write_bytes_handler(www, write_bytes_handler, rdf_parser); + + raptor_parser_parse_start(rdf_parser, uri); + raptor_www_fetch(www, uri); + /* tell the parser that we are done */ + raptor_parser_parse_chunk(rdf_parser, NULL, 0, 1); + + raptor_free_www(www); + raptor_free_parser(rdf_parser); + + ... +} + +</programlisting> + +</section> + + +<section id="tutorial-filter-network-parser-uri-filter"> +<title>Filtering parser network requests with <link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link></title> + +<para> +The +<link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link> +allows setting of a filtering function to operate on all URIs that +the parser sees. This operates on the internal raptor_www object +used inside parsing to retrieve URIs, similar to that described in +the <link linkend="tutorial-filter-network-www-uri-filter">previous section</link>. +</para> + +<programlisting> + int uri_filter(void* filter_user_data, raptor_uri* uri) { + /* return non-0 to forbid the request */ + } + + rdf_parser = raptor_new_parser(world, "rdfxml"); + + raptor_parser_set_uri_filter(rdf_parser, uri_filter, filter_user_data); + + /* parse content as normal */ + raptor_parser_parse_uri(rdf_parser, uri, base_uri); +</programlisting> + +</section> + + +<section id="tutorial-filter-network-parser-timeout"> +<title>Setting timeout for parser network requests with option <link linkend="RAPTOR-OPTION-WWW-TIMEOUT:CAPS"><literal>RAPTOR_OPTION_WWW_TIMEOUT</literal></link></title> + +<para>If the value of option +<link linkend="RAPTOR-OPTION-WWW-TIMEOUT:CAPS"><literal>RAPTOR_OPTION_WWW_TIMEOUT</literal></link> +if set to a number >0, it is used as the timeout in seconds +for retrieving of URIs during parsing (primarily for GRDDL). +This uses +<link linkend="raptor-www-set-connection-timeout"><function>raptor_www_set_connection_timeout()</function></link> +internally. +</para> + +<programlisting> + rdf_parser = raptor_new_parser(world, "grddl"); + + /* set internal URI retrieval maximum time to 5 seconds */ + raptor_parser_set_option(rdf_parser, RAPTOR_OPTION_WWW_TIMEOUT, NULL, 5); +</programlisting> + +</section> + + +</section> + + +<section id="tutorial-parser-static-info"> +<title>Querying parser static information</title> + +<para> +These methods return information about the constructed parser +implementation corresponding to the information available +via <link linkend="raptor-world-get-parser-description"><function>raptor_world_get_parser_description()</function></link> +for all parsers. +</para> + +<para><link linkend="raptor-parser-get-name"><function>raptor_parser_get_name()</function></link> returns the parser syntax name, +<link linkend="raptor-parser-get-description"><function>raptor_parser_get_description()</function></link> +returns more detailed description fields including the long label and +mime_types for the parser with quality levels. +</para> + +<para><link linkend="raptor-parser-get-accept-header"><function>raptor_parser_get_accept_header()</function></link> +returns a string that would be sent in an HTTP +request <code>Accept:</code> header for the syntaxes accepted by this +parser only. +</para> + +</section> + + +<section id="tutorial-parser-runtime-info"> +<title>Querying parser run-time information</title> + +<para> +<link linkend="raptor-parser-get-locator"><function>raptor_parser_get_locator()</function></link> +returns the <link linkend="raptor-locator"><type>raptor_locator</type></link> +for the current position in the input stream. The <emphasis>locator</emphasis> +structure contains full information on the details of where in the +file or URI the current parser has reached. +</para> +</section> + + +<section id="tutorial-parser-abort"> +<title>Aborting parsing</title> + +<para> +<link linkend="raptor-parser-parse-abort"><function>raptor_parser_parse_abort()</function></link> +allows the current parsing to be aborted, at which point no further +triples will be passed to callbacks and the parser will attempt to +return control to the application. This is most useful when called +inside a handler function which allows the application to decide to stop +an active parsing. +</para> +</section> + + +<section id="tutorial-parser-destroy"> +<title>Destroy the parser</title> + +<para> +To tidy up, delete the parser object as follows: +<programlisting> + raptor_free_parser(rdf_parser); +</programlisting> +</para> + +</section> + + +<section id="tutorial-parser-example"> +<title>Parsing example code</title> + +<example id="raptor-example-rdfprint"> +<title><filename>rdfprint.c</filename>: Parse an RDF/XML file and print the triples</title> +<programlisting> +<xi:include href="rdfprint.c" parse="text"/> +</programlisting> + +<para>Compile it like this: +<screen> +$ gcc -o rdfprint rdfprint.c `pkg-config raptor2 --cflags --libs` +</screen> +and run it on an RDF file as: +<screen> +$ ./rdfprint raptor.rdf +_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://usefulinc.com/ns/doap#Project> . +_:genid1 <http://usefulinc.com/ns/doap#name> "Raptor" . +_:genid1 <http://usefulinc.com/ns/doap#homepage> <http://librdf.org/raptor/> . +... +</screen> +</para> + +</example> + +</section> + +</chapter> + + +<!-- +Local variables: +mode: sgml +sgml-parent-document: ("raptor-docs.xml" "book" "part") +End: +--> |