summaryrefslogtreecommitdiffstats
path: root/docs/raptor-tutorial-parsing.xml
diff options
context:
space:
mode:
Diffstat (limited to 'docs/raptor-tutorial-parsing.xml')
-rw-r--r--docs/raptor-tutorial-parsing.xml642
1 files changed, 642 insertions, 0 deletions
diff --git a/docs/raptor-tutorial-parsing.xml b/docs/raptor-tutorial-parsing.xml
new file mode 100644
index 0000000..0112731
--- /dev/null
+++ b/docs/raptor-tutorial-parsing.xml
@@ -0,0 +1,642 @@
+<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+ "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd">
+<chapter id="tutorial-parsing" xmlns:xi="http://www.w3.org/2003/XInclude">
+<title>Parsing syntaxes to RDF Triples</title>
+
+<section id="tutorial-parsing-intro">
+<title>Introduction</title>
+
+<para>
+The typical sequence of operations to parse is to create a parser
+object, set various handlers and options, start the parsing, send
+some syntax content to the parser object, finish the parsing and
+destroy the parser object.</para>
+
+<para>Several parts of this process are optional, including actually
+using the triple results, which is useful as a syntax checking
+process.
+</para>
+</section>
+
+<section id="tutorial-parser-create">
+<title>Create the Parser object</title>
+
+<para>The parser can be created directly from a known name such as
+<literal>rdfxml</literal> for the W3C Recommendation RDF/XML syntax:
+<programlisting>
+ raptor_parser* rdf_parser;
+
+ rdf_parser = raptor_new_parser(world, "rdfxml");
+</programlisting>
+or the name can be discovered from an <emphasis>description</emphasis>
+as discussed in <link linkend="tutorial-querying-functionality">Querying Functionality</link>
+</para>
+
+<para>The parser can also be created by identifying the syntax by a
+URI, specifying the syntax by a MIME Type, providng an identifier for
+the content such as filename or URI string or giving some initial
+content bytes that can be used to guess.
+Using the
+<link linkend="raptor-new-parser-for-content"><function>raptor_new_parser_for_content()</function></link>
+function, all of these can be given as optional parameters, using NULL
+or 0 for undefined parameters. The constructor will then use as much of
+this information as possible.
+</para>
+<programlisting>
+ raptor_parser* rdf_parser;
+</programlisting>
+
+<para>Create a parser that reads the MIME Type for RDF/XML
+<literal>application/rdf+xml</literal>
+<programlisting>
+ rdf_parser = raptor_new_parser_for_content(world, NULL, "application/rdf+xml", NULL, 0, NULL);
+</programlisting>
+</para>
+
+<para>Create a parser that can read a syntax identified by the URI
+for Turtle <literal>http://www.dajobe.org/2004/01/turtle/</literal>,
+which has no registered MIME Type at this date:
+<programlisting>
+ syntax_uri = raptor_new_uri(world, "http://www.dajobe.org/2004/01/turtle/");
+
+ rdf_parser = raptor_new_parser_for_content(world, syntax_uri, NULL, NULL, 0, NULL);
+</programlisting>
+</para>
+
+<para>Create a parser that recognises the identifier <literal>foo.rss</literal>:
+<programlisting>
+ rdf_parser = raptor_new_parser_for_content(world, NULL, NULL, NULL, 0, "foo.rss");
+</programlisting>
+</para>
+
+<para>Create a parser that recognises the content in <emphasis>buffer</emphasis>:
+<programlisting>
+ rdf_parser = raptor_new_parser_for_content(world, NULL, NULL, buffer, len, NULL);
+</programlisting>
+</para>
+
+<para>Any of the constructor calls can return NULL if no matching
+parser could be found, or the construction failed in another way.
+</para>
+
+</section>
+
+
+<section id="tutorial-parser-features">
+<title>Parser options</title>
+
+<para>There are several
+<emphasis> options</emphasis> that can be set on parsers.
+The exact list of options can be found at run time via the
+<link linkend="tutorial-querying-functionality">Querying Functionality</link>
+or in the API reference for
+<link linkend="raptor-option"><literal>raptor_option</literal></link>.
+</para>
+
+<para>Options are integer enumerations of the
+ <link linkend="raptor-option"><type>raptor_option</type></link> enum and have
+typed values that are either booleans, integers or strings.
+The function that sets options for parsers is
+<link linkend="raptor-parser-set-option">raptor_parser_set_option()</link>
+used as follows:
+<programlisting>
+ /* Set a boolean or integer valued option to value 1 */
+ raptor_parser_set_option(rdf_parser, option, NULL, 1);
+
+ /* Set a string valued option to value "abc" */
+ raptor_parser_set_option(rdf_parser, option, "abc", -1);
+</programlisting>
+</para>
+
+<para>
+There is a corresponding function for reading the values of parser
+option
+<link linkend="raptor-parser-get-option"><function>raptor_parser_get_option()</function></link>
+which takes the option enumeration parameter and returns the boolean /
+integer or string value correspondingly into the appropriate pointer
+argument.
+<programlisting>
+ /* Get a boolean or integer option value */
+ int int_var;
+ raptor_parser_get_option(rdf_parser, option, NULL, &amp;int_var);
+
+ /* Get a string option value */
+ char* string_var;
+ raptor_parser_get_option(rdf_parser, option, &amp;string_var, NULL);
+</programlisting>
+</para>
+
+</section>
+
+
+<section id="tutorial-parser-set-triple-handler">
+<title>Set RDF statement callback handler</title>
+
+<para>The main reason to parse a syntax is to get RDF triples
+returned and these are return by a user-defined handler function
+which is called with parameters of a user data pointer and a
+raptor statement, which includes the triple terms plus the
+optional named graph term. The handler is set with
+<link linkend="raptor-parser-set-statement-handler"><function>raptor_parser_set_statement_handler()</function></link>
+as follows:
+<programlisting>
+ void
+ statement_handler(void* user_data, const raptor_statement* statement)
+ {
+ /* do something with the statement */
+ }
+
+ raptor_parser_set_statement_handler(rdf_parser, user_data, statements_handler);
+</programlisting>
+</para>
+
+<para>Setting a stateemnt handler function is optional since parsing
+without returning statements is a valid use, such as when parsing in
+order to validate a syntax.
+</para>
+</section>
+
+
+<section id="tutorial-parser-set-error-warning-handlers">
+<title>Set parsing log message handlers</title>
+
+<para>Any time before parsing is called, a log handler can be set
+on the world object via the
+<link linkend="raptor-world-set-log-handler"><function>raptor_world_set_log_handler()</function></link>
+method to report errors and warnings from parsing.
+The method takes a user data argument plus a handler callback of type
+<link linkend="raptor-log-handler"><type>raptor_log_handler</type></link>
+with a signature that looks like this:
+<programlisting>
+void
+message_handler(void *user_data, raptor_log_message* message)
+{
+ /* do something with the message */
+}
+</programlisting>
+The handler gets the user data pointer as well as a
+<link linkend="raptor-log-message"><type>raptor_log_handler</type></link>
+pointer that includes associated location information, such as the
+log level,
+<link linkend="raptor-locator"><type>raptor_locator</type></link>,
+and the log message itself. The <emphasis>locator</emphasis>
+structure contains full information on the details of where in the
+file or URI the message occurred.
+</para>
+
+</section>
+
+
+<section id="tutorial-parser-set-id-handler">
+<title>Set the identifier creator handler</title>
+
+<para>Identifiers are created in some parsers by generating them
+automatically or via hints given a syntax. Raptor can customise this
+process using a user-supplied identifier handler function.
+For example, in RDF/XML generated blank node identifiers and those
+those specified <literal>rdf:nodeID</literal> are passed through this
+process. Setting a handler allows the identifier generation mechanism to be
+fully replaced. A lighter alternative is to use
+<link linkend="raptor-world-set-generate-bnodeid-parameters"><function>raptor_world_set_generate_bnodeid_parameters()</function></link>
+to adjust the default algorithm for generated identifiers.
+</para>
+
+<para>It is used as follows
+<programlisting>
+ raptor_generate_bnodeid_handler bnodeid_handler;
+
+ raptor_world_set_generate_bnodeid_handler(rdf_parser, user_data, bnodeid_handler);
+</programlisting>
+</para>
+
+<para>The <emphasis>bnodeid_handler</emphasis> takes the following signature:
+<programlisting>
+unsigned char*
+generate_id_handler(void* user_data, unsigned char* user_id)
+{
+ /* return a new generated ID based on user_id (optional) */
+}
+</programlisting>
+where <emphasis>user_id</emphasis> an optional user-supplied identifier,
+such as the value of a <literal>rdf:nodeID</literal> in RDF/XML.
+</para>
+
+</section>
+
+
+<section id="tutorial-parser-set-namespace-handler">
+<title>Set namespace declared handler</title>
+
+<para>Raptor can report when namespace prefix/URIs are declared in
+during parsing a syntax such as those in XML, RDF/XML or Turtle.
+A handler function can be set to receive these declarations using
+the namespace handler method.
+<programlisting>
+ raptor_namespace_handler namespaces_handler;
+
+ raptor_parser_set_namespace_handler(rdf_parser, user_data, namespaces_handler);
+</programlisting>
+</para>
+
+<para>The <emphasis>namespaces_handler</emphasis> takes the following signature:
+<programlisting>
+void
+namespaces_handler(void* user_data, raptor_namespace *nspace)
+{
+ /* */
+}
+</programlisting>
+<note>This may be called multiple times with the same namespace,
+if the namespace is declared inside different XML sub-trees.
+</note>
+</para>
+
+</section>
+
+
+<section id="tutorial-parse-strictness">
+<title>Set the parsing strictness</title>
+<para>
+<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link>
+with option
+<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link>
+allows setting of the parser strictness flag. The default is lax parsing,
+accepting older or deprecated syntax forms but may generate a warning. Setting
+to non-0 (true) will cause parser errors to be generated in these cases.
+</para>
+</section>
+
+
+<section id="tutorial-parser-content">
+<title>Provide syntax content to parse</title>
+
+<para>The operation of turning syntax into RDF triples has several
+alternatives from functions that do most of the work starting from a
+URI to functions that allow passing in data buffers.</para>
+
+<note>
+<title>Parsing and MIME Types</title>
+The mime type of the retrieved content is not used to choose
+a parser unless the parser is of type <literal>guess</literal>.
+The guess parser will send an <literal>Accept:</literal> header
+for all known parser syntax mime types (if a URI request is made)
+and based on the response, including the identifiers used,
+pick the appropriate parser to execute. See
+<link linkend="raptor-world-guess-parser-name"><function>raptor_world_guess_parser_name()</function></link>
+for a full discussion of the inputs to the guessing.
+</note>
+
+
+<section id="parse-from-uri">
+<title>Parse the content from a URI (<link linkend="raptor-parser-parse-uri"><function>raptor_parser_parse_uri()</function></link>)</title>
+
+<para>The URI is resolved and the content read from it and passed to
+the parser:
+<programlisting>
+ raptor_parser_parse_uri(rdf_parser, uri, base_uri);
+</programlisting>
+The <emphasis>base_uri</emphasis> is optional (can be
+<literal>NULL</literal>) and will default to the
+<emphasis>uri</emphasis>.
+</para>
+</section>
+
+
+<section id="parse-from-www">
+<title>Parse the content of a URI using an existing WWW connection (<link linkend="raptor-parser-parse-uri-with-connection"><function>raptor_parser_parse_uri_with_connection()</function></link>)</title>
+
+<para>The URI is resolved using an existing WWW connection (for
+example a libcurl CURL handle) to allow for any existing
+WWW configuration to be reused. See
+<link linkend="raptor-new-www-with-connection"><function>raptor_new_www_with_connection</function></link>
+for full details of how this works. The content is then read from the
+result of resolving the URI:
+<programlisting>
+ raptor_parser_parse_uri_with_connection(rdf_parser, uri, base_uri, connection);
+</programlisting>
+The <emphasis>base_uri</emphasis> is optional (can be
+<literal>NULL</literal>) and will default to the
+<emphasis>uri</emphasis>.
+</para>
+</section>
+
+
+<section id="parse-from-filehandle">
+<title>Parse the content of a C <literal>FILE*</literal> (<link linkend="raptor-parser-parse-file-stream"><function>raptor_parser_parse_file_stream()</function></link>)</title>
+
+<para>Parsing can read from a C STDIO file handle:
+<programlisting>
+ stream = fopen(filename, "rb");
+ raptor_parser_parse_file_stream(rdf_parser, stream, filename, base_uri);
+ fclose(stream);
+</programlisting>
+This function can use take an optional <emphasis>filename</emphasis> which
+is used in locator error messages.
+The <emphasis>base_uri</emphasis> may be required by some parsers
+and if <literal>NULL</literal> will cause the parsing to fail.
+This requirement can be checked by looking at the flags in
+the parser description using
+<link linkend="raptor-world-get-parser-description"><function>raptor_world_get_parser_description()</function></link>.
+</para>
+</section>
+
+
+<section id="parse-from-file-uri">
+<title>Parse the content of a file URI (<link linkend="raptor-parser-parse-file"><function>raptor_parser_parse_file()</function></link>)</title>
+
+<para>Parsing can read from a URI known to be a <literal>file:</literal> URI:
+<programlisting>
+ raptor_parser_parse_file(rdf_parser, file_uri, base_uri);
+</programlisting>
+This function requires that the <emphasis>file_uri</emphasis> is
+a file URI, that is
+<literal>raptor_uri_uri_string_is_file_uri( raptor_uri_as_string( file_uri) )</literal>
+must be true.
+The <emphasis>base_uri</emphasis> may be required by some parsers
+and if <literal>NULL</literal> will cause the parsing to fail.
+</para>
+</section>
+
+
+<section id="parse-from-chunks">
+<title>Parse chunks of syntax content provided by the application (<link linkend="raptor-parser-parse-start"><function>raptor_parser_parse_start()</function></link> and <link linkend="raptor-parser-parse-chunk"><function>raptor_parser_parse_chunk()</function></link>)</title>
+
+<para>
+<programlisting>
+ raptor_parser_parse_start(rdf_parser, base_uri);
+ while(/* not finished getting content */) {
+ unsigned char *buffer;
+ size_t buffer_len;
+
+ /* ... obtain some syntax content in buffer of size buffer_len bytes ... */
+
+ raptor_parser_parse_chunk(rdf_parser, buffer, buffer_len, 0);
+ }
+ raptor_parser_parse_chunk(rdf_parser, NULL, 0, 1); /* no data and is_end = 1 */
+</programlisting>
+The <emphasis>base_uri</emphasis> argument to
+<link linkend="raptor-parser-parse-start"><function>raptor_parser_parse_start()</function></link>
+may be required by some parsers
+and if <literal>NULL</literal> will cause the parsing to fail.
+</para>
+
+<para>On the last
+<link linkend="raptor-parser-parse-chunk"><function>raptor_parser_parse_chunk()</function></link>
+call, or after the loop is ended, the <literal>is_end</literal>
+parameter must be set to non-0. Content can be passed with the
+final call. If no content is present at the end (such as in
+some kind of <quote>end of file</quote> situation), then a 0-length
+buffer_len or NULL buffer can be used.</para>
+
+<para>The minimal case is an entire parse in one chunk as follows:</para>
+<programlisting>
+ raptor_parser_parse_start(rdf_parser, base_uri);
+ raptor_parser_parse_chunk(rdf_parser, buffer, buffer_len, 1); /* is_end = 1 */
+</programlisting>
+
+</section>
+
+</section>
+
+
+<section id="restrict-parser-network-access">
+<title>Restrict parser network access</title>
+
+<para>
+Parsing can cause network requests to be performed, especially
+if a URI is given as an argument such as with
+<link linkend="raptor-parser-parse-uri"><function>raptor_parser_parse_uri()</function></link>
+however there may also be indirect requests such as with the
+GRDDL parser that retrieves URIs depending on the results of
+initial parse requests. The URIs requested may not be wanted
+to be fetched or need to be filtered, and this can be done in
+three ways.
+</para>
+
+<section id="tutorial-filter-network-with-feature">
+<title>Filtering parser network requests with option <link linkend="RAPTOR-OPTION-NO-NET:CAPS"><literal>RAPTOR_OPTION_NO_NET</literal></link></title>
+<para>
+The parser option
+<link linkend="RAPTOR-OPTION-NO-NET:CAPS"><literal>RAPTOR_OPTION_NO_NET</literal></link>
+can be set with
+<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link>
+and forbids all network requests. There is no customisation with
+this approach, for that see the URI filter in the next section.
+</para>
+
+<programlisting>
+ rdf_parser = raptor_new_parser(world, "rdfxml");
+
+ /* Disable internal network requests */
+ raptor_parser_set_option(rdf_parser, RAPTOR_OPTION_NO_NET, NULL, 1);
+</programlisting>
+
+</section>
+
+
+<section id="tutorial-filter-network-www-uri-filter">
+<title>Filtering parser network requests with <link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link></title>
+<para>
+The
+<link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link>
+
+allows setting of a filtering function to operate on all URIs
+retrieved by a WWW connection. This connection can be used in
+parsing when operated by hand.
+</para>
+
+<programlisting>
+void write_bytes_handler(raptor_www* www, void *user_data,
+ const void *ptr, size_t size, size_t nmemb) {
+{
+ raptor_parser* rdf_parser = (raptor_parser*)user_data;
+
+ raptor_parser_parse_chunk(rdf_parser, (unsigned char*)ptr, size*nmemb, 0);
+}
+
+int uri_filter(void* filter_user_data, raptor_uri* uri) {
+ /* return non-0 to forbid the request */
+}
+
+int main(int argc, char *argv[]) {
+ ...
+
+ rdf_parser = raptor_new_parser(world, "rdfxml");
+ www = raptor_new_www(world);
+
+ /* filter all URI requests */
+ raptor_www_set_uri_filter(www, uri_filter, filter_user_data);
+
+ /* make WWW write bytes to parser */
+ raptor_www_set_write_bytes_handler(www, write_bytes_handler, rdf_parser);
+
+ raptor_parser_parse_start(rdf_parser, uri);
+ raptor_www_fetch(www, uri);
+ /* tell the parser that we are done */
+ raptor_parser_parse_chunk(rdf_parser, NULL, 0, 1);
+
+ raptor_free_www(www);
+ raptor_free_parser(rdf_parser);
+
+ ...
+}
+
+</programlisting>
+
+</section>
+
+
+<section id="tutorial-filter-network-parser-uri-filter">
+<title>Filtering parser network requests with <link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link></title>
+
+<para>
+The
+<link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link>
+allows setting of a filtering function to operate on all URIs that
+the parser sees. This operates on the internal raptor_www object
+used inside parsing to retrieve URIs, similar to that described in
+the <link linkend="tutorial-filter-network-www-uri-filter">previous section</link>.
+</para>
+
+<programlisting>
+ int uri_filter(void* filter_user_data, raptor_uri* uri) {
+ /* return non-0 to forbid the request */
+ }
+
+ rdf_parser = raptor_new_parser(world, "rdfxml");
+
+ raptor_parser_set_uri_filter(rdf_parser, uri_filter, filter_user_data);
+
+ /* parse content as normal */
+ raptor_parser_parse_uri(rdf_parser, uri, base_uri);
+</programlisting>
+
+</section>
+
+
+<section id="tutorial-filter-network-parser-timeout">
+<title>Setting timeout for parser network requests with option <link linkend="RAPTOR-OPTION-WWW-TIMEOUT:CAPS"><literal>RAPTOR_OPTION_WWW_TIMEOUT</literal></link></title>
+
+<para>If the value of option
+<link linkend="RAPTOR-OPTION-WWW-TIMEOUT:CAPS"><literal>RAPTOR_OPTION_WWW_TIMEOUT</literal></link>
+if set to a number &gt;0, it is used as the timeout in seconds
+for retrieving of URIs during parsing (primarily for GRDDL).
+This uses
+<link linkend="raptor-www-set-connection-timeout"><function>raptor_www_set_connection_timeout()</function></link>
+internally.
+</para>
+
+<programlisting>
+ rdf_parser = raptor_new_parser(world, "grddl");
+
+ /* set internal URI retrieval maximum time to 5 seconds */
+ raptor_parser_set_option(rdf_parser, RAPTOR_OPTION_WWW_TIMEOUT, NULL, 5);
+</programlisting>
+
+</section>
+
+
+</section>
+
+
+<section id="tutorial-parser-static-info">
+<title>Querying parser static information</title>
+
+<para>
+These methods return information about the constructed parser
+implementation corresponding to the information available
+via <link linkend="raptor-world-get-parser-description"><function>raptor_world_get_parser_description()</function></link>
+for all parsers.
+</para>
+
+<para><link linkend="raptor-parser-get-name"><function>raptor_parser_get_name()</function></link> returns the parser syntax name,
+<link linkend="raptor-parser-get-description"><function>raptor_parser_get_description()</function></link>
+returns more detailed description fields including the long label and
+mime_types for the parser with quality levels.
+</para>
+
+<para><link linkend="raptor-parser-get-accept-header"><function>raptor_parser_get_accept_header()</function></link>
+returns a string that would be sent in an HTTP
+request <code>Accept:</code> header for the syntaxes accepted by this
+parser only.
+</para>
+
+</section>
+
+
+<section id="tutorial-parser-runtime-info">
+<title>Querying parser run-time information</title>
+
+<para>
+<link linkend="raptor-parser-get-locator"><function>raptor_parser_get_locator()</function></link>
+returns the <link linkend="raptor-locator"><type>raptor_locator</type></link>
+for the current position in the input stream. The <emphasis>locator</emphasis>
+structure contains full information on the details of where in the
+file or URI the current parser has reached.
+</para>
+</section>
+
+
+<section id="tutorial-parser-abort">
+<title>Aborting parsing</title>
+
+<para>
+<link linkend="raptor-parser-parse-abort"><function>raptor_parser_parse_abort()</function></link>
+allows the current parsing to be aborted, at which point no further
+triples will be passed to callbacks and the parser will attempt to
+return control to the application. This is most useful when called
+inside a handler function which allows the application to decide to stop
+an active parsing.
+</para>
+</section>
+
+
+<section id="tutorial-parser-destroy">
+<title>Destroy the parser</title>
+
+<para>
+To tidy up, delete the parser object as follows:
+<programlisting>
+ raptor_free_parser(rdf_parser);
+</programlisting>
+</para>
+
+</section>
+
+
+<section id="tutorial-parser-example">
+<title>Parsing example code</title>
+
+<example id="raptor-example-rdfprint">
+<title><filename>rdfprint.c</filename>: Parse an RDF/XML file and print the triples</title>
+<programlisting>
+<xi:include href="rdfprint.c" parse="text"/>
+</programlisting>
+
+<para>Compile it like this:
+<screen>
+$ gcc -o rdfprint rdfprint.c `pkg-config raptor2 --cflags --libs`
+</screen>
+and run it on an RDF file as:
+<screen>
+$ ./rdfprint raptor.rdf
+_:genid1 &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt; &lt;http://usefulinc.com/ns/doap#Project&gt; .
+_:genid1 &lt;http://usefulinc.com/ns/doap#name&gt; "Raptor" .
+_:genid1 &lt;http://usefulinc.com/ns/doap#homepage&gt; &lt;http://librdf.org/raptor/&gt; .
+...
+</screen>
+</para>
+
+</example>
+
+</section>
+
+</chapter>
+
+
+<!--
+Local variables:
+mode: sgml
+sgml-parent-document: ("raptor-docs.xml" "book" "part")
+End:
+-->