diff options
Diffstat (limited to 'docs/raptor-parsers.xml')
-rw-r--r-- | docs/raptor-parsers.xml | 262 |
1 files changed, 262 insertions, 0 deletions
diff --git a/docs/raptor-parsers.xml b/docs/raptor-parsers.xml new file mode 100644 index 0000000..de4593a --- /dev/null +++ b/docs/raptor-parsers.xml @@ -0,0 +1,262 @@ +<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" + "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd"> +<chapter id="raptor-parsers"> +<title>Parsers in Raptor (syntax to triples)</title> + +<section id="raptor-parsers-intro"> +<title>Introduction</title> + +<para>This section describes the parsers that can be compiled into +Raptor and their options. The exact parsers supported may vary +by different builds of raptor and can be queried at run-time by +use of the +<link linkend="raptor-world-get-parser-description"><function>raptor_world_get_parser_description</function></link> +function</para> + +<para>The options that may be set on parsers can also +be queried at run-time with the +<link linkend="raptor-world-get-option-description"><function>raptor_world_get_option_description</function></link> +function.</para> + +</section> + + +<section id="parser-grddl"> +<title>GRDDL parser (name <literal>grddl</literal>)</title> +<para>A parser for the +<ulink url="http://www.w3.org/TR/2007/PR-grddl-20070716/">Gleaning Resource Descriptions from Dialects of Languages (GRDDL)</ulink>, +W3C Proposed Recommendation of 2007-07-16 which allows reading XHTML +and XML as RDF triples by using profiles in the document that declare +XSLT transforms from the XHTML or XML content into RDF/XML or other +RDF syntax which can then be parsed.</para> + +<para>The GRDDL parser is rather complex and different from the other +parsers in that it retrieves URIs, reads HTML documents (possibly +with errors), transforms the documents with XSLT and turns the result +into a single graph. The default configuration of the GRDDL parser +also reads microformats (hcard, hcalendar) and follows <link> +tags that point to RDF/XML. Parts of the GRDDL process can be +altered by configuration, which are describe below. +</para> + +<para>The GRDDL parser defines 'base', 'Base' and 'url' XSLT parameters +with the value of the base URI to allow some XSLT sheets to work. These +set of parameters cannot be disabled. +</para> + +<para>If the XSLT transform returns an empty string, no further +processing of the result is done, and a warning is generated. The +xsl:output method is mapped to result document mime types as follows: +'text' to text/plain; 'xml' to application/xml and 'html' to text/html. +Any result that is of type 'application/xml' or unknown mime type +is assumed to be RDF/XML. +</para> + +<para>The URIs that are processed during GRDDL operations can be checked +and skipped if required using a handler set with the +<link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link> +function. If the handler returns non-0, the URI is rejected. +This uses +<link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link> +internally. +</para> + +<para>If the value of option +<link linkend="RAPTOR-OPTION-WWW-TIMEOUT:CAPS"><literal>RAPTOR_OPTION_WWW_TIMEOUT</literal></link> +if set to a number >0, it is used as the timeout in seconds +for retrieving of URIs during GRDDL processing. +This uses +<link linkend="raptor-www-set-connection-timeout"><function>raptor_www_set_connection_timeout()</function></link> +internally. +</para> + +<para>The hardcoded support for hcard and hcalendar +microformats can be disabled by setting parser option +<link linkend="RAPTOR-OPTION-MICROFORMATS:CAPS"><literal>RAPTOR_OPTION_MICROFORMATS</literal></link> +to 0 +or using +<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link> +with option +<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link> +and a boolean value of 1. +</para> + +<para>The GRDDL parser by default will try an XML parser on the +content followed by a lax HTML parser. This can be disabled by +setting parser option +<link linkend="RAPTOR-OPTION-HTML-TAG-SOUP:CAPS"><literal>RAPTOR_OPTION_HTML_TAG_SOUP</literal></link> +to 0 +or using +<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link> +with option +<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link> +and a boolean value of 1. +</para> + +<para>The GRDDL parser by default will try to look for an HTML +<link> tag that points to RDF/XML. This can be disabled by +setting parser option +<link linkend="RAPTOR-OPTION-HTML-LINK:CAPS"><literal>RAPTOR_OPTION_HTML_LINK</literal></link> +to 0 +or using +<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link> +with option +<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link> +and a boolean value of 1. +</para> + +</section> + + +<section id="parser-guess"> +<title>Guess parser (name <literal>guess</literal>)</title> +<para> +This is a special parser that picks the actual parser to use based +on the content type, the content bytes or the content identifier. The +content name can be either from a local file or from a URI. +</para> + +<para>If the protocol that delivered the content (such as HTTP) +provided a <emphasis>Content Type</emphasis> (aka MIME Type) then +this will be the primary means for identifying th ecotnent. +</para> + +<para>The secondary means to identify the content are the bytes of +the content (if available), otherwise the content identifier is used, +which is the least reliable. +</para> + +</section> + + +<section id="parser-json"> +<title>JSON parser (name <literal>json</literal>)</title> + +<para>A parser for both the +resource-centric RDF/JSON syntax as defined by Talis at +<ulink url="http://n2.talis.com/wiki/RDF_JSON_Specification">RDF/JSON Specification</ulink> +and the triples-centric format based on the SPARQL results in JSON format. +</para> + +</section> + + +<section id="parser-ntriples"> +<title>N-Triples parser (name <literal>ntriples</literal>)</title> + +<para>A parser for the +<ulink url="http://www.w3.org/TR/rdf-testcases/#ntriples">N-Triples</ulink> +syntax as used by the +<ulink url="http://www.w3.org/2001/sw/RDFCore/">W3C RDF Core working group</ulink> +for the <ulink url="http://www.w3.org/TR/rdf-testcases/">RDF Test Cases</ulink>. +</para> + +</section> + + +<section id="parser-rdfa"> +<title>RDFa parser - (name <literal>rdfa</literal>)</title> +<para> +A parser for the +<ulink url="http://www.w3.org/TR/2008/CR-rdfa-syntax-20080620/">RDFa</ulink> +syntax, W3C Candidate Recommendation 20 June 2008 which allows reading XHTML +and XML as RDF triples by interpreting attributes on elements to +describe which ones have RDF semantics. This is implemented via +<ulink url="http://rdfa.digitalbazaar.com/librdfa/">librdfa</ulink> +linked inside Raptor, written by Manu Sporny of Digital Bazaar, +and licensed with the same license as Raptor. +</para> + +<para> +This parser is beta quality and passes all but 4 of the RDFa tests as +of Raptor 1.4.18. +</para> + +</section> + + +<section id="parser-rdfxml"> +<title>RDF/XML parser - default (name <literal>rdfxml</literal>)</title> +<para> +A parser for the standard +<ulink url="http://www.w3.org/TR/rdf-syntax-grammar/">RDF/XML syntax</ulink> +as revised by the +<ulink url="http://www.w3.org/2001/sw/RDFCore/">W3C RDF Core working group</ulink>.</para> + +<para>This is the default parser in Raptor.</para> + +<para>Features of this parser:</para> +<itemizedlist> +<listitem><para>Fully handles the <ulink url="http://www.w3.org/TR/rdf-syntax-grammar/">RDF/XML syntax updates</ulink> for <ulink url="http://www.w3.org/TR/xmlbase/">XML Base</ulink>, <literal>xml:lang</literal>, RDF datatyping and Collections.</para></listitem> + +<listitem><para>Handles all RDF vocabularies such as <ulink url="http://www.foaf-project.org/">FOAF</ulink>, <ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink>, <ulink url="http://dublincore.org/">Dublin Core</ulink>, <ulink url="http://www.w3.org/TR/owl-features/">OWL</ulink>, <ulink url="http://usefulinc.com/doap">DOAP</ulink></para></listitem> + +<listitem><para>Handles <literal>rdf:resource</literal> / <literal>resource</literal> attributes</para></listitem> + +<listitem><para>Uses <ulink url="http://expat.sourceforge.net/">expat</ulink> and/or (GNOME) <ulink url="http://xmlsoft.org/">libxml</ulink> XML parsers as available or required</para></listitem> + +</itemizedlist> + +</section> + + +<section id="parser-rss-tag-soup"> +<title>RSS Tag Soup parser (name <literal>rss-tag-soup</literal>)</title> + +<para>A parser for the multiple XML RSS formats that use the elements +such as <literal>channel</literal>, <literal>item</literal>, +<literal>title</literal>, <literal>description</literal> +in different ways. +This includes support for the Atom 1.0 syndication format defined in IETF +<ulink url="http://www.ietf.org/rfc/rfc4287.txt">RFC 4287</ulink> +</para> + +<para>The parser attempts to turn the input into +<ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink> +RDF triples in the RSS 1.0 model of a syndication feed. +This includes triples for RSS Enclosures. +</para> + +<para> +True <ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink> when +wanted to be used as a full RDF vocabulary, is best parsed by the +RDF/XML parser (name <literal>rdfxml</literal>). +</para> + +</section> + + +<section id="parser-trig"> +<title>TRiG parser (name <literal>trig</literal>)</title> + +<para>A parser for the +<ulink url="http://www.wiwiss.fu-berlin.de/suhl/bizer/TriG/Spec/">TriG - Turtle with Named Graphs</ulink> +syntax. +</para> + +<para>The parser is alpha quality and may not support the entire TRiG +specification.</para> + +</section> + + +<section id="parser-turtle"> +<title>Turtle Terse RDF Triple Language parser (name <literal>turtle</literal>)</title> + +<para>A parser for the +<ulink url="http://www.dajobe.org/2004/01/turtle/">Turtle Terse RDF Triple Language</ulink> +syntax, designed as a useful subset of +<ulink url="http://www.w3.org/DesignIssues/Notation3">Notation 3</ulink>. +</para> + +</section> + + +</chapter> + +<!-- +Local variables: +mode: sgml +sgml-parent-document: ("raptor-docs.xml" "book" "part") +End: +--> |