summaryrefslogtreecommitdiffstats
path: root/docs/raptor-parsers.xml
diff options
context:
space:
mode:
Diffstat (limited to 'docs/raptor-parsers.xml')
-rw-r--r--docs/raptor-parsers.xml262
1 files changed, 262 insertions, 0 deletions
diff --git a/docs/raptor-parsers.xml b/docs/raptor-parsers.xml
new file mode 100644
index 0000000..de4593a
--- /dev/null
+++ b/docs/raptor-parsers.xml
@@ -0,0 +1,262 @@
+<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+ "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd">
+<chapter id="raptor-parsers">
+<title>Parsers in Raptor (syntax to triples)</title>
+
+<section id="raptor-parsers-intro">
+<title>Introduction</title>
+
+<para>This section describes the parsers that can be compiled into
+Raptor and their options. The exact parsers supported may vary
+by different builds of raptor and can be queried at run-time by
+use of the
+<link linkend="raptor-world-get-parser-description"><function>raptor_world_get_parser_description</function></link>
+function</para>
+
+<para>The options that may be set on parsers can also
+be queried at run-time with the
+<link linkend="raptor-world-get-option-description"><function>raptor_world_get_option_description</function></link>
+function.</para>
+
+</section>
+
+
+<section id="parser-grddl">
+<title>GRDDL parser (name <literal>grddl</literal>)</title>
+<para>A parser for the
+<ulink url="http://www.w3.org/TR/2007/PR-grddl-20070716/">Gleaning Resource Descriptions from Dialects of Languages (GRDDL)</ulink>,
+W3C Proposed Recommendation of 2007-07-16 which allows reading XHTML
+and XML as RDF triples by using profiles in the document that declare
+XSLT transforms from the XHTML or XML content into RDF/XML or other
+RDF syntax which can then be parsed.</para>
+
+<para>The GRDDL parser is rather complex and different from the other
+parsers in that it retrieves URIs, reads HTML documents (possibly
+with errors), transforms the documents with XSLT and turns the result
+into a single graph. The default configuration of the GRDDL parser
+also reads microformats (hcard, hcalendar) and follows &lt;link&gt;
+tags that point to RDF/XML. Parts of the GRDDL process can be
+altered by configuration, which are describe below.
+</para>
+
+<para>The GRDDL parser defines 'base', 'Base' and 'url' XSLT parameters
+with the value of the base URI to allow some XSLT sheets to work. These
+set of parameters cannot be disabled.
+</para>
+
+<para>If the XSLT transform returns an empty string, no further
+processing of the result is done, and a warning is generated. The
+xsl:output method is mapped to result document mime types as follows:
+'text' to text/plain; 'xml' to application/xml and 'html' to text/html.
+Any result that is of type 'application/xml' or unknown mime type
+is assumed to be RDF/XML.
+</para>
+
+<para>The URIs that are processed during GRDDL operations can be checked
+and skipped if required using a handler set with the
+<link linkend="raptor-parser-set-uri-filter"><function>raptor_parser_set_uri_filter()</function></link>
+function. If the handler returns non-0, the URI is rejected.
+This uses
+<link linkend="raptor-www-set-uri-filter"><function>raptor_www_set_uri_filter()</function></link>
+internally.
+</para>
+
+<para>If the value of option
+<link linkend="RAPTOR-OPTION-WWW-TIMEOUT:CAPS"><literal>RAPTOR_OPTION_WWW_TIMEOUT</literal></link>
+if set to a number &gt;0, it is used as the timeout in seconds
+for retrieving of URIs during GRDDL processing.
+This uses
+<link linkend="raptor-www-set-connection-timeout"><function>raptor_www_set_connection_timeout()</function></link>
+internally.
+</para>
+
+<para>The hardcoded support for hcard and hcalendar
+microformats can be disabled by setting parser option
+<link linkend="RAPTOR-OPTION-MICROFORMATS:CAPS"><literal>RAPTOR_OPTION_MICROFORMATS</literal></link>
+to 0
+or using
+<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link>
+with option
+<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link>
+and a boolean value of 1.
+</para>
+
+<para>The GRDDL parser by default will try an XML parser on the
+content followed by a lax HTML parser. This can be disabled by
+setting parser option
+<link linkend="RAPTOR-OPTION-HTML-TAG-SOUP:CAPS"><literal>RAPTOR_OPTION_HTML_TAG_SOUP</literal></link>
+to 0
+or using
+<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link>
+with option
+<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link>
+and a boolean value of 1.
+</para>
+
+<para>The GRDDL parser by default will try to look for an HTML
+&lt;link&gt; tag that points to RDF/XML. This can be disabled by
+setting parser option
+<link linkend="RAPTOR-OPTION-HTML-LINK:CAPS"><literal>RAPTOR_OPTION_HTML_LINK</literal></link>
+to 0
+or using
+<link linkend="raptor-parser-set-option"><function>raptor_parser_set_option()</function></link>
+with option
+<link linkend="RAPTOR-OPTION-STRICT:CAPS"><literal>RAPTOR_OPTION_STRICT</literal></link>
+and a boolean value of 1.
+</para>
+
+</section>
+
+
+<section id="parser-guess">
+<title>Guess parser (name <literal>guess</literal>)</title>
+<para>
+This is a special parser that picks the actual parser to use based
+on the content type, the content bytes or the content identifier. The
+content name can be either from a local file or from a URI.
+</para>
+
+<para>If the protocol that delivered the content (such as HTTP)
+provided a <emphasis>Content Type</emphasis> (aka MIME Type) then
+this will be the primary means for identifying th ecotnent.
+</para>
+
+<para>The secondary means to identify the content are the bytes of
+the content (if available), otherwise the content identifier is used,
+which is the least reliable.
+</para>
+
+</section>
+
+
+<section id="parser-json">
+<title>JSON parser (name <literal>json</literal>)</title>
+
+<para>A parser for both the
+resource-centric RDF/JSON syntax as defined by Talis at
+<ulink url="http://n2.talis.com/wiki/RDF_JSON_Specification">RDF/JSON Specification</ulink>
+and the triples-centric format based on the SPARQL results in JSON format.
+</para>
+
+</section>
+
+
+<section id="parser-ntriples">
+<title>N-Triples parser (name <literal>ntriples</literal>)</title>
+
+<para>A parser for the
+<ulink url="http://www.w3.org/TR/rdf-testcases/#ntriples">N-Triples</ulink>
+syntax as used by the
+<ulink url="http://www.w3.org/2001/sw/RDFCore/">W3C RDF Core working group</ulink>
+for the <ulink url="http://www.w3.org/TR/rdf-testcases/">RDF Test Cases</ulink>.
+</para>
+
+</section>
+
+
+<section id="parser-rdfa">
+<title>RDFa parser - (name <literal>rdfa</literal>)</title>
+<para>
+A parser for the
+<ulink url="http://www.w3.org/TR/2008/CR-rdfa-syntax-20080620/">RDFa</ulink>
+syntax, W3C Candidate Recommendation 20 June 2008 which allows reading XHTML
+and XML as RDF triples by interpreting attributes on elements to
+describe which ones have RDF semantics. This is implemented via
+<ulink url="http://rdfa.digitalbazaar.com/librdfa/">librdfa</ulink>
+linked inside Raptor, written by Manu Sporny of Digital Bazaar,
+and licensed with the same license as Raptor.
+</para>
+
+<para>
+This parser is beta quality and passes all but 4 of the RDFa tests as
+of Raptor 1.4.18.
+</para>
+
+</section>
+
+
+<section id="parser-rdfxml">
+<title>RDF/XML parser - default (name <literal>rdfxml</literal>)</title>
+<para>
+A parser for the standard
+<ulink url="http://www.w3.org/TR/rdf-syntax-grammar/">RDF/XML syntax</ulink>
+as revised by the
+<ulink url="http://www.w3.org/2001/sw/RDFCore/">W3C RDF Core working group</ulink>.</para>
+
+<para>This is the default parser in Raptor.</para>
+
+<para>Features of this parser:</para>
+<itemizedlist>
+<listitem><para>Fully handles the <ulink url="http://www.w3.org/TR/rdf-syntax-grammar/">RDF/XML syntax updates</ulink> for <ulink url="http://www.w3.org/TR/xmlbase/">XML Base</ulink>, <literal>xml:lang</literal>, RDF datatyping and Collections.</para></listitem>
+
+<listitem><para>Handles all RDF vocabularies such as <ulink url="http://www.foaf-project.org/">FOAF</ulink>, <ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink>, <ulink url="http://dublincore.org/">Dublin Core</ulink>, <ulink url="http://www.w3.org/TR/owl-features/">OWL</ulink>, <ulink url="http://usefulinc.com/doap">DOAP</ulink></para></listitem>
+
+<listitem><para>Handles <literal>rdf:resource</literal> / <literal>resource</literal> attributes</para></listitem>
+
+<listitem><para>Uses <ulink url="http://expat.sourceforge.net/">expat</ulink> and/or (GNOME) <ulink url="http://xmlsoft.org/">libxml</ulink> XML parsers as available or required</para></listitem>
+
+</itemizedlist>
+
+</section>
+
+
+<section id="parser-rss-tag-soup">
+<title>RSS Tag Soup parser (name <literal>rss-tag-soup</literal>)</title>
+
+<para>A parser for the multiple XML RSS formats that use the elements
+such as <literal>channel</literal>, <literal>item</literal>,
+<literal>title</literal>, <literal>description</literal>
+in different ways.
+This includes support for the Atom 1.0 syndication format defined in IETF
+<ulink url="http://www.ietf.org/rfc/rfc4287.txt">RFC 4287</ulink>
+</para>
+
+<para>The parser attempts to turn the input into
+<ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink>
+RDF triples in the RSS 1.0 model of a syndication feed.
+This includes triples for RSS Enclosures.
+</para>
+
+<para>
+True <ulink url="http://www.purl.org/rss/1.0/">RSS 1.0</ulink> when
+wanted to be used as a full RDF vocabulary, is best parsed by the
+RDF/XML parser (name <literal>rdfxml</literal>).
+</para>
+
+</section>
+
+
+<section id="parser-trig">
+<title>TRiG parser (name <literal>trig</literal>)</title>
+
+<para>A parser for the
+<ulink url="http://www.wiwiss.fu-berlin.de/suhl/bizer/TriG/Spec/">TriG - Turtle with Named Graphs</ulink>
+syntax.
+</para>
+
+<para>The parser is alpha quality and may not support the entire TRiG
+specification.</para>
+
+</section>
+
+
+<section id="parser-turtle">
+<title>Turtle Terse RDF Triple Language parser (name <literal>turtle</literal>)</title>
+
+<para>A parser for the
+<ulink url="http://www.dajobe.org/2004/01/turtle/">Turtle Terse RDF Triple Language</ulink>
+syntax, designed as a useful subset of
+<ulink url="http://www.w3.org/DesignIssues/Notation3">Notation 3</ulink>.
+</para>
+
+</section>
+
+
+</chapter>
+
+<!--
+Local variables:
+mode: sgml
+sgml-parent-document: ("raptor-docs.xml" "book" "part")
+End:
+-->