Posted to commits@jena.apache.org by bu...@apache.org on 2011/10/10 00:33:54 UTC

svn commit: r796822 - /websites/staging/jena/trunk/content/jena/documentation/larq/index.html

Author: buildbot
Date: Sun Oct  9 22:33:54 2011
New Revision: 796822

Log:
Staging update by buildbot

Modified:
    websites/staging/jena/trunk/content/jena/documentation/larq/index.html

Modified: websites/staging/jena/trunk/content/jena/documentation/larq/index.html
==============================================================================
--- websites/staging/jena/trunk/content/jena/documentation/larq/index.html (original)
+++ websites/staging/jena/trunk/content/jena/documentation/larq/index.html Sun Oct  9 22:33:54 2011
@@ -138,9 +138,143 @@
 
   <div id="content">
     <h1 class="title">LARQ - adding text search indexes to Jena</h1>
-    <p>Documentation is in the process of being ported from the previous
-OpenJena site. In the meantime, please refer to the
-<a href="http://openjena.org/ARQ/lucene-arq.html">LARQ web page</a></p>
+    <p>WARNING: LARQ used to be included with ARQ, but we are making it a separate module, so some changes are coming. Please bear with us while things settle down.
+If you want it, the old LARQ documentation is still available <a href="http://openjena.org/ARQ/lucene-arq.html">here</a>.</p>
+
+
+<p>LARQ is a combination of <a href="../query/index.html">ARQ</a> and <a href="http://lucene.apache.org/java/docs/index.html">Lucene</a>. It gives users the ability to perform free text searches within their SPARQL queries. Lucene indexes are additional information for accessing the RDF graph, not storage for the graph itself.</p>
+<p>Some example code is available here: <a href="https://svn.apache.org/repos/asf/incubator/jena/Jena2/LARQ/trunk/src/test/java/org/apache/jena/larq/examples/">https://svn.apache.org/repos/asf/incubator/jena/Jena2/LARQ/trunk/src/test/java/org/apache/jena/larq/examples/</a>.</p>
+<p>Two helper commands are provided: <code>larq.larqbuilder</code> and <code>larq.larq</code> used respectively for updating and querying LARQ indexes.</p>
+<p>A full description of the free text query language syntax is given in the <a href="http://lucene.apache.org/java/3_0_0/queryparsersyntax.html">Lucene query syntax</a> document.</p>
+<h3 id="usage_patterns">Usage Patterns</h3>
+<p>There are three basic usage patterns supported:</p>
+<ul>
+<li>Pattern 1: index string literals. The index will return the literals matching the Lucene search pattern.</li>
+<li>Pattern 2: index subject resources by string literal. The index returns the subjects with property value matching a text query.</li>
+<li>Pattern 3: index graph nodes based on strings not present in the graph.</li>
+</ul>
+<p>Patterns 1 and 2 have the indexed content in the graph. Both 1 and 2 can be modified by specifying a property so that only values of a given property are indexed. Pattern 2 is less flexible, as discussed below. Pattern 3 is covered in the "External Content" section below.</p>
+<p>LARQ can be used in other ways as well, but these are the patterns for which classes are supplied. In both patterns 1 and 2, the strings indexed are plain strings, strings with any language tag, and literals with datatype XSD string.</p>
+<h3 id="index_creation">Index Creation</h3>
+<p>There are many ways to use Lucene, which can be set up to handle particular features or languages. The index is created outside of the ARQ query system proper and is only accessed at query time. LARQ includes some platform classes and also utility classes to create indexes on string literals for the use cases above. Indexing can be performed as the graph is read in, or the index can be built from an existing graph.</p>
+<h2 id="index_builders">Index Builders</h2>
+<p>An index builder is a class to create a Lucene index from RDF data.</p>
+<ul>
+<li><code>IndexBuilderString</code>: This is the most commonly used index builder. It indexes plain literals (with or without language tags) and XSD strings and stores the complete literal. Optionally, a property can be supplied which restricts indexing to strings in statements using that property.</li>
+<li><code>IndexBuilderSubject</code>: Indexes the subject resource by a string literal and stores the subject resource, possibly restricted by a specified property.</li>
+</ul>
+<p>Lucene has many ways to create indexes and the index builder classes do not attempt to provide all possible Lucene features. Applications may need to extend or modify the standard index builders provided by LARQ.</p>
+<h2 id="index_creation_1">Index Creation</h2>
+<p>An index can be built while reading RDF into a model:</p>
+<div class="codehilite"><pre><span class="sr">//</span> <span class="o">--</span> <span class="n">Read</span> <span class="ow">and</span> <span class="nb">index</span> <span class="n">all</span> <span class="n">literal</span> <span class="n">strings</span><span class="o">.</span>
+<span class="n">IndexBuilderString</span> <span class="n">larqBuilder</span> <span class="o">=</span> <span class="k">new</span> <span class="n">IndexBuilderString</span><span class="p">()</span> <span class="p">;</span>
+
+<span class="sr">//</span> <span class="o">--</span> <span class="n">Index</span> <span class="n">statements</span> <span class="n">as</span> <span class="n">they</span> <span class="n">are</span> <span class="n">added</span> <span class="n">to</span> <span class="n">the</span> <span class="n">model</span><span class="o">.</span>
+<span class="n">model</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="n">larqBuilder</span><span class="p">)</span> <span class="p">;</span>
+
+<span class="n">FileManager</span><span class="o">.</span><span class="n">get</span><span class="p">()</span><span class="o">.</span><span class="n">readModel</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">datafile</span><span class="p">)</span> <span class="p">;</span>
+
+<span class="sr">//</span> <span class="o">--</span> <span class="n">Finish</span> <span class="n">indexing</span>
+<span class="n">larqBuilder</span><span class="o">.</span><span class="n">closeWriter</span><span class="p">()</span> <span class="p">;</span>
+<span class="n">model</span><span class="o">.</span><span class="n">unregister</span><span class="p">(</span><span class="n">larqBuilder</span><span class="p">)</span> <span class="p">;</span>
+
+<span class="sr">//</span> <span class="o">--</span> <span class="n">Create</span> <span class="n">the</span> <span class="n">access</span> <span class="nb">index</span>  
+<span class="n">IndexLARQ</span> <span class="nb">index</span> <span class="o">=</span> <span class="n">larqBuilder</span><span class="o">.</span><span class="n">getIndex</span><span class="p">()</span> <span class="p">;</span>
+</pre></div>
+
+
+<p>or it can be created from an existing model:</p>
+<div class="codehilite"><pre><span class="sr">//</span> <span class="o">--</span> <span class="n">Create</span> <span class="n">an</span> <span class="nb">index</span> <span class="n">based</span> <span class="n">on</span> <span class="n">existing</span> <span class="n">statements</span>
+<span class="n">larqBuilder</span><span class="o">.</span><span class="n">indexStatements</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">listStatements</span><span class="p">())</span> <span class="p">;</span>
+<span class="sr">//</span> <span class="o">--</span> <span class="n">Finish</span> <span class="n">indexing</span>
+<span class="n">larqBuilder</span><span class="o">.</span><span class="n">closeWriter</span><span class="p">()</span> <span class="p">;</span>
+<span class="sr">//</span> <span class="o">--</span> <span class="n">Create</span> <span class="n">the</span> <span class="n">access</span> <span class="nb">index</span>  
+<span class="n">IndexLARQ</span> <span class="nb">index</span> <span class="o">=</span> <span class="n">larqBuilder</span><span class="o">.</span><span class="n">getIndex</span><span class="p">()</span> <span class="p">;</span>
+</pre></div>
+
+
+<h3 id="index_registration">Index Registration</h3>
+<p>Next the index is made available to ARQ. This can be done globally:</p>
+<div class="codehilite"><pre><span class="sr">//</span> <span class="o">--</span> <span class="n">Make</span> <span class="n">globally</span> <span class="n">available</span>
+<span class="n">LARQ</span><span class="o">.</span><span class="n">setDefaultIndex</span><span class="p">(</span><span class="nb">index</span><span class="p">)</span> <span class="p">;</span>
+</pre></div>
+
+
+<p>or it can be set on a per-query execution basis.</p>
+<div class="codehilite"><pre><span class="n">QueryExecution</span> <span class="n">qExec</span> <span class="o">=</span> <span class="n">QueryExecutionFactory</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span> <span class="p">;</span>
+<span class="sr">//</span> <span class="o">--</span> <span class="n">Make</span> <span class="n">available</span> <span class="n">to</span> <span class="n">this</span> <span class="n">query</span> <span class="n">execution</span> <span class="n">only</span>
+<span class="n">LARQ</span><span class="o">.</span><span class="n">setDefaultIndex</span><span class="p">(</span><span class="n">qExec</span><span class="o">.</span><span class="n">getContext</span><span class="p">(),</span> <span class="nb">index</span><span class="p">)</span> <span class="p">;</span>
+</pre></div>
+
+
+<p>In both these cases, the default index is set, which is the one expected by the property function <code>pf:textMatch</code>. Multiple indexes can be used in the same query by introducing new properties. The application can subclass the search class <code>org.apache.jena.larq.LuceneSearch</code> to set different indexes with different property names.</p>
+<h3 id="query_using_a_lucene_index">Query using a Lucene index</h3>
+<p>Query execution is as usual using the property function pf:textMatch. "textMatch" can be thought of as an implied relationship in the data. Note the prefix ends in "#".</p>
+<div class="codehilite"><pre><span class="n">String</span> <span class="n">queryString</span> <span class="o">=</span> <span class="n">StringUtils</span><span class="o">.</span><span class="nb">join</span><span class="p">(</span><span class="s">&quot;\n&quot;</span><span class="p">,</span> <span class="k">new</span> <span class="n">String</span><span class="o">[]</span><span class="p">{</span>
+        <span class="s">&quot;PREFIX pf: &lt;http://jena.hpl.hp.com/ARQ/property#&gt;&quot;</span><span class="p">,</span>
+        <span class="s">&quot;SELECT * {&quot;</span> <span class="p">,</span>
+        <span class="s">&quot;    ?lit pf:textMatch &#39;+text&#39;&quot;</span><span class="p">,</span>
+        <span class="s">&quot;}&quot;</span>
+    <span class="p">})</span> <span class="p">;</span>
+<span class="n">Query</span> <span class="n">query</span> <span class="o">=</span> <span class="n">QueryFactory</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">queryString</span><span class="p">)</span> <span class="p">;</span>
+<span class="n">QueryExecution</span> <span class="n">qExec</span> <span class="o">=</span> <span class="n">QueryExecutionFactory</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span> <span class="p">;</span>
+<span class="n">ResultSetFormatter</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="n">System</span><span class="o">.</span><span class="n">out</span><span class="p">,</span> <span class="n">qExec</span><span class="o">.</span><span class="n">execSelect</span><span class="p">(),</span> <span class="n">query</span><span class="p">)</span> <span class="p">;</span>
+</pre></div>
+
+
+<p>The subjects with a property value of the matched literals can be retrieved by looking up the literals in the model:</p>
+<div class="codehilite"><pre><span class="n">PREFIX</span> <span class="n">pf:</span> <span class="sr">&lt;http://jena.hpl.hp.com/ARQ/property#&gt;</span>
+<span class="n">SELECT</span> <span class="p">?</span><span class="n">doc</span>
+<span class="p">{</span>
+    <span class="p">?</span><span class="n">lit</span> <span class="n">pf:textMatch</span> <span class="s">&#39;+text&#39;</span> <span class="o">.</span>
+    <span class="p">?</span><span class="n">doc</span> <span class="p">?</span><span class="n">p</span> <span class="p">?</span><span class="n">lit</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>This is a more flexible way of achieving the effect of using an <code>IndexBuilderSubject</code>. <code>IndexBuilderSubject</code> can be more compact when there are many large literals (it stores the subject, not the literal) but does not work for blank node subjects without extremely careful co-ordination with a persistent model. Looking the literal up in the model does not have this complication.</p>
+<h3 id="accessing_the_lucene_score">Accessing the Lucene Score</h3>
+<p>The application can get access to the Lucene match score by using a list argument for the subject of <code>pf:textMatch</code>. The list must have two arguments, both unbound variables at the time of the query.</p>
+<div class="codehilite"><pre><span class="n">PREFIX</span> <span class="n">pf:</span> <span class="sr">&lt;http://jena.hpl.hp.com/ARQ/property#&gt;</span>
+<span class="n">SELECT</span> <span class="p">?</span><span class="n">doc</span> <span class="p">?</span><span class="n">score</span> 
+<span class="p">{</span>
+    <span class="p">(?</span><span class="n">lit</span> <span class="p">?</span><span class="n">score</span> <span class="p">)</span> <span class="n">pf:textMatch</span> <span class="s">&#39;+text&#39;</span> <span class="o">.</span>
+    <span class="p">?</span><span class="n">doc</span> <span class="p">?</span><span class="n">p</span> <span class="p">?</span><span class="n">lit</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<h3 id="limiting_the_number_of_matches">Limiting the number of matches</h3>
+<p>When used with just a query string, pf:textMatch returns all the Lucene matches. Many applications are only interested in the first few matches (Lucene returns matches in order, highest scoring first), or only in matches above some score threshold. The query argument that forms the object of the pf:textMatch property can also be a list: the query string followed by a score threshold, a total limit on the number of results matched, or both.</p>
+<div class="codehilite"><pre><span class="p">?</span><span class="n">lit</span> <span class="n">pf:textMatch</span> <span class="p">(</span> <span class="s">&#39;+text&#39;</span> <span class="mi">100</span> <span class="p">)</span> <span class="o">.</span>        <span class="c1"># Limit to at most 100 hits</span>
+
+<span class="p">?</span><span class="n">lit</span> <span class="n">pf:textMatch</span> <span class="p">(</span> <span class="s">&#39;+text&#39;</span> <span class="mf">0.5</span> <span class="p">)</span> <span class="o">.</span>        <span class="c1"># Limit to Lucene scores of 0.5 and over.</span>
+
+<span class="p">?</span><span class="n">lit</span> <span class="n">pf:textMatch</span> <span class="p">(</span> <span class="s">&#39;+text&#39;</span> <span class="mf">0.5</span> <span class="mi">100</span> <span class="p">)</span> <span class="o">.</span>    <span class="c1"># Limit to scores of 0.5 and over, and to at most 100 hits</span>
+</pre></div>
+
+
+<h3 id="direct_application_use">Direct Application Use</h3>
+<p>The IndexLARQ class provides the ability to search programmatically, not just from ARQ. The searchModelByIndex method returns an iterator over RDFNodes.</p>
+<div class="codehilite"><pre><span class="sr">//</span> <span class="o">--</span> <span class="n">Create</span> <span class="n">the</span> <span class="n">access</span> <span class="nb">index</span>  
+<span class="n">IndexLARQ</span> <span class="nb">index</span> <span class="o">=</span> <span class="n">larqBuilder</span><span class="o">.</span><span class="n">getIndex</span><span class="p">()</span> <span class="p">;</span>
+
+<span class="n">NodeIterator</span> <span class="n">nIter</span> <span class="o">=</span> <span class="nb">index</span><span class="o">.</span><span class="n">searchModelByIndex</span><span class="p">(</span><span class="s">&quot;+text&quot;</span><span class="p">)</span> <span class="p">;</span>
+<span class="k">for</span> <span class="p">(</span> <span class="p">;</span> <span class="n">nIter</span><span class="o">.</span><span class="n">hasNext</span><span class="p">()</span> <span class="p">;</span> <span class="p">)</span>
+<span class="p">{</span>
+    <span class="sr">//</span> <span class="k">if</span> <span class="n">it</span><span class="err">&#39;</span><span class="n">s</span> <span class="n">an</span> <span class="nb">index</span> <span class="n">storing</span> <span class="n">literals</span> <span class="o">...</span>
+    <span class="n">Literal</span> <span class="n">lit</span> <span class="o">=</span> <span class="p">(</span><span class="n">Literal</span><span class="p">)</span><span class="n">nIter</span><span class="o">.</span><span class="n">nextNode</span><span class="p">()</span> <span class="p">;</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<h3 id="external_content">External Content</h3>
+<ul>
+<li>Pattern 3: index graph nodes based on strings not present in the graph.</li>
+</ul>
+<p>Sometimes the index needs to be created from external material, and searching the index returns nodes in the graph. This can be done by using <code>IndexBuilderNode</code>, which is a helper class to relate external material to some RDF node.</p>
+<p>Here, the indexed content is not in the RDF graph at all. For example, the indexed content may come from HTML, XHTML, PDFs or XML documents and the RDF graph only holds the metadata about these content items.</p>
+<p>The <a href="http://lucene.apache.org/java/docs/contributions.html">Lucene contributions page</a> lists some content converters.</p>
   </div>
 
   <div id="footer">