Posted to commits@jena.apache.org by bu...@apache.org on 2015/06/17 23:36:39 UTC

svn commit: r955199 - in /websites/staging/jena/trunk/content: ./ documentation/query/text-query.html

Author: buildbot
Date: Wed Jun 17 21:36:39 2015
New Revision: 955199

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/query/text-query.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Jun 17 21:36:39 2015
@@ -1 +1 @@
-1683701
+1686115

Modified: websites/staging/jena/trunk/content/documentation/query/text-query.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/query/text-query.html (original)
+++ websites/staging/jena/trunk/content/documentation/query/text-query.html Wed Jun 17 21:36:39 2015
@@ -19,7 +19,7 @@
     limitations under the License.
 -->
 
-  <title>Apache Jena - Text searches with SPARQL</title>
+  <title>Apache Jena - </title>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
@@ -143,7 +143,7 @@
     <div class="row">
     <div class="col-md-12">
     <div id="breadcrumbs"></div>
-    <h1 class="title">Text searches with SPARQL</h1>
+    <h1 class="title"></h1>
   <p>This module was first released with Jena 2.11.0.</p>
 <p>This extension to ARQ combines SPARQL and text search.</p>
 <p>It gives applications the ability to perform free text searches within
@@ -178,6 +178,7 @@ the actual label.  More details are give
 <li><a href="#configuring-an-analyzer">Configuring an analyzer</a></li>
 <li><a href="#configuration-by-code">Configuration by Code</a></li>
 <li><a href="#graph-specific-indexing">Graph-specific Indexing</a></li>
+<li><a href="#linguistic-support-with-lucene-index">Linguistic Support with Lucene Index</a></li>
 </ul>
 </li>
 <li><a href="#working-with-fuseki">Working with Fuseki</a></li>
@@ -198,11 +199,11 @@ or
 properties work with.  When data is added, any properties matching the
 description cause an entry to be added from analysed text from the triple
 object and mapping to the subject.</p>
-<h3 id="pattern-a-wzxhzdk22-rdf-data">Pattern A &ndash; RDF data</h3>
+<h3 id="pattern-a-wzxhzdk31-rdf-data">Pattern A &ndash; RDF data</h3>
 <p>In this pattern, the data in the text index is indexing literals in the RDF data.<br />
 Additions to the RDF data are reflected in additions to the index.</p>
 <p>(Deletes do not remove text index entries - <a href="#deletion-of-indexed-entities">see below</a>)</p>
-<h3 id="pattern-b-wzxhzdk23-external-content">Pattern B &ndash; External content</h3>
+<h3 id="pattern-b-wzxhzdk32-external-content">Pattern B &ndash; External content</h3>
 <p>There is no requirement that the text data indexed is present in the RDF
 data.  As long as the index contains the index text documents to match the
 index description, then text search can be performed.</p>
@@ -234,14 +235,12 @@ conveniently written:</p>
 
 
 <p>The most general form is:</p>
-<div class="codehilite"><pre><span class="p">(</span>?<span class="n">s</span> ?<span class="n">score</span><span class="p">)</span> <span class="n">text</span><span class="p">:</span><span class="n">query</span> <span class="p">(</span><span class="n">property</span> <span class="s">&#39;query string&#39;</span> <span class="s">&#39;limit&#39;</span><span class="p">)</span>
+<div class="codehilite"><pre> <span class="p">(</span>?<span class="n">s</span> ?<span class="n">score</span><span class="p">)</span> <span class="n">text</span><span class="p">:</span><span class="n">query</span> <span class="p">(</span><span class="n">property</span> <span class="s">&#39;query string&#39;</span> <span class="s">&#39;limit&#39;</span><span class="p">)</span>
 </pre></div>
 
 
 <p>Only the query string is required, and if it is the only argument the
 surrounding <code>( )</code> can be omitted.</p>
-<p>When a 2-element list is used as the subject, the second variable gets
-assigned the raw score from the text index as a float value.</p>
 <p>The property URI is only necessary if multiple properties have been indexed.</p>
 <table>
 <thead>
@@ -268,7 +267,7 @@ assigned the raw score from the text ind
 <h3 id="good-practice">Good practice</h3>
 <p>The query execution does not know the selectivity of the text index.  It is
 better to use one of two styles.</p>
-<h4 id="query-pattern-1-wzxhzdk24-find-in-the-text-index-and-enhance-results">Query pattern 1 &ndash; Find in the text index and enhance results</h4>
+<h4 id="query-pattern-1-wzxhzdk33-find-in-the-text-index-and-enhance-results">Query pattern 1 &ndash; Find in the text index and enhance results</h4>
 <p>Access to the index is first in the query and used to find a number of
 items of interest; further information is obtained about these items from
 the RDF data.</p>
@@ -282,7 +281,7 @@ the RDF data.</p>
 
 <p>Limit is useful here when working with large indexes to limit results to the
 higher-scoring results.</p>
-<h4 id="query-pattern-2-wzxhzdk25-filter">Query pattern 2 &ndash; Filter</h4>
+<h4 id="query-pattern-2-wzxhzdk34-filter">Query pattern 2 &ndash; Filter</h4>
 <p>By finding items of interest first in the RDF data, the text search can be
 used to restrict the items found still further.</p>
 <div class="codehilite"><pre><span class="n">SELECT</span> ?<span class="n">s</span>
@@ -378,9 +377,20 @@ needs to identify the text dataset by it
 <code>http://localhost/jena_example/#text_dataset</code>.</p>
 <h3 id="configuring-an-analyzer">Configuring an Analyzer</h3>
 <p>Text to be indexed is passed through a text analyzer that divides it into tokens 
-and may perform other transformations such as eliminating stop words.  If a Lucene
-text index is used then, by default a <code>StandardAnalyzer</code> is used.  If a Solr text
-index is used, the analyzer used is determined by the Solr configuration.</p>
+and may perform other transformations such as eliminating stop words. If a Solr text
+index is used, the analyzer used is determined by the Solr configuration.
+If a Lucene text index is used, then by default a <code>StandardAnalyzer</code> is used. However, 
+it can be replaced by another analyzer with the <code>text:analyzer</code> property. 
+For example, with a <code>SimpleAnalyzer</code>:</p>
+<div class="codehilite"><pre><span class="o">&lt;</span>#<span class="n">indexLucene</span><span class="o">&gt;</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">TextIndexLucene</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span class="n">directory</span> <span class="o">&lt;</span><span class="n">file</span><span class="p">:</span><span class="n">Lucene</span><span class="o">&gt;</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span class="n">analyzer</span> <span class="p">[</span>
+            <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">SimpleAnalyzer</span>
+        <span class="p">]</span>
+        <span class="p">.</span>
+</pre></div>
+
+
 <p>It is possible to configure an alternative analyzer for each field indexed in a
 Lucene index.  For example:</p>
 <div class="codehilite"><pre><span class="o">&lt;</span>#<span class="n">entMap</span><span class="o">&gt;</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">EntityMap</span> <span class="p">;</span>
@@ -405,7 +415,13 @@ neither of which has any configuration p
 for details of what these analyzers do. 
 In addition, Jena provides <code>LowerCaseKeywordAnalyzer</code>,
 which is a case-insensitive version of <code>KeywordAnalyzer</code>.</p>
-<p>New in Jena 2.13.0:</p>
+<p>In Jena 3.0.0:</p>
+<p>Support for the new <code>LocalizedAnalyzer</code> has been introduced to work with Lucene's
+language-specific analyzers.
+See the <a href="#linguistic-support-with-lucene-index">Linguistic Support with Lucene Index</a>
+section for details.</p>
+<h4 id="analyzer-for-query">Analyzer for Query</h4>
+<p>New in Jena 2.13.0.</p>
 <p>There is an ability to specify an analyzer to be used for the
 query string itself.  It will find terms in the query text.  If not set, then the
 analyzer used for the document will be used.  The query analyzer is specified on
@@ -470,6 +486,108 @@ EntityDefinition constructors that suppo
 
 <p><strong>Note:</strong> If you migrate from a global (non-graph-aware) index to a graph-aware index,
 you need to rebuild the index to ensure that the graph information is stored.</p>
+<h3 id="linguistic-support-with-lucene-index">Linguistic Support with Lucene Index</h3>
+<p>It is now possible to use the language tags of triple literals to enhance
+indexing and queries. The sub-sections below describe the relevant index settings
+and use cases with SPARQL queries.</p>
+<h4 id="explicit-language-field-in-the-index">Explicit Language Field in the Index</h4>
+<p>The language of each triple's literal can be stored in the index (when the triple is added)
+to extend query capabilities.
+To enable this, set the new <code>text:langField</code> property in the EntityMap assembler:</p>
+<div class="codehilite"><pre><span class="o">&lt;</span>#<span class="n">entMap</span><span class="o">&gt;</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">EntityMap</span> <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">entityField</span>      &quot;<span class="n">uri</span>&quot; <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">defaultField</span>     &quot;<span class="n">text</span>&quot; <span class="p">;</span>        
+    <span class="n">text</span><span class="p">:</span><span class="n">langField</span>        &quot;<span class="n">lang</span>&quot; <span class="p">;</span>       
+    <span class="p">.</span>
+</pre></div>
+
+
+<p>If you configure the index via Java code, you need to set this parameter on the
+EntityDefinition instance, e.g.</p>
+<div class="codehilite"><pre><span class="n">EntityDefinition</span> <span class="n">docDef</span> <span class="p">=</span> <span class="n">new</span> <span class="n">EntityDefinition</span><span class="p">(</span><span class="n">entityField</span><span class="p">,</span> <span class="n">defaultField</span><span class="p">);</span>
+<span class="n">docDef</span><span class="p">.</span><span class="n">setLangField</span><span class="p">(</span>&quot;<span class="n">lang</span>&quot;<span class="p">);</span>
+</pre></div>
+
+
+<h4 id="sparql-linguistic-clause-forms">SPARQL Linguistic Clause Forms</h4>
+<p>Once the <code>langField</code> is set, you can use it directly inside SPARQL queries: the <code>'lang:xx'</code>
+argument lets you target literals in a specific language. For example:</p>
+<div class="codehilite"><pre><span class="c1"># target English literals</span>
+<span class="o">?</span><span class="n">s</span> <span class="nl">text:</span><span class="n">query</span> <span class="p">(</span><span class="nl">rdfs:</span><span class="n">label</span> <span class="p">&#39;</span><span class="n">word</span><span class="sc">&#39; &#39;</span><span class="nl">lang:</span><span class="n">en</span><span class="p">&#39;</span> <span class="p">)</span>
+
+<span class="c1"># target literals without a language tag</span>
+<span class="o">?</span><span class="n">s</span> <span class="nl">text:</span><span class="n">query</span> <span class="p">(</span><span class="nl">rdfs:</span><span class="n">label</span> <span class="p">&#39;</span><span class="n">word</span><span class="sc">&#39; &#39;</span><span class="nl">lang:</span><span class="n">none</span><span class="p">&#39;)</span>
+
+<span class="c1"># ignore the language field</span>
+<span class="o">?</span><span class="n">s</span> <span class="nl">text:</span><span class="n">query</span> <span class="p">(</span><span class="nl">rdfs:</span><span class="n">label</span> <span class="p">&#39;</span><span class="n">word</span><span class="p">&#39;)</span>
+</pre></div>
+
+
+<h4 id="localizedanalyzer">LocalizedAnalyzer</h4>
+<p>You can specify and use a LocalizedAnalyzer in order to benefit from Lucene's
+language-specific analyzers (stemming, stop words, ...). Like any other analyzer, it can
+be configured for default text indexing, for an individual field, or for queries.</p>
+<p>With an assembler configuration, the <code>text:language</code> property needs to be provided, e.g.:</p>
+<div class="codehilite"><pre><span class="o">&lt;</span>#<span class="n">indexLucene</span><span class="o">&gt;</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">TextIndexLucene</span> <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">directory</span> <span class="o">&lt;</span><span class="n">file</span><span class="p">:</span><span class="n">Lucene</span><span class="o">&gt;</span> <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">entityMap</span> <span class="o">&lt;</span>#<span class="n">entMap</span><span class="o">&gt;</span> <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">analyzer</span> <span class="p">[</span>
+        <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">LocalizedAnalyzer</span> <span class="p">;</span>
+        <span class="n">text</span><span class="p">:</span><span class="n">language</span> &quot;<span class="n">fr</span>&quot;
+    <span class="p">]</span>
+    <span class="p">.</span>
+</pre></div>
+
+
+<p>This configures the index to analyze values of the 'text' field using a FrenchAnalyzer.</p>
+<p>To configure the same example via Java code, you need to provide the analyzer to the
+index configuration object:</p>
+<div class="codehilite"><pre>    <span class="n">TextIndexConfig</span> <span class="n">config</span> <span class="p">=</span> <span class="n">new</span> <span class="n">TextIndexConfig</span><span class="p">(</span><span class="n">def</span><span class="p">);</span>
+    <span class="n">Analyzer</span> <span class="n">analyzer</span> <span class="p">=</span> <span class="n">Util</span><span class="p">.</span><span class="n">getLocalizedAnalyzer</span><span class="p">(</span>&quot;<span class="n">fr</span>&quot;<span class="p">);</span>
+    <span class="n">config</span><span class="p">.</span><span class="n">setAnalyzer</span><span class="p">(</span><span class="n">analyzer</span><span class="p">);</span>
+    <span class="n">Dataset</span> <span class="n">ds</span> <span class="p">=</span> <span class="n">TextDatasetFactory</span><span class="p">.</span><span class="n">createLucene</span><span class="p">(</span><span class="n">ds1</span><span class="p">,</span> <span class="n">dir</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span> <span class="p">;</span>
+</pre></div>
+
+
+<p>Here, <code>def</code>, <code>ds1</code> and <code>dir</code> are instances of the <code>EntityDefinition</code>, <code>Dataset</code> and
+<code>Directory</code> classes, respectively.</p>
+<p><strong>Note</strong>: You do not have to set the <code>text:langField</code> property when using a single
+localized analyzer.</p>
+<h4 id="multilingual-support">Multilingual Support</h4>
+<p>Suppose that we have many triples with localized literals in many different
+languages. It is possible to take all of these languages into account for later mixed-language queries.
+Set the <code>text:multilingualSupport</code> property to <code>true</code> to automatically enable localized
+indexing (and also the localized analyzer for queries):</p>
+<div class="codehilite"><pre><span class="o">&lt;</span>#<span class="n">indexLucene</span><span class="o">&gt;</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">TextIndexLucene</span> <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">directory</span> &quot;<span class="n">mem</span>&quot; <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">multilingualSupport</span> <span class="n">true</span><span class="p">;</span>     
+    <span class="p">.</span>
+</pre></div>
+
+
+<p>Via Java code, set the multilingual support flag:</p>
+<div class="codehilite"><pre>    <span class="n">TextIndexConfig</span> <span class="n">config</span> <span class="p">=</span> <span class="n">new</span> <span class="n">TextIndexConfig</span><span class="p">(</span><span class="n">def</span><span class="p">);</span>
+    <span class="n">config</span><span class="p">.</span><span class="n">setMultilingualSupport</span><span class="p">(</span><span class="n">true</span><span class="p">);</span>
+    <span class="n">Dataset</span> <span class="n">ds</span> <span class="p">=</span> <span class="n">TextDatasetFactory</span><span class="p">.</span><span class="n">createLucene</span><span class="p">(</span><span class="n">ds1</span><span class="p">,</span> <span class="n">dir</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span> <span class="p">;</span>
+</pre></div>
+
+
+<p>This multilingual index dynamically combines the localized analyzers of all existing languages with
+the storage of the langField values.</p>
+<p>For example, a single text search query can cover several languages:</p>
+<div class="codehilite"><pre><span class="n">SELECT</span> ?<span class="n">s</span>
+<span class="n">WHERE</span> <span class="p">{</span>
+    <span class="p">{</span> ?<span class="n">s</span> <span class="n">text</span><span class="p">:</span><span class="n">query</span> <span class="p">(</span> <span class="n">rdfs</span><span class="p">:</span><span class="n">label</span> <span class="s">&#39;institut&#39;</span> <span class="s">&#39;lang:fr&#39;</span> <span class="p">)</span> <span class="p">}</span>
+    <span class="n">UNION</span>
+    <span class="p">{</span> ?<span class="n">s</span> <span class="n">text</span><span class="p">:</span><span class="n">query</span> <span class="p">(</span> <span class="n">rdfs</span><span class="p">:</span><span class="n">label</span> <span class="s">&#39;institute&#39;</span> <span class="s">&#39;lang:en&#39;</span> <span class="p">)</span> <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>The result set of this query will therefore contain subjects related to "institute"
+(institution, institutional, ...) in both French and English.</p>
+<p><strong>Note</strong>: If the <code>text:langField</code> property is not set, the "lang" field will be
+used by default, because a multilingual index cannot work without it.</p>
 <h2 id="working-with-fuseki">Working with Fuseki</h2>
 <p>The Fuseki configuration simply points to the text dataset as the
 <code>fuseki:dataset</code> of the service.</p>