Posted to commits@jena.apache.org by bu...@apache.org on 2018/03/22 16:53:07 UTC

svn commit: r1027141 - in /websites/staging/jena/trunk/content: ./ documentation/query/text-query.html

Author: buildbot
Date: Thu Mar 22 16:53:07 2018
New Revision: 1027141

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/query/text-query.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Mar 22 16:53:07 2018
@@ -1 +1 @@
-1827504
+1827514

Modified: websites/staging/jena/trunk/content/documentation/query/text-query.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/query/text-query.html (original)
+++ websites/staging/jena/trunk/content/documentation/query/text-query.html Thu Mar 22 16:53:07 2018
@@ -156,7 +156,6 @@
   visibility: hidden;
 }
 h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
-<p>Title: Jena Full Text Search</p>
 <p>This extension to ARQ combines SPARQL and full text search via
 <a href="https://lucene.apache.org">Lucene</a> 6.4.1 or
 <a href="https://www.elastic.co">ElasticSearch</a> 5.2.1 (which is built on
@@ -731,7 +730,7 @@ are involved.</strong></p>
 </pre></div>
 
 
-<p>The <code>RIGHT_ARROW</code> is Unicode, \u21a6, and the <code>LEFT_ARROW</code> is Unicode, \u21a4. These are chosen to be single characters that in most situations will be very unlikely to occur in resulting literals. The <code>fragSize</code> of 128 is chosen to be large enough that in many situations the matches will result in single fragments. If the literal is larger than 128 characters and there are several matches in the literal then there may be additional fragments separated by the <code>DIVIDES</code>, Unicode, \u2223.</p>
+<p>The <code>RIGHT_ARROW</code> is Unicode \u21a6 and the <code>LEFT_ARROW</code> is Unicode \u21a4. These are chosen to be single characters that in most situations will be very unlikely to occur in resulting literals. The <code>fragSize</code> of 128 is chosen to be large enough that in many situations the matches will result in single fragments. If the literal is larger than 128 characters and there are several matches in the literal then there may be additional fragments separated by the <code>DIVIDES</code>, Unicode \u2223.</p>
 <p>Depending on the analyzer used and the tokenizer, the highlighting will result in marking each token rather than an entire phrase. The <code>joinHi</code> option is by default <code>true</code> so that entire phrases are highlighted together rather than as individual tokens as in:</p>
 <div class="codehilite"><pre>&quot;<span class="n">the</span> <span class="n">quick</span> ↦<span class="n">brown</span>↤ ↦<span class="n">fox</span>↤ <span class="n">jumped</span> <span class="n">over</span> <span class="n">the</span> <span class="n">lazy</span> <span class="n">baboon</span>&quot;
 </pre></div>
@@ -852,6 +851,7 @@ indexed as well.</p>
     <span class="n">text</span><span class="p">:</span><span class="n">analyzer</span> <span class="p">[</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">StandardAnalyzer</span> <span class="p">]</span> <span class="p">;</span>
     <span class="n">text</span><span class="p">:</span><span class="n">queryAnalyzer</span> <span class="p">[</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">KeywordAnalyzer</span> <span class="p">]</span> <span class="p">;</span>
     <span class="n">text</span><span class="p">:</span><span class="n">queryParser</span> <span class="n">text</span><span class="p">:</span><span class="n">AnalyzingQueryParser</span> <span class="p">;</span>
+    <span class="n">text</span><span class="p">:</span><span class="n">defineAnalyzers</span> <span class="p">[</span> <span class="p">.</span> <span class="p">.</span> <span class="p">.</span> <span class="p">]</span> <span class="p">;</span>
     <span class="n">text</span><span class="p">:</span><span class="n">multilingualSupport</span> <span class="n">true</span> <span class="p">;</span>
  <span class="p">.</span>
 </pre></div>
@@ -898,6 +898,9 @@ used to analyze the query string. If not
 <p><code>text:queryParser</code> is optional and specifies an <a href="#alternative-query-parsers">alternative query parser</a></p>
 </li>
 <li>
+<p><code>text:defineAnalyzers</code> is optional and allows specification of <a href="#defined-analyzers">additional analyzers, tokenizers and filters</a></p>
+</li>
+<li>
 <p><code>text:multilingualSupport</code> enables <a href="#multilingual-support">Multilingual Support</a></p>
 </li>
 </ul>
@@ -1060,8 +1063,10 @@ tokenizer and token filter does.</p>
 </pre></div>
 
 
-<p>Here, <code>text:tokenizer</code> must be one of the four tokenizers listed above and
-the optional <code>text:filters</code> property specifies a list of token filters.</p>
+<p>From Jena 3.7.0, it is possible to define tokenizers and filters in addition to the <em>built-in</em>
+choices above that may be used with the <code>ConfigurableAnalyzer</code>. Tokenizers and filters are 
+defined via <code>text:defineAnalyzers</code> in the <code>text:TextIndexLucene</code> assembler section
+using <a href="#generic-analyzers-tokenizers-and-filters"><code>text:GenericTokenizer</code> and <code>text:GenericFilter</code></a>.</p>
 <h4 id="analyzer-for-query">Analyzer for Query<a class="headerlink" href="#analyzer-for-query" title="Permanent link">&para;</a></h4>
 <p>New in Jena 2.13.0.</p>
 <p>There is an ability to specify an analyzer to be used for the query
@@ -1272,14 +1277,19 @@ supported, e.g., a stop words <code>File
 make use of Analyzers not included in the bundled Lucene distribution,
 e.g., a <code>SanskritIASTAnalyzer</code>. Two features have been added to enhance
 the utility of jena-text: 1) <code>text:GenericAnalyzer</code>; and 2)
-<code>text:DefinedAnalyzer</code>.</p>
-<h4 id="generic-analyzer">Generic Analyzer<a class="headerlink" href="#generic-analyzer" title="Permanent link">&para;</a></h4>
+<code>text:DefinedAnalyzer</code>. Further, since Jena 3.7.0, features to allow definition of
+tokenizers and filters are included.</p>
+<h4 id="generic-analyzers-tokenizers-and-filters">Generic Analyzers, Tokenizers and Filters<a class="headerlink" href="#generic-analyzers-tokenizers-and-filters" title="Permanent link">&para;</a></h4>
 <p>A <code>text:GenericAnalyzer</code> includes a <code>text:class</code> which is the fully
 qualified class name of an Analyzer that is accessible on the jena
 classpath. This is trivial for Analyzer classes that are included in the
 bundled Lucene distribution and for other custom Analyzers a simple
 matter of including a jar containing the custom Analyzer and any
 associated Tokenizer and Filters on the classpath.</p>
+<p>Similarly, <code>text:GenericTokenizer</code> and <code>text:GenericFilter</code> give access to any tokenizers
+or filters that are available on the Jena classpath. These two types are used <em>only</em> to define
+tokenizer and filter configurations that may be referred to when specifying a
+<a href="#configurableanalyzer">ConfigurableAnalyzer</a>.</p>
 <p>In addition to the <code>text:class</code> it is generally useful to include an
 ordered list of <code>text:params</code> that will be used to select an appropriate
 constructor of the Analyzer class. If there are no <code>text:params</code> in the
@@ -1289,7 +1299,7 @@ the list of <code>text:params</code> inc
 <ul>
 <li>an optional <code>text:paramName</code> of type <code>Literal</code> that is useful to identify the purpose of a 
 parameter in the assembler configuration</li>
-<li>a required <code>text:paramType</code> which is one of:</li>
+<li>a <code>text:paramType</code> which is one of:</li>
 </ul>
 <table class="table">
 <thead>
@@ -1325,6 +1335,9 @@ parameter in the assembler configuration
 </tr>
 </tbody>
 </table>
+<p>and is required for the types <code>text:TypeAnalyzer</code>, <code>text:TypeFile</code> and <code>text:TypeSet</code>, but,
+since Jena 3.7.0, may be implied by the form of the literal for the types: <code>text:TypeBoolean</code>,
+<code>text:TypeInt</code> and <code>text:TypeString</code>.</p>
 <ul>
 <li>a required <code>text:paramValue</code> with an object of the type corresponding to <code>text:paramType</code></li>
 </ul>
@@ -1374,10 +1387,20 @@ one approach is to define an <code>Analy
 <code>file</code>, to collect the information needed to instantiate the desired analyzer. An example of
 such an analyzer is the Kuromoji morphological analyzer for Japanese text that uses constructor 
 parameters of types: <code>UserDictionary</code>, <code>JapaneseTokenizer.Mode</code>, <code>CharArraySet</code> and <code>Set&lt;String&gt;</code>.</p>
+<p>As mentioned above, the simple types: <code>TypeInt</code>, <code>TypeBoolean</code>, and <code>TypeString</code> may be written
+without explicitly including <code>text:paramType</code> in the parameter specification. For example:</p>
+<div class="codehilite"><pre>                <span class="p">[</span> <span class="n">text</span><span class="p">:</span><span class="n">paramName</span> &quot;<span class="n">maxShingleSize</span>&quot; <span class="p">;</span>
+                  <span class="n">text</span><span class="p">:</span><span class="n">paramValue</span> 3 <span class="p">]</span>
+</pre></div>
+
+
+<p>is sufficient to specify the parameter.</p>
 <h4 id="defined-analyzers">Defined Analyzers<a class="headerlink" href="#defined-analyzers" title="Permanent link">&para;</a></h4>
 <p>The <code>text:defineAnalyzers</code> feature allows extending the <a href="#multilingual-support">Multilingual Support</a>
 defined above. Further, this feature can also be used to name analyzers defined via <code>text:GenericAnalyzer</code>
 so that a single (perhaps complex) analyzer configuration can be used in several places.</p>
+<p>Further, since Jena 3.7.0, this feature is also used to name tokenizers and filters that
+can be referred to in the specification of a <code>ConfigurableAnalyzer</code>.</p>
 <p>The <code>text:defineAnalyzers</code> is used with <code>text:TextIndexLucene</code> to provide a list of analyzer
 definitions:</p>
 <div class="codehilite"><pre><span class="o">&lt;</span>#<span class="n">indexLucene</span><span class="o">&gt;</span> <span class="n">a</span> <span class="n">text</span><span class="p">:</span><span class="n">TextIndexLucene</span> <span class="p">;</span>
@@ -1400,6 +1423,36 @@ definitions:</p>
 </pre></div>
 
 
+<p>Since Jena 3.7.0, a <code>ConfigurableAnalyzer</code> specification can refer to any defined tokenizer 
+and filters, as in:</p>
+<div class="codehilite"><pre><span class="n">text</span><span class="o">:</span><span class="n">defineAnalyzers</span> <span class="o">(</span>
+     <span class="o">[</span> <span class="n">text</span><span class="o">:</span><span class="n">defineAnalyzer</span> <span class="o">:</span><span class="n">configuredAnalyzer</span> <span class="o">;</span>
+       <span class="n">text</span><span class="o">:</span><span class="n">analyzer</span> <span class="o">[</span>
+            <span class="n">a</span> <span class="n">text</span><span class="o">:</span><span class="n">ConfigurableAnalyzer</span> <span class="o">;</span>
+            <span class="n">text</span><span class="o">:</span><span class="n">tokenizer</span> <span class="o">:</span><span class="n">ngram</span> <span class="o">;</span>
+            <span class="n">text</span><span class="o">:</span><span class="n">filters</span> <span class="o">(</span> <span class="o">:</span><span class="n">asciiff</span> <span class="n">text</span><span class="o">:</span><span class="n">LowerCaseFilter</span> <span class="o">)</span> <span class="o">]</span> <span class="o">]</span>
+     <span class="o">[</span> <span class="n">text</span><span class="o">:</span><span class="n">defineTokenizer</span> <span class="o">:</span><span class="n">ngram</span> <span class="o">;</span>
+       <span class="n">text</span><span class="o">:</span><span class="n">tokenizer</span> <span class="o">[</span>
+            <span class="n">a</span> <span class="n">text</span><span class="o">:</span><span class="n">GenericTokenizer</span> <span class="o">;</span>
+            <span class="n">text</span><span class="o">:</span><span class="kd">class</span> <span class="s2">&quot;org.apache.lucene.analysis.ngram.NGramTokenizer&quot;</span> <span class="o">;</span>
+            <span class="n">text</span><span class="o">:</span><span class="n">params</span> <span class="o">(</span>
+                 <span class="o">[</span> <span class="n">text</span><span class="o">:</span><span class="n">paramName</span> <span class="s2">&quot;minGram&quot;</span> <span class="o">;</span>
+                   <span class="n">text</span><span class="o">:</span><span class="n">paramValue</span> <span class="mi">3</span> <span class="o">]</span>
+                 <span class="o">[</span> <span class="n">text</span><span class="o">:</span><span class="n">paramName</span> <span class="s2">&quot;maxGram&quot;</span> <span class="o">;</span>
+                   <span class="n">text</span><span class="o">:</span><span class="n">paramValue</span> <span class="mi">7</span> <span class="o">]</span>
+                 <span class="o">)</span> <span class="o">]</span> <span class="o">]</span>
+     <span class="o">[</span> <span class="n">text</span><span class="o">:</span><span class="n">defineFilter</span> <span class="o">:</span><span class="n">asciiff</span> <span class="o">;</span>
+       <span class="n">text</span><span class="o">:</span><span class="n">filter</span> <span class="o">[</span>
+            <span class="n">a</span> <span class="n">text</span><span class="o">:</span><span class="n">GenericFilter</span> <span class="o">;</span>
+            <span class="n">text</span><span class="o">:</span><span class="kd">class</span> <span class="s2">&quot;org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter&quot;</span> <span class="o">;</span>
+            <span class="n">text</span><span class="o">:</span><span class="n">params</span> <span class="o">(</span>
+                 <span class="o">[</span> <span class="n">text</span><span class="o">:</span><span class="n">paramName</span> <span class="s2">&quot;preserveOriginal&quot;</span> <span class="o">;</span>
+                   <span class="n">text</span><span class="o">:</span><span class="n">paramValue</span> <span class="kc">true</span> <span class="o">]</span>
+                 <span class="o">)</span> <span class="o">]</span> <span class="o">]</span>
+     <span class="o">)</span> <span class="o">;</span>
+</pre></div>
+
+
 <h5 id="extending-multilingual-support">Extending multilingual support<a class="headerlink" href="#extending-multilingual-support" title="Permanent link">&para;</a></h5>
 <p>The <a href="#multilingual-support">Multilingual Support</a> described above allows for a limited set of 
 ISO 2-letter codes to be used to select from among built-in analyzers using the nullary constructor 
@@ -1410,24 +1463,25 @@ associated with each analyzer. So if one
 <li>refer to custom analyzers that might be associated with generalized BCP-47 language tags, 
 such as, <code>sa-x-iast</code> for Sanskrit in the IAST transliteration, </li>
 </ul>
-<p>then <code>text:defineAnalyzers</code> with <code>text:addLang</code> will add the desired analyzers to the multilingual 
-support so that fields with the appropriate language tags will use the appropriate custom analyzer.</p>
-<p>When <code>text:defineAnalyzers</code> is used with <code>text:addLang</code> then <code>text:multilingualSupport</code> is implicitly
-added if not already specified and a warning is put in the log:</p>
+<p>then <code>text:defineAnalyzers</code> with <code>text:addLang</code> will add the desired analyzers to the
+multilingual support so that fields with the appropriate language tags will use the appropriate 
+custom analyzer.</p>
+<p>When <code>text:defineAnalyzers</code> is used with <code>text:addLang</code> then <code>text:multilingualSupport</code> is 
+implicitly added if not already specified and a warning is put in the log:</p>
 <div class="codehilite"><pre>    <span class="n">text</span><span class="p">:</span><span class="n">defineAnalyzers</span> <span class="p">(</span>
         <span class="p">[</span> <span class="n">text</span><span class="p">:</span><span class="n">addLang</span> &quot;<span class="n">sa</span><span class="o">-</span><span class="n">x</span><span class="o">-</span><span class="n">iast</span>&quot; <span class="p">;</span>
           <span class="n">text</span><span class="p">:</span><span class="n">analyzer</span> <span class="p">[</span> <span class="p">.</span> <span class="p">.</span> <span class="p">.</span> <span class="p">]</span> <span class="p">]</span>
 </pre></div>
 
 
-<p>this adds an analyzer to be used when the <code>text:langField</code> has the value <code>sa-x-iast</code> during indexing
-and search.</p>
+<p>this adds an analyzer to be used when the <code>text:langField</code> has the value <code>sa-x-iast</code> during 
+indexing and search.</p>
 <h5 id="naming-analyzers-for-later-use">Naming analyzers for later use<a class="headerlink" href="#naming-analyzers-for-later-use" title="Permanent link">&para;</a></h5>
 <p>Repeating a <code>text:GenericAnalyzer</code> specification for use with multiple fields in an entity map
-may be cumbersome. The <code>text:defineAnalyzer</code> is used in an element of a <code>text:defineAnalyzers</code> list
-to associate a resource with an analyzer so that it may be referred to later in a <code>text:analyzer</code>
-object. Assuming that an analyzer definition such as the following has appeared among the
-<code>text:defineAnalyzers</code> list:</p>
+may be cumbersome. The <code>text:defineAnalyzer</code> is used in an element of a <code>text:defineAnalyzers</code> 
+list to associate a resource with an analyzer so that it may be referred to later in a 
+<code>text:analyzer</code> object. Assuming that an analyzer definition such as the following has appeared 
+among the <code>text:defineAnalyzers</code> list:</p>
 <div class="codehilite"><pre><span class="p">[</span> <span class="n">text</span><span class="p">:</span><span class="n">defineAnalyzer</span> <span class="o">&lt;</span>#<span class="n">foo</span><span class="o">&gt;</span>
   <span class="n">text</span><span class="p">:</span><span class="n">analyzer</span> <span class="p">[</span> <span class="p">.</span> <span class="p">.</span> <span class="p">.</span> <span class="p">]</span> <span class="p">]</span>
 </pre></div>