Posted to commits@jena.apache.org by an...@apache.org on 2015/06/17 23:36:17 UTC

svn commit: r1686115 - /jena/site/trunk/content/documentation/query/text-query.mdtext

Author: andy
Date: Wed Jun 17 21:36:17 2015
New Revision: 1686115

URL: http://svn.apache.org/r1686115
Log:
Updates for Linguistic Support

Modified:
    jena/site/trunk/content/documentation/query/text-query.mdtext

Modified: jena/site/trunk/content/documentation/query/text-query.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/query/text-query.mdtext?rev=1686115&r1=1686114&r2=1686115&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/query/text-query.mdtext (original)
+++ jena/site/trunk/content/documentation/query/text-query.mdtext Wed Jun 17 21:36:17 2015
@@ -1,5 +1,3 @@
-Title: Text searches with SPARQL
-
 This module was first released with Jena 2.11.0.
 
 This extension to ARQ combines SPARQL and text search.
@@ -40,6 +38,7 @@ the actual label.  More details are give
     -   [Configuring an analyzer](#configuring-an-analyzer)
     -   [Configuration by Code](#configuration-by-code)
     -   [Graph-specific Indexing](#graph-specific-indexing)
+    -   [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
 - [Working with Fuseki](#working-with-fuseki)
 - [Building a Text Index](#building-a-text-index)
 - [Deletion of Indexed Entities](#deletion-of-indexed-entities)
@@ -105,17 +104,14 @@ The following forms are all legal:
     ?s text:query (rdfs:label 'word') # query specific property if multiple
     ?s text:query ('word' 10)         # with limit on results
     (?s ?score) text:query 'word'     # query capturing also the score
-
+    
 The most general form is:
    
-    (?s ?score) text:query (property 'query string' 'limit')
+     (?s ?score) text:query (property 'query string' 'limit')
 
 Only the query string is required, and if it is the only argument the
 surrounding `( )` can be omitted.
 
-When a 2-element list is used as the subject, the second variable gets
-assigned the raw score from the text index as a float value.
-
 The property URI is only necessary if multiple properties have been indexed.
 
 |  Argument   |   Definition     |
@@ -246,9 +242,18 @@ needs to identify the text dataset by it
 ### Configuring an Analyzer
 
 Text to be indexed is passed through a text analyzer that divides it into tokens 
-and may perform other transformations such as eliminating stop words.  If a Lucene
-text index is used then, by default a `StandardAnalyzer` is used.  If a Solr text
+and may perform other transformations such as eliminating stop words. If a Solr text
 index is used, the analyzer used is determined by the Solr configuration.
+If a Lucene text index is used, then by default a `StandardAnalyzer` is used. However,
+it can be replaced by another analyzer with the `text:analyzer` property.
+For example, with a `SimpleAnalyzer`:
+
+    <#indexLucene> a text:TextIndexLucene ;
+            text:directory <file:Lucene> ;
+            text:analyzer [
+                a text:SimpleAnalyzer
+            ]
+            . 
 
 It is possible to configure an alternative analyzer for each field indexed in a
 Lucene index.  For example:
@@ -275,7 +280,16 @@ for details of what these analyzers do.
 In addition, Jena provides `LowerCaseKeywordAnalyzer`,
 which is a case-insensitive version of `KeywordAnalyzer`.
 
-New in Jena 2.13.0:
+In Jena 3.0.0:
+
+Support for the new `LocalizedAnalyzer` has been introduced to work with Lucene's
+language-specific analyzers.
+See the [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
+section for details.
+
+#### Analyzer for Query
+
+New in Jena 2.13.0.
 
 There is an ability to specify an analyzer to be used for the
 query string itself.  It will find terms in the query text.  If not set, then the
@@ -342,6 +356,116 @@ EntityDefinition constructors that suppo
 **Note:** If you migrate from a global (non-graph-aware) index to a graph-aware index,
 you need to rebuild the index to ensure that the graph information is stored.
 
+### Linguistic Support with Lucene Index
+
+It is now possible to use the language tags of triple literals to enhance
+indexing and queries. The subsections below describe the index settings
+and give use cases with SPARQL queries.
+
+#### Explicit Language Field in the Index 
+
+The language of each literal can be stored in the index (when triples are added)
+to extend query capabilities.
+To do so, set the new `text:langField` property in the EntityMap assembler:
+
+    <#entMap> a text:EntityMap ;
+        text:entityField      "uri" ;
+        text:defaultField     "text" ;        
+        text:langField        "lang" ;       
+        . 
+
+If you configure the index via Java code, you need to set this parameter on the
+`EntityDefinition` instance, e.g.
+
+    EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
+    docDef.setLangField("lang");
+
+ 
+#### SPARQL Linguistic Clause Forms
+
+Once the `langField` is set, you can use it directly inside SPARQL queries: the `'lang:xx'`
+argument lets you target literals in a specific language. For example:
+
+    # target English literals
+    ?s text:query (rdfs:label 'word' 'lang:en')
+    
+    # target unlocalized literals
+    ?s text:query (rdfs:label 'word' 'lang:none')
+    
+    # ignore language field
+    ?s text:query (rdfs:label 'word')
+
+
+#### LocalizedAnalyzer
+
+You can use a `LocalizedAnalyzer` to benefit from Lucene's language-specific
+analyzers (stemming, stop words, ...). Like any other analyzer, it can be set
+as the default analyzer for indexing, for an individual field, or for queries.
+
+With an assembler configuration, the `text:language` property needs to be provided, e.g.:
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        text:analyzer [
+            a text:LocalizedAnalyzer ;
+            text:language "fr"
+        ]
+        .
+
+This configures the index to analyze values of the 'text' field using a `FrenchAnalyzer`.
+
+To configure the same example via Java code, you need to provide the analyzer to the
+index configuration object:
+
+        TextIndexConfig config = new TextIndexConfig(def);
+        Analyzer analyzer = Util.getLocalizedAnalyzer("fr");
+        config.setAnalyzer(analyzer);
+        Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+Here `def`, `ds1` and `dir` are instances of the `EntityDefinition`, `Dataset` and
+`Directory` classes, respectively.
+
+**Note**: You do not have to set the `text:langField` property when using a single
+localized analyzer.
+
+#### Multilingual Support
+
+Suppose that we have many triples with localized literals in many different
+languages. It is possible to take all these languages into account for later queries that mix languages.
+Set the `text:multilingualSupport` property to `true` to automatically enable localized
+indexing (and also the localized analyzer for queries):
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory "mem" ;
+        text:multilingualSupport true;     
+        .
+
+Via Java code, set the multilingual support flag:
+
+        TextIndexConfig config = new TextIndexConfig(def);
+        config.setMultilingualSupport(true);
+        Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+This multilingual index dynamically combines the localized analyzers of all existing languages with
+the storage of `langField` values.
+
+For example, it is possible to use different languages in the same text search query:
+
+    SELECT ?s
+    WHERE {
+        { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
+        UNION
+        { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
+    }
+
+Hence, the result set of the query will contain "institute"-related subjects
+(institution, institutional, ...) in French and in English.
+
+**Note**: If the `text:langField` property is not set, the "lang" field will be
+used by default, because a multilingual index cannot work without it.
+
+
 ## Working with Fuseki
 
 The Fuseki configuration simply points to the text dataset as the