Posted to dev@jena.apache.org by Alexis Miara <an...@apache.org> on 2015/06/02 17:19:07 UTC

CMS diff: Text searches with SPARQL

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Alexis Miara

Index: trunk/content/documentation/query/text-query.mdtext
===================================================================
--- trunk/content/documentation/query/text-query.mdtext	(revision 1682942)
+++ trunk/content/documentation/query/text-query.mdtext	(working copy)
@@ -40,6 +40,7 @@
     -   [Configuring an analyzer](#configuring-an-analyzer)
     -   [Configuration by Code](#configuration-by-code)
     -   [Graph-specific Indexing](#graph-specific-indexing)
+    -   [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
 - [Working with Fuseki](#working-with-fuseki)
 - [Building a Text Index](#building-a-text-index)
 - [Deletion of Indexed Entities](#deletion-of-indexed-entities)
@@ -242,11 +243,20 @@
 ### Configuring an Analyzer
 
 Text to be indexed is passed through a text analyzer that divides it into tokens 
-and may perform other transformations such as eliminating stop words.  If a Lucene
-text index is used then, by default a `StandardAnalyzer` is used.  If a Solr text
+and may perform other transformations such as eliminating stop words. If a Solr text
 index is used, the analyzer used is determined by the Solr configuration.
+If a Lucene text index is used, then by default a `StandardAnalyzer` is used. However,
+it can be replaced by another analyzer with the `text:analyzer` property,
+for example a `SimpleAnalyzer`:
 
-It is possible to configure an alternative analyzer for each field indexed in a
+    <#indexLucene> a text:TextIndexLucene ;
+            text:directory <file:Lucene> ;
+            text:analyzer [
+                a text:SimpleAnalyzer
+            ]
+            . 
+
+It is also possible to configure an alternative analyzer for each field indexed in a
 Lucene index.  For example:
 
     <#entMap> a text:EntityMap ;
@@ -271,9 +281,15 @@
 In addition, Jena provides `LowerCaseKeywordAnalyzer`,
 which is a case-insensitive version of `KeywordAnalyzer`.
 
-New in Jena 2.13.0:
+In Jena 3.0.0, the new `LocalizedAnalyzer` has been introduced to support Lucene
+language-specific analyzers.
+See [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
+for details.
 
-There is an ability to specify an analyzer to be used for the
+
+#### Analyzer for Query
+
+New in Jena 2.13.0 is the optional ability to specify an analyzer to be used for the
 query string itself.  It will find terms in the query text.  If not set, then the
 analyzer used for the document will be used.  The query analyzer is specified on
 the `TextIndexLucene` resource:
@@ -338,6 +354,116 @@
 **Note:** If you migrate from a global (non-graph-aware) index to a graph-aware index,
 you need to rebuild the index to ensure that the graph information is stored.
 
+### Linguistic Support with Lucene Index
+
+It is now possible to take advantage of the language tags of triple literals to
+enhance indexing and queries. The sub-sections below detail the relevant index
+settings and the corresponding SPARQL query use cases.
+
+#### Explicit Language Field in the Index 
+
+The language tags of triple literals can be stored in the index (when triples are
+added) in order to extend query capabilities.
+For that, the new `text:langField` property must be set in the EntityMap assembler:
+
+    <#entMap> a text:EntityMap ;
+        text:entityField      "uri" ;
+        text:defaultField     "text" ;        
+        text:langField        "lang" ;       
+        . 
+
+If you configure the index via Java code, you need to set this parameter on the
+`EntityDefinition` instance, e.g.
+
+    EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
+    docDef.setLangField("lang");
+
+ 
+#### SPARQL Linguistic Clause Forms
+
+Once the `langField` is set, you can use it directly inside SPARQL queries: the `'lang:xx'`
+argument lets you target literals with a specific language tag. For example:
+
+    # target English literals
+    ?s text:query (rdfs:label 'word' 'lang:en' )
+
+    # target literals without a language tag
+    ?s text:query (rdfs:label 'word' 'lang:none')
+
+    # ignore the language field
+    ?s text:query (rdfs:label 'word')
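+
+Putting this together, here is a minimal sketch of a complete query using the
+`'lang:en'` argument (the `text:` namespace URI below is the standard jena-text
+one; the data and result bindings are illustrative):
+
+    PREFIX text: <http://jena.apache.org/text#>
+    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+    SELECT ?s ?label
+    WHERE {
+        ?s text:query ( rdfs:label 'word' 'lang:en' ) ;
+           rdfs:label ?label .
+    }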
+
+
+#### LocalizedAnalyzer
+
+You can specify a `LocalizedAnalyzer` in order to benefit from Lucene's language-specific
+analyzers (stemming, stop words, ...). As with any other analyzer, it can be used for
+the default text indexing, for individual fields, or for queries.
+
+With an assembler configuration, the `text:language` property needs to be provided, e.g.:
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        text:analyzer [
+            a text:LocalizedAnalyzer ;
+            text:language "fr"
+        ]
+        .
+
+This configures the index to analyze values of the 'text' field using a `FrenchAnalyzer`.
+
+To configure the same example via Java code, you need to provide the analyzer to the
+index configuration object:
+
+    TextIndexConfig config = new TextIndexConfig(def);
+    Analyzer analyzer = Util.getLocalizedAnalyzer("fr");
+    config.setAnalyzer(analyzer);
+    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+where `def`, `ds1` and `dir` are instances of the `EntityDefinition`, `Dataset` and
+`Directory` classes respectively.
+
+**Note**: You do not have to set the `text:langField` property when using a single
+localized analyzer.
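+
+As a sketch, a localized analyzer can likewise be supplied as the query analyzer,
+using the `text:queryAnalyzer` property described earlier (same assembler setup as
+the previous example):
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        text:queryAnalyzer [
+            a text:LocalizedAnalyzer ;
+            text:language "fr"
+        ]
+        .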
+
+#### Multilingual Support
+
+Let us suppose that we have many triples with localized literals in many different
+languages. It is possible to take all these languages into account for mixed
+localized queries. Just set the `text:multilingualSupport` property to `true` to
+automatically enable localized indexing (and the localized analyzer for queries):
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory "mem" ;
+        text:multilingualSupport true;     
+        .
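+
+For illustration, the multilingual flag can be combined with the entity map from
+earlier (a hypothetical sketch reusing names from the examples above):
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory "mem" ;
+        text:multilingualSupport true ;
+        text:entityMap <#entMap> ;
+        .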
+
+Via Java code, set the multilingual support flag:
+
+    TextIndexConfig config = new TextIndexConfig(def);
+    config.setMultilingualSupport(true);
+    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+Thus, this multilingual index dynamically combines the localized analyzers of all
+the languages encountered with the storage of the `langField` values.
+
+For example, it is possible to involve different languages in the same text search query:
+
+    SELECT ?s
+    WHERE {
+        { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
+        UNION
+        { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
+    }
+
+Hence, the result set of the query will contain subjects related to "institute"
+(institution, institutional, ...) in both French and English.
+
+**Note**: If the `text:langField` property is not set, the "lang" field will be
+used by default, because a multilingual index cannot work without it.
+
+
 ## Working with Fuseki
 
 The Fuseki configuration simply points to the text dataset as the
@@ -500,3 +626,6 @@
 
 adjusting the version <code>X.Y.Z</code> as necessary.  This will automatically
 include a compatible version of Lucene and the Solr java client, but not Solr server.
+
+
+