Posted to commits@jena.apache.org by an...@apache.org on 2017/06/29 06:46:48 UTC
svn commit: r1800234 -
/jena/site/trunk/content/documentation/query/text-query-new.mdtext
Author: andy
Date: Thu Jun 29 06:46:47 2017
New Revision: 1800234
URL: http://svn.apache.org/viewvc?rev=1800234&view=rev
Log:
WIP: New version for comparison
Added:
jena/site/trunk/content/documentation/query/text-query-new.mdtext
Added: jena/site/trunk/content/documentation/query/text-query-new.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/query/text-query-new.mdtext?rev=1800234&view=auto
==============================================================================
--- jena/site/trunk/content/documentation/query/text-query-new.mdtext (added)
+++ jena/site/trunk/content/documentation/query/text-query-new.mdtext Thu Jun 29 06:46:47 2017
@@ -0,0 +1,1005 @@
+Title: Jena Full Text Search
+
+This extension to ARQ combines SPARQL and full text search via [Lucene](https://lucene.apache.org) 6.4.1 or
+[Elasticsearch](https://www.elastic.co) 5.2.1 (which is built on Lucene). It gives applications the ability
+to perform indexed full text searches within SPARQL queries.
+
+Recall that SPARQL allows the use of [regex](https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#func-regex)
+in `FILTER`s; however, such use _is not indexed_. For example, if you're searching for occurrences of `"printer"` in
+the `rdfs:label` of a bunch of products:
+
+ PREFIX ex: <http://www.example.org/resources#>
+ PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+ SELECT ?s ?lbl
+ WHERE {
+ ?s a ex:Product ;
+ rdfs:label ?lbl
+ FILTER regex(?lbl, "printer", "i")
+ }
+
+then the search will need to examine _all_ selected `rdfs:label` statements and apply the regular expression
+to each label in turn. If there are many such statements and many such uses of `regex`, then it may be appropriate
+to consider using this extension to take advantage of the performance potential of full text indexing.
+
+Text indexes provide additional information for accessing the RDF graph by allowing the application to have _indexed
+access_ to the internal structure of string literals rather than treating such literals as opaque items.
+Assuming appropriate [configuration](#configuration), the above query can use full text search via the
+[ARQ property function extension](https://jena.apache.org/documentation/query/extension.html#property-functions),
+`text:query`:
+
+ PREFIX ex: <http://www.example.org/resources#>
+ PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+ PREFIX text: <http://jena.apache.org/text#>
+
+ SELECT ?s ?lbl
+ WHERE {
+ ?s a ex:Product ;
+ text:query (rdfs:label 'printer') ;
+ rdfs:label ?lbl
+ }
+
+This query makes a text query for `'printer'` on the `rdfs:label` property; and then looks in the RDF data and retrieves
+the complete label for each match.
+
+The full text engine can be either [Apache Lucene](http://lucene.apache.org/core) hosted with Jena on
+a single machine, or [Elasticsearch](https://www.elastic.co/) for a large scale enterprise search application
+where the full text engine is potentially distributed across separate machines.
+
+This [example code](https://github.com/apache/jena/tree/master/jena-text/src/main/java/examples/) illustrates
+creating an in-memory dataset with a Lucene index.
+
+This module was first released with Jena 2.11.0.
+
+This module is not compatible with the much older LARQ module.
+
+## Table of Contents
+
+- [Architecture](#architecture)
+- [Query with SPARQL](#query-with-sparql)
+- [Configuration](#configuration)
+ - [Text Dataset Assembler](#text-dataset-assembler)
+ - [Configuring an analyzer](#configuring-an-analyzer)
+ - [Configuration by Code](#configuration-by-code)
+ - [Graph-specific Indexing](#graph-specific-indexing)
+ - [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
+ - [Generic and Defined Analyzer Support](#generic-and-defined-analyzer-support)
+ - [Storing Literal Values](#storing-literal-values)
+- [Working with Fuseki](#working-with-fuseki)
+- [Building a Text Index](#building-a-text-index)
+- [Configuring Alternative TextDocProducers](#configuring-alternative-textdocproducers)
+- [Maven Dependency](#maven-dependency)
+
+## Architecture
+
+In general, a text index engine (Lucene or Elasticsearch) indexes _documents_ where each document is
+a collection of _fields_, the values of which are indexed so that searches matching contents of specified
+fields can return a reference to the document containing the fields with matching values.
+
+The basic idea of the Jena text extension is to associate a triple with a document and the _property_
+of the triple with a _field_ of a document and the _object_ of the triple (which must be a literal) with
+the value of the field in the document. The _subject_ of the triple then becomes another field of the
+document that is returned as the result of a search match to identify what was matched. (NB, the
+particular triple that matched is not identified, only its subject.)
+
+In this manner, the text index provides an inverted index that maps query string matches to subject URIs.
+
+A text-indexed dataset is configured with a description of which properties are to be indexed. When triples
+are added, any properties matching the description cause a document to be added to the index
+by analyzing the literal value of the triple object and mapping to the subject URI. On the other hand, it is
+necessary to specifically configure the text-indexed dataset to [delete index entries](#entity-map-definition)
+when the corresponding triples are dropped from the RDF store.
+
+The text index uses the native query language of the index:
+[Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+or
+[Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
+
+### External content
+
+It is also possible that the indexed text is content external to the RDF store with only additional triples
+(about the indexed text) in the RDF store. The subject URI returned as a search result may then be considered
+to refer via the indexed property to the external content.
+
+There is no requirement that the text data indexed is present in the RDF
+data. As long as the index contains the index text documents to match the
+index description, then text search can be performed.
+
+For example, if the content of a collection of documents is indexed and the
+URI naming the document is the result of the text search, then an RDF
+dataset with the document metadata can be combined with accessing the
+content by URI.
+
+The maintenance of the index is external to the RDF data store.
+
+### External applications
+
+By using Elasticsearch, other applications can share the text index with SPARQL search.
+
+## Query with SPARQL
+
+The URI of the text extension property function is `http://jena.apache.org/text#query`, more
+conveniently written:
+
+ PREFIX text: <http://jena.apache.org/text#>
+
+ ... text:query ...
+
+| Argument | Definition |
+|-------------------|--------------------------------|
+| property | (optional) URI (including prefix name form) |
+| query string | The native query string |
+| limit | (optional) `int` limit on the number of results |
+
+The following forms are all legal:
+
+ ?s text:query 'word' # query
+ ?s text:query (rdfs:label 'word') # query specific property if multiple
+ ?s text:query ('word' 10) # with limit on results
+ (?s ?score) text:query 'word' # query capturing also the score
+ (?s ?score ?literal) text:query 'word' # ... and original literal value
+
+The most general form is:
+
+ (?s ?score ?literal) text:query (property 'query string' limit)
+
+Only the query string is required, and if it is the only argument the
+surrounding `( )` can be omitted.
+
+The `property` URI is only necessary if multiple properties have been indexed and the property
+being searched over is not the [default field of the index](#entity-map-definition).
+Also the `property` URI **must not** be used when the `query string` refers explicitly to one or more
+fields.
+
+The results include the subject URI, `?s`; the `?score` assigned by the text search engine;
+and the entire matched `?literal`
+(if the index has been [configured to store literal values](#text-dataset-assembler)).
+
+If the `query string` refers to more than one field, e.g.,
+
+ "label: printer AND description: \"large capacity cartridge\""
+
+then the `?literal` in the results will not be bound since there is no single field that contains
+the match – the match is separated over two fields.
+
+### Good practice
+
+The query engine does not have information about the selectivity of the text index and so effective
+query plans cannot be determined programmatically. It is helpful to be aware of the following two
+general query patterns.
+
+#### Query pattern 1 – Find in the text index and refine results
+
+Access to the text index is first in the query and used to find a number of
+items of interest; further information is obtained about these items from
+the RDF data.
+
+ SELECT ?s
+ { ?s text:query (rdfs:label 'word' 10) ;
+ rdfs:label ?label ;
+ rdf:type ?type
+ }
+
+The `text:query` limit argument is useful when working with large indexes to limit results to the
+higher scoring results – results are returned in the order of scoring by the text search engine.
+
+#### Query pattern 2 – Filter results via the text index
+
+By finding items of interest first in the RDF data, the text search can be
+used to restrict the items found still further.
+
+ SELECT ?s
+ { ?s rdf:type :book ;
+ dc:creator "John" .
+ ?s text:query (dc:title 'word') ;
+ }
+
+## Configuration
+
+The usual way to describe a text index is with a
+[Jena assembler description](../assembler/index.html). Configurations can
+also be built with code. The assembler describes a 'text
+dataset' which has an underlying RDF dataset and a text index. The text
+index describes the text index technology (Lucene or Elasticsearch) and the details
+needed for each.
+
+A text index has an "entity map" which defines the properties to
+index, the name of the Lucene/Elasticsearch field for each property, and the field used for storing the URI
+itself.
+
+For simple RDF use, there will be one field, mapping a property to a text
+index field. More complex setups, with multiple properties per entity
+(URI) are possible.
+
+Once configured, any data added to the text dataset is automatically
+indexed as well.
+
+### Text Dataset Assembler
+
+The following is an example of a TDB dataset with a text index.
+
+ @prefix : <http://localhost/jena_example/#> .
+ @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+ @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
+ @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
+ @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
+ @prefix text: <http://jena.apache.org/text#> .
+
+ ## Example of a TDB dataset and text index
+ ## Initialize TDB
+ [] ja:loadClass "org.apache.jena.tdb.TDB" .
+ tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
+ tdb:GraphTDB rdfs:subClassOf ja:Model .
+
+ ## Initialize text query
+ [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
+ # A TextDataset is a regular dataset with a text index.
+ text:TextDataset rdfs:subClassOf ja:RDFDataset .
+ # Lucene index
+ text:TextIndexLucene rdfs:subClassOf text:TextIndex .
+ # Elasticsearch index
+ text:TextIndexES rdfs:subClassOf text:TextIndex .
+
+ ## ---------------------------------------------------------------
+ ## This URI must be fixed - it's used to assemble the text dataset.
+
+ :text_dataset rdf:type text:TextDataset ;
+ text:dataset <#dataset> ;
+ text:index <#indexLucene> ;
+ .
+
+ # A TDB dataset used for RDF storage
+ <#dataset> rdf:type tdb:DatasetTDB ;
+ tdb:location "DB" ;
+ tdb:unionDefaultGraph true ; # Optional
+ .
+
+ # Text index description
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory <file:/some/path/lucene-index> ;
+ text:entityMap <#entMap> ;
+ text:storeValues true ;
+ text:analyzer [ a text:StandardAnalyzer ] ;
+ text:queryAnalyzer [ a text:KeywordAnalyzer ] ;
+ text:queryParser text:AnalyzingQueryParser ;
+ text:multilingualSupport true ;
+ .
+
+The `text:TextDataset` has two properties:
+
+- a `text:dataset`, e.g., a `tdb:DatasetTDB`, to contain
+the RDF triples; and
+
+- an index configured to use either `text:TextIndexLucene` or `text:TextIndexES`.
+
+The `<#indexLucene>` instance of `text:TextIndexLucene`, above, has two required properties:
+
+- the `text:directory`
+file URI which specifies the directory that will contain the Lucene index files – if this has the
+value `"mem"` then the index resides in memory;
+
+- the `text:entityMap`, `<#entMap>` that will define
+what properties are to be indexed and other features of the index;
+
+and several optional properties:
+
+- `text:storeValues` controls the [storing of literal values](#storing-literal-values).
+It indicates whether values are stored or not – values must be stored for the
+[`?literal` return value](#query-with-sparql) to be available in `text:query` in SPARQL.
+
+- `text:analyzer` specifies the default [analyzer configuration](#configuring-an-analyzer) to be used
+during indexing and querying. If not set, it defaults to Lucene's `StandardAnalyzer`.
+
+- `text:queryAnalyzer` specifies an optional [analyzer for query](#analyzer-for-query) that will be
+used to analyze the query string. If not set, the analyzer used to index a given field is used.
+
+- `text:queryParser` is optional and specifies an [alternative query parser](#alternative-query-parsers).
+
+- `text:multilingualSupport` enables [Multilingual Support](#multilingual-support)
+
+If using Elasticsearch then an index would be configured as follows:
+
+ <#indexES> a text:TextIndexES ;
+ text:serverList "127.0.0.1:9300" ; # A comma-separated list of host:port values of the Elasticsearch cluster nodes.
+ text:clusterName "elasticsearch" ; # Name of the Elasticsearch cluster. If not specified, defaults to 'elasticsearch'
+ text:shards "1" ; # The number of shards for the index. Defaults to 1
+ text:replicas "1" ; # The number of replicas for the index. Defaults to 1
+ text:indexName "jena-text" ; # Name of the index. Defaults to jena-text
+ text:entityMap <#entMap> ;
+ .
+
+and `text:index <#indexES> ;` would be used in the configuration of `:text_dataset`.
+
+To use a text index assembler configuration in Java code, it is necessary to identify the dataset URI to
+be assembled, such as in:
+
+ Dataset ds = DatasetFactory.assemble(
+ "text-config.ttl",
+ "http://localhost/jena_example/#text_dataset") ;
+
+since the assembler contains two dataset definitions, one for
+the text dataset, one for the base data. Therefore, the application
+needs to identify the text dataset by its URI
+`http://localhost/jena_example/#text_dataset`.
+
+### Entity Map definition
+
+A `text:EntityMap` has several properties that condition what is indexed, what information is stored, and
+what analyzers are used.
+
+ <#entMap> a text:EntityMap ;
+ text:defaultField "label" ;
+ text:entityField "uri" ;
+ text:uidField "uid" ;
+ text:langField "lang" ;
+ text:graphField "graph" ;
+ text:map (
+ [ text:field "label" ;
+ text:predicate rdfs:label ]
+ ) .
+
+#### Default text field
+
+The `text:defaultField` specifies the default field name that Lucene will use in a query that does
+not otherwise specify a field. For example,
+
+ ?s text:query "\"bread and butter\""
+
+will perform a search in the `label` field for the phrase `"bread and butter"`.
+
+#### Entity field
+
+The `text:entityField` specifies the field name of the field that will contain the subject URI that
+is returned on a match. The value of the property is arbitrary so long as it is unique among the
+defined names.
+
+#### Automatic document deletion
+
+When the `text:uidField` is defined in the `EntityMap` then dropping a triple will result in the
+corresponding document, if any, being deleted from the text index. The value, `"uid"`, is arbitrary
+and defines the name of a stored field in Lucene that holds a unique ID that represents the triple.
+
+If you configure the index via Java code, you need to set this parameter on the
+`EntityDefinition` instance, e.g.
+
+ EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
+ docDef.setUidField("uid");
+
+**Note**: If you migrate from an index without deletion support to an index with automatic deletion,
+you will need to rebuild the index to ensure that the uid information is stored.
+
+#### Language Field
+
+The `text:langField` is the name of the field that will store the language attribute of the literal
+in the case of an `rdf:langString`. This Entity Map property is a key element of the
+[Linguistic support with Lucene index](#linguistic-support-with-lucene-index).
+
+#### Graph Field
+
+Setting the `text:graphField` allows [graph-specific indexing](#graph-specific-indexing) of the text
+index to limit searching to a specified graph when a SPARQL query targets a single named graph. The
+field value is arbitrary and serves to store the graph ID that a triple belongs to when the index is
+updated.
+
+#### The Analyzer Map
+
+The `text:map` is a list of [analyzer specifications](#configuring-an-analyzer) as described below.
+
+### Configuring an Analyzer
+
+Text to be indexed is passed through a text analyzer that divides it into tokens
+and may perform other transformations such as eliminating stop words. If a Lucene
+or Elasticsearch text index is used, then by default the Lucene `StandardAnalyzer` is used.
+
+In case of a `TextIndexLucene` the default analyzer can be replaced by another analyzer with
+the `text:analyzer` property on the `text:TextIndexLucene` resource in the
+[text dataset assembler](#text-dataset-assembler), for example with a `SimpleAnalyzer`:
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory <file:Lucene> ;
+ text:analyzer [
+ a text:SimpleAnalyzer
+ ]
+ .
+
+It is possible to configure an alternative analyzer for each field indexed in a
+Lucene index. For example:
+
+ <#entMap> a text:EntityMap ;
+ text:entityField "uri" ;
+ text:defaultField "text" ;
+ text:map (
+ [ text:field "text" ;
+ text:predicate rdfs:label ;
+ text:analyzer [
+ a text:StandardAnalyzer ;
+ text:stopWords ("a" "an" "and" "but")
+ ]
+ ]
+ ) .
+
+will configure the index to analyze values of the 'text' field
+using a `StandardAnalyzer` with the given list of stop words.
+
+Other analyzer types that may be specified are `SimpleAnalyzer` and
+`KeywordAnalyzer`, neither of which has any configuration parameters. See
+the Lucene documentation for details of what these analyzers do. Jena also
+provides `LowerCaseKeywordAnalyzer`, which is a case-insensitive version of
+`KeywordAnalyzer`, and [`ConfigurableAnalyzer`](#configurableanalyzer).
+
+Support for the new `LocalizedAnalyzer` was introduced in Jena 3.0.0 to
+deal with Lucene language-specific analyzers. See the [Linguistic Support with
+Lucene Index](#linguistic-support-with-lucene-index) section for details.
+
+Support for `GenericAnalyzer`s has been introduced in Jena 3.4.0 to allow
+the use of Analyzers that do not have built-in support, e.g., `BrazilianAnalyzer`;
+require constructor parameters not otherwise supported, e.g., a stop words `FileReader` or
+a `stemExclusionSet`; and finally use of Analyzers not included in the bundled
+Lucene distribution, e.g., a `SanskritIASTAnalyzer`. See [Generic and Defined
+Analyzer Support](#generic-and-defined-analyzer-support).
+
+#### ConfigurableAnalyzer
+
+`ConfigurableAnalyzer` was introduced in Jena 3.0.1. It allows more detailed
+configuration of text analysis parameters by independently selecting a
+`Tokenizer` and zero or more `TokenFilter`s which are applied in order after
+tokenization. See the Lucene documentation for details on what each
+tokenizer and token filter does.
+
+The available `Tokenizer` implementations are:
+
+* `StandardTokenizer`
+* `KeywordTokenizer`
+* `WhitespaceTokenizer`
+* `LetterTokenizer`
+
+The available `TokenFilter` implementations are:
+
+* `StandardFilter`
+* `LowerCaseFilter`
+* `ASCIIFoldingFilter`
+
+Configuration is done using the Jena assembler, like this:
+
+ text:analyzer [
+ a text:ConfigurableAnalyzer ;
+ text:tokenizer text:KeywordTokenizer ;
+ text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
+ ]
+
+Here, `text:tokenizer` must be one of the four tokenizers listed above and
+the optional `text:filters` property specifies a list of token filters.
+
+#### Analyzer for Query
+
+New in Jena 2.13.0.
+
+An analyzer can be specified for the
+query string itself; it is used to find terms in the query text. If not set, then the
+analyzer used for the document will be used. The query analyzer is specified on
+the `TextIndexLucene` resource:
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory <file:Lucene> ;
+ text:entityMap <#entMap> ;
+ text:queryAnalyzer [
+ a text:KeywordAnalyzer
+ ]
+ .
+
+#### Alternative Query Parsers
+
+New in Jena 3.1.0.
+
+It is possible to select a query parser other than the default QueryParser.
+
+The available `QueryParser` implementations are:
+
+* `AnalyzingQueryParser`: Performs analysis for wildcard queries. This is useful in combination
+with accent-insensitive wildcard queries.
+* `ComplexPhraseQueryParser`: Permits complex phrase query syntax, e.g., `"(john jon jonathan~) peters*"`.
+This is useful for performing wildcard or fuzzy queries on individual terms in a phrase.
+
+The query parser is specified on
+the `TextIndexLucene` resource:
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory <file:Lucene> ;
+ text:entityMap <#entMap> ;
+ text:queryParser text:AnalyzingQueryParser .
+
+
+Elasticsearch currently doesn't support Analyzers beyond Standard Analyzer.
+
+### Configuration by Code
+
+A text dataset can also be constructed in code as might be done for a
+purely in-memory setup:
+
+ // Example of building a text dataset with code.
+ // Example is in-memory.
+ // Base dataset
+ Dataset ds1 = DatasetFactory.createMem() ;
+
+ EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label) ;
+
+ // Lucene, in memory.
+ Directory dir = new RAMDirectory();
+
+ // Join together into a dataset
+ Dataset ds = TextDatasetFactory.createLucene(ds1, dir, entDef) ;
+
+### Graph-specific Indexing
+
+Starting with version 1.0.1, jena-text supports
+storing information about the source graph into the text index. This allows
+for more efficient text queries when the query targets only a single named
+graph. Without graph-specific indexing, text queries do not distinguish named
+graphs and will always return results from all graphs.
+
+Support for graph-specific indexing is enabled by defining the name of the
+index field to use for storing the graph identifier.
+
+If you use an assembler configuration, set the graph field using the
+text:graphField property on the EntityMap, e.g.
+
+ # Mapping in the index
+ # URI stored in field "uri"
+ # Graph stored in field "graph"
+ # rdfs:label is mapped to field "text"
+ <#entMap> a text:EntityMap ;
+ text:entityField "uri" ;
+ text:graphField "graph" ;
+ text:defaultField "text" ;
+ text:map (
+ [ text:field "text" ; text:predicate rdfs:label ]
+ ) .
+
+If you configure the index in Java code, you need to use one of the
+EntityDefinition constructors that support the graphField parameter, e.g.
+
+ EntityDefinition entDef = new EntityDefinition("uri", "text", "graph", RDFS.label.asNode()) ;
+
+**Note:** If you migrate from a global (non-graph-aware) index to a graph-aware index,
+you need to rebuild the index to ensure that the graph information is stored.
+
+### Linguistic support with Lucene index
+
+The language tags of literals can be used to enhance
+indexing and queries. The sub-sections below detail the relevant index settings
+and use cases with SPARQL queries.
+
+#### Explicit Language Field in the Index
+
+The language of each literal can be stored in the index (when the triple is added)
+to extend query capabilities.
+For that, the `text:langField` property must be set in the EntityMap assembler:
+
+ <#entMap> a text:EntityMap ;
+ text:entityField "uri" ;
+ text:defaultField "text" ;
+ text:langField "lang" ;
+ .
+
+If you configure the index via Java code, you need to set this parameter on the
+`EntityDefinition` instance, e.g.
+
+ EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
+ docDef.setLangField("lang");
+
+
+#### SPARQL Linguistic Clause Forms
+
+Once the `langField` is set, you can use it directly inside SPARQL queries: the `'lang:xx'`
+argument allows you to target specific localized values. For example:
+
+ # target English literals
+ ?s text:query (rdfs:label 'word' 'lang:en' )
+
+ # target unlocalized literals
+ ?s text:query (rdfs:label 'word' 'lang:none')
+
+ # ignore language field
+ ?s text:query (rdfs:label 'word')
+
+
+#### LocalizedAnalyzer
+
+You can specify a `LocalizedAnalyzer` in order to benefit from Lucene's language-specific
+analyzers (stemming, stop words, ...). Like any other analyzer, it can
+be set for default text indexing, for individual fields, or for queries.
+
+With an assembler configuration, the `text:language` property needs to be provided, e.g.:
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory <file:Lucene> ;
+ text:entityMap <#entMap> ;
+ text:analyzer [
+ a text:LocalizedAnalyzer ;
+ text:language "fr"
+ ]
+ .
+
+will configure the index to analyze values of the 'text' field using a FrenchAnalyzer.
+
+To configure the same example via Java code, you need to provide the analyzer to the
+index configuration object:
+
+ TextIndexConfig config = new TextIndexConfig(def);
+ Analyzer analyzer = Util.getLocalizedAnalyzer("fr");
+ config.setAnalyzer(analyzer);
+ Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+Where `def`, `ds1` and `dir` are instances of `EntityDefinition`, `Dataset` and
+`Directory` classes.
+
+**Note**: You do not have to set the `text:langField` property with a single
+localized analyzer.
+
+#### Multilingual Support
+
+Suppose that we have many triples with localized literals in many different
+languages. It is possible to take all these languages into account for mixed-language queries.
+Set the `text:multilingualSupport` property to `true` to automatically enable localized
+indexing (and also the localized analyzer for queries):
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory "mem" ;
+ text:multilingualSupport true;
+ .
+
+Via Java code, set the multilingual support flag:
+
+ TextIndexConfig config = new TextIndexConfig(def);
+ config.setMultilingualSupport(true);
+ Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+Thus, this multilingual index dynamically combines the localized analyzers of all languages present and
+the storage of `langField` properties.
+
+For example, it is possible to refer to different languages in the same text search query:
+
+ SELECT ?s
+ WHERE {
+ { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
+ UNION
+ { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
+ }
+
+Hence, the result set of the query will contain "institute" related subjects
+(institution, institutional, ...) in French and in English.
+
+**Note**: If the `text:langField` property is not set, it will default to `"lang"`.
+
+### Generic and Defined Analyzer Support
+
+There are many Analyzers that do not have built-in support, e.g., `BrazilianAnalyzer`;
+require constructor parameters not otherwise supported, e.g., a stop words `FileReader` or
+a `stemExclusionSet`; or make use of Analyzers not included in the bundled
+Lucene distribution, e.g., a `SanskritIASTAnalyzer`. Two features have been added to enhance
+the utility of jena-text: 1) `text:GenericAnalyzer`; and 2) `text:DefinedAnalyzer`.
+
+#### Generic Analyzer
+
+A `text:GenericAnalyzer` includes a `text:class` which is the fully qualified class name of an
+Analyzer that is accessible on the Jena classpath. This is trivial for Analyzer classes that are
+included in the bundled Lucene distribution and for other custom Analyzers a simple matter of
+including a jar containing the custom Analyzer and any associated Tokenizer and Filters on
+the classpath.
+
+In addition to the `text:class` it is generally useful to include an ordered list of `text:params`
+that will be used to select an appropriate constructor of the Analyzer class. If there are no
+`text:params` in the analyzer specification or if the `text:params` is an empty list then the
+nullary constructor is used to instantiate the analyzer. Each element of the list of `text:params`
+includes:
+
+* an optional `text:paramName` of type `Literal` that is useful to identify the purpose of a
+parameter in the assembler configuration
+* a required `text:paramType` which is one of:
+
+| Type | Description |
+|-------------------|--------------------------------|
+|`text:TypeAnalyzer`|a subclass of `org.apache.lucene.analysis.Analyzer`|
+|`text:TypeBoolean`|a java `boolean`|
+|`text:TypeFile`|the `String` path to a file materialized as a `java.io.FileReader`|
+|`text:TypeInt`|a java `int`|
+|`text:TypeString`|a java `String`|
+|`text:TypeSet`|an `org.apache.lucene.analysis.CharArraySet`|
+
+* a required `text:paramValue` with an object of the type corresponding to `text:paramType`
+
+In the case of a `text:TypeAnalyzer` parameter the `text:paramValue` is any `text:analyzer` resource as
+described throughout this document.
+
+An example of the use of `text:GenericAnalyzer` to configure an `EnglishAnalyzer` with stop
+words and stem exclusions is:
+
+ text:map (
+ [ text:field "text" ;
+ text:predicate rdfs:label;
+ text:analyzer [
+ a text:GenericAnalyzer ;
+ text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
+ text:params (
+ [ text:paramName "stopwords" ;
+ text:paramType text:TypeSet ;
+ text:paramValue ("the" "a" "an") ]
+ [ text:paramName "stemExclusionSet" ;
+ text:paramType text:TypeSet ;
+ text:paramValue ("ing" "ed") ]
+ )
+ ] ] ) .
+
+Here is an example of defining an instance of `ShingleAnalyzerWrapper`:
+
+ text:map (
+ [ text:field "text" ;
+ text:predicate rdfs:label;
+ text:analyzer [
+ a text:GenericAnalyzer ;
+ text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
+ text:params (
+ [ text:paramName "defaultAnalyzer" ;
+ text:paramType text:TypeAnalyzer ;
+ text:paramValue [ a text:SimpleAnalyzer ] ]
+ [ text:paramName "maxShingleSize" ;
+ text:paramType text:TypeInt ;
+ text:paramValue 3 ]
+ )
+ ] ] ) .
+
+If there is a need to use an analyzer with constructor parameter types not included here, then
+one approach is to define an `AnalyzerWrapper` that uses available parameter types, such as
+`text:TypeFile`, to collect the information needed to instantiate the desired analyzer. An example of
+such an analyzer is the Kuromoji morphological analyzer for Japanese text that uses constructor
+parameters of types: `UserDictionary`, `JapaneseTokenizer.Mode`, `CharArraySet` and `Set<String>`.
+
+#### Defined Analyzers
+
+The `text:defineAnalyzers` feature allows extending the [Multilingual Support](#multilingual-support)
+defined above. Further, this feature can also be used to name analyzers defined via `text:GenericAnalyzer`
+so that a single (perhaps complex) analyzer configuration can be used in several places.
+
+The `text:defineAnalyzers` is used with `text:TextIndexLucene` to provide a list of analyzer
+definitions:
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory <file:Lucene> ;
+ text:entityMap <#entMap> ;
+ text:defineAnalyzers (
+ [ text:addLang "sa-x-iast" ;
+ text:analyzer [ . . . ] ]
+ [ text:defineAnalyzer <#foo> ;
+ text:analyzer [ . . . ] ]
+ )
+ .
+
+References to a defined analyzer may be made in the entity map like:
+
+ text:analyzer [
+ a text:DefinedAnalyzer ;
+ text:useAnalyzer <#foo> ]
+
+##### Extending multilingual support
+
+The [Multilingual Support](#multilingual-support) described above allows a limited set of
+ISO 2-letter codes to select from among built-in analyzers, using the nullary constructor
+associated with each analyzer. To use:
+
+* a language not included, e.g., Brazilian; or
+* additional constructors defining stop words, stem exclusions and so on; or
+* custom analyzers associated with generalized BCP-47 language tags,
+such as `sa-x-iast` for Sanskrit in the IAST transliteration,
+
+add the desired analyzers to the multilingual support with `text:defineAnalyzers` and `text:addLang`,
+so that fields with the appropriate language tags use the appropriate custom analyzer.
+
+When `text:defineAnalyzers` is used with `text:addLang`, `text:multilingualSupport` is implicitly
+enabled if not already specified, and a warning is written to the log. For example:
+
+ text:defineAnalyzers (
+ [ text:addLang "sa-x-iast" ;
+ text:analyzer [ . . . ] ]
+
+This adds an analyzer to be used when the `text:langField` has the value `sa-x-iast` during indexing
+and search.
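+
+For the first case in the list above, a definition might look like the following sketch. The language
+tag `pt-BR` and the stop word list are assumptions chosen purely for illustration;
+`org.apache.lucene.analysis.br.BrazilianAnalyzer` is the Brazilian Portuguese analyzer from Lucene's
+common analyzers module:
+
+    text:defineAnalyzers (
+        # Assumed language tag; use whatever tag the literals in the data carry
+        [ text:addLang "pt-BR" ;
+          text:analyzer [
+              a text:GenericAnalyzer ;
+              text:class "org.apache.lucene.analysis.br.BrazilianAnalyzer" ;
+              text:params (
+                  # Illustrative stop words only, not a recommended list
+                  [ text:paramName "stopwords" ;
+                    text:paramType text:TypeSet ;
+                    text:paramValue ("de" "a" "o" "que") ]
+              )
+          ] ]
+    )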
+
+##### Naming analyzers for later use
+
+Repeating a `text:GenericAnalyzer` specification for use with multiple fields in an entity map
+may be cumbersome. The `text:defineAnalyzer` property is used in an element of a `text:defineAnalyzers` list
+to associate a resource with an analyzer so that it may be referred to later in a `text:analyzer`
+object. Assuming that an analyzer definition such as the following has appeared in the
+`text:defineAnalyzers` list:
+
+ [ text:defineAnalyzer <#foo> ;
+ text:analyzer [ . . . ] ]
+
+then in a `text:analyzer` specification in an entity map, for example, a reference to analyzer `<#foo>`
+is made via:
+
+ text:map (
+ [ text:field "text" ;
+ text:predicate rdfs:label;
+ text:analyzer [
+ a text:DefinedAnalyzer ;
+ text:useAnalyzer <#foo> ]
+
+This makes it straightforward to refer to the same (possibly complex) analyzer definition in multiple fields.
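+
+As a fuller sketch, the fragments above can be combined as follows. The analyzer name `<#enAnalyzer>`,
+the field names `label` and `comment`, and the stop word list are illustrative assumptions; the `text:`
+and `rdfs:` prefixes are assumed to be declared elsewhere in the assembler file:
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        text:defineAnalyzers (
+            # Hypothetical name <#enAnalyzer> for an EnglishAnalyzer with custom stop words
+            [ text:defineAnalyzer <#enAnalyzer> ;
+              text:analyzer [
+                  a text:GenericAnalyzer ;
+                  text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
+                  text:params (
+                      [ text:paramName "stopwords" ;
+                        text:paramType text:TypeSet ;
+                        text:paramValue ("the" "a" "an") ]
+                  )
+              ] ]
+        )
+        .
+
+    <#entMap> a text:EntityMap ;
+        text:entityField "uri" ;
+        text:defaultField "label" ;
+        text:map (
+            # Both fields refer to the single analyzer definition above
+            [ text:field "label" ;
+              text:predicate rdfs:label ;
+              text:analyzer [ a text:DefinedAnalyzer ; text:useAnalyzer <#enAnalyzer> ] ]
+            [ text:field "comment" ;
+              text:predicate rdfs:comment ;
+              text:analyzer [ a text:DefinedAnalyzer ; text:useAnalyzer <#enAnalyzer> ] ]
+        )
+        .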
+
+### Storing Literal Values
+
+New in Jena 3.0.0.
+
+It is possible to configure the text index to store enough information
+to recover the original indexed literal values at query time.
+This is controlled by two configuration options. First, the `text:storeValues` property
+must be set to `true` for the text index:
+
+ <#indexLucene> a text:TextIndexLucene ;
+ text:directory "mem" ;
+ text:storeValues true;
+ .
+
+Or, using Java code, use the `setValueStored` method of `TextIndexConfig`:
+
+ TextIndexConfig config = new TextIndexConfig(def);
+ config.setValueStored(true);
+
+Additionally, setting the `langField` configuration option is recommended. See
+[Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
+for details. Without the `langField` setting, the stored literals will not have
+language tag or datatype information.
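+
+Putting the two options together, a configuration might look like the following sketch. The field
+names `uri`, `text` and `lang` and the single `rdfs:label` mapping are placeholder assumptions rather
+than required values:
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory "mem" ;
+        text:storeValues true ;
+        text:entityMap <#entMap> ;
+        .
+
+    <#entMap> a text:EntityMap ;
+        text:entityField  "uri" ;
+        text:defaultField "text" ;
+        # Without a langField, stored literals lose their language tag and datatype
+        text:langField    "lang" ;
+        text:map (
+            [ text:field "text" ; text:predicate rdfs:label ]
+        )
+        .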
+
+At query time, the stored literals can be accessed by using a 3-element list
+of variables as the subject of the `text:query` property function. The literal
+value will be bound to the third variable:
+
+ (?s ?score ?literal) text:query 'word'
+
+## Working with Fuseki
+
+The Fuseki configuration simply points to the text dataset as the
+`fuseki:dataset` of the service.
+
+ <#service_text_tdb> rdf:type fuseki:Service ;
+ rdfs:label "TDB/text service" ;
+ fuseki:name "ds" ;
+ fuseki:serviceQuery "query" ;
+ fuseki:serviceQuery "sparql" ;
+ fuseki:serviceUpdate "update" ;
+ fuseki:serviceUpload "upload" ;
+ fuseki:serviceReadGraphStore "get" ;
+ fuseki:serviceReadWriteGraphStore "data" ;
+ fuseki:dataset :text_dataset ;
+ .
+
+## Building a Text Index
+
+When working at scale, or when preparing a published, read-only, SPARQL
+service, creating the index by loading the text dataset is impractical.
+The index and the dataset can be built using command line tools in two
+steps: first load the RDF data, second create an index from the existing
+RDF dataset.
+
+### Step 1 - Building a TDB dataset
+
+**Note:** If you have an existing TDB dataset then you can skip this step.
+
+Build the TDB dataset:
+
+ java -cp $FUSEKI_HOME/fuseki-server.jar tdb.tdbloader --tdb=assembler_file data_file
+
+using the copy of TDB included with Fuseki.
+
+Alternatively, use one of the
+[TDB utilities](../tdb/commands.html) `tdbloader` or `tdbloader2` which are better for bulk loading:
+
+ $JENA_HOME/bin/tdbloader --loc=directory data_file
+
+### Step 2 - Build the Text Index
+
+You can then build the text index with the `jena.textindexer` tool:
+
+ java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --desc=assembler_file
+
+Because a Fuseki assembler description can have several dataset descriptions
+and several text indexes, it may be necessary to extract a single dataset and index description
+into a separate assembler file for use in loading.
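+
+A minimal standalone assembler file for this purpose might look like the following sketch. The
+resource names, the `DB` and `Lucene` locations and the single `rdfs:label` mapping are placeholders
+to be replaced with the values from the Fuseki configuration. Depending on the Jena version, any
+initialization declarations used in the full configuration (such as `ja:loadClass` statements) may
+also need to be copied across:
+
+    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
+    @prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
+    @prefix text: <http://jena.apache.org/text#> .
+
+    # The one text dataset to be indexed
+    <#text_dataset> a text:TextDataset ;
+        text:dataset <#dataset> ;
+        text:index   <#indexLucene> ;
+        .
+
+    # The existing TDB dataset built in Step 1 (placeholder location)
+    <#dataset> a tdb:DatasetTDB ;
+        tdb:location "DB" ;
+        .
+
+    # The index to be built in Step 2 (placeholder directory)
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        .
+
+    <#entMap> a text:EntityMap ;
+        text:entityField  "uri" ;
+        text:defaultField "text" ;
+        text:map (
+            [ text:field "text" ; text:predicate rdfs:label ]
+        )
+        .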
+
+#### Updating the index
+
+If you allow updates to the dataset through Fuseki, the configured index
+will automatically be updated on every modification. This means that you
+do not have to run `jena.textindexer` after updates; it is needed
+only when you want to rebuild the index from scratch.
+
+# Configuring Alternative TextDocProducers
+
+The default behaviour when text indexing is to index a single
+property as a single field, generating a different `Document`
+for each indexed triple. To change this behaviour requires
+writing and configuring an alternative `TextDocProducer`.
+
+To configure a `TextDocProducer`, say `net.code.MyProducer`, in a dataset assembly,
+use the property `textDocProducer`, e.g.:
+
+ <#ds-with-lucene> rdf:type text:TextDataset;
+ text:index <#indexLucene> ;
+ text:dataset <#ds> ;
+ text:textDocProducer <java:net.code.MyProducer> ;
+ .
+
+where `net.code.MyProducer` stands for the fully qualified Java class name. The class must have either
+a single-argument constructor of type `TextIndex`, or a two-argument
+constructor `(DatasetGraph, TextIndex)`. The `TextIndex` argument
+will be the configured text index, and the `DatasetGraph` argument
+will be the graph of the configured dataset.
+
+For example, to explicitly create the default `TextDocProducer` use:
+
+ ...
+ text:textDocProducer <java:org.apache.jena.query.text.TextDocProducerTriples> ;
+ ...
+
+`TextDocProducerTriples` produces a new document for each subject/field
+added to the dataset, using `TextIndex.addEntity(Entity)`.
+
+## Example
+
+The example class below is a `TextDocProducer` that only indexes
+`ADD`s of quads for which the subject already had at least one
+value for the same property. It uses the two-argument constructor to give it
+access to the dataset so that it can count the `(?G, S, P, ?O)` quads
+with that subject and predicate, and delegates the indexing to
+`TextDocProducerTriples` if there are at least two values for
+that property (one of those values, of course, is the one that
+gives rise to this `change()`).
+
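+    import java.util.Iterator;
+
+    import org.apache.jena.graph.Node;
+    import org.apache.jena.query.text.TextDocProducerTriples;
+    import org.apache.jena.query.text.TextIndex;
+    import org.apache.jena.sparql.core.DatasetGraph;
+    import org.apache.jena.sparql.core.Quad;
+    import org.apache.jena.sparql.core.QuadAction;
+
+    public class Example extends TextDocProducerTriples {
+
+        final DatasetGraph dg;
+
+        public Example(DatasetGraph dg, TextIndex indexer) {
+            super(indexer);
+            this.dg = dg;
+        }
+
+        // Index only ADDs whose subject already has another value for the property.
+        @Override
+        public void change(QuadAction qaction, Node g, Node s, Node p, Node o) {
+            if (qaction == QuadAction.ADD) {
+                if (alreadyHasOne(s, p)) super.change(qaction, g, s, p, o);
+            }
+        }
+
+        // Count the (?G, S, P, ?O) quads; the quad being added is one of them.
+        private boolean alreadyHasOne(Node s, Node p) {
+            int count = 0;
+            Iterator<Quad> quads = dg.find( null, s, p, null );
+            while (quads.hasNext()) { quads.next(); count += 1; }
+            return count > 1;
+        }
+    }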
+
+
+## Maven Dependency
+
+The <code>jena-text</code> module is included in Fuseki. To use it within application code,
+add the following Maven dependency:
+
+ <dependency>
+ <groupId>org.apache.jena</groupId>
+ <artifactId>jena-text</artifactId>
+ <version>X.Y.Z</version>
+ </dependency>
+
+adjusting the version <code>X.Y.Z</code> as necessary. This will automatically
+include a compatible version of Lucene.
+
+For the Elasticsearch implementation, include the following Maven dependency:
+
+ <dependency>
+ <groupId>org.apache.jena</groupId>
+ <artifactId>jena-text-es</artifactId>
+ <version>X.Y.Z</version>
+ </dependency>
+
+adjusting the version <code>X.Y.Z</code> as necessary.
\ No newline at end of file