Posted to commits@jena.apache.org by an...@apache.org on 2017/06/29 08:04:25 UTC
svn commit: r1800246 - in /jena/site/trunk/content/documentation/query: text-query-new.mdtext text-query.mdtext

Author: andy
Date: Thu Jun 29 08:04:25 2017
New Revision: 1800246

URL: http://svn.apache.org/viewvc?rev=1800246&view=rev
Log:
JENA-1326: Revised text search documentation

Added:
    jena/site/trunk/content/documentation/query/text-query.mdtext
      - copied, changed from r1800245, jena/site/trunk/content/documentation/query/text-query-new.mdtext
Removed:
    jena/site/trunk/content/documentation/query/text-query-new.mdtext

Copied: jena/site/trunk/content/documentation/query/text-query.mdtext (from r1800245, jena/site/trunk/content/documentation/query/text-query-new.mdtext)
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/query/text-query.mdtext?p2=jena/site/trunk/content/documentation/query/text-query.mdtext&p1=jena/site/trunk/content/documentation/query/text-query-new.mdtext&r1=1800245&r2=1800246&rev=1800246&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/query/text-query-new.mdtext (original)
+++ jena/site/trunk/content/documentation/query/text-query.mdtext Thu Jun 29 08:04:25 2017
@@ -1,12 +1,17 @@
 Title: Jena Full Text Search
 
-This extension to ARQ combines SPARQL and full text search via [Lucene](https://lucene.apache.org) 6.4.1 or 
-[ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on Lucene). It gives applications the ability 
-to perform indexed full text searches within SPARQL queries.
-
-Recall that SPARQL allows the use of [regex](https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#func-regex) 
-in `FILTER`s; however, such use _is not indexed_. For example, if you're searching for occurrences of `"printer"` in
-the `rdfs:label` of a bunch of products:
+This extension to ARQ combines SPARQL and full text search via
+[Lucene](https://lucene.apache.org) 6.4.1 or
+[ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
+Lucene). It gives applications the ability to perform indexed full text
+searches within SPARQL queries.
+
+SPARQL allows the use of 
+[regex](https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#func-regex) 
+in `FILTER`s. A `FILTER` is a test on a value retrieved earlier in the query,
+so its use _is not indexed_. For example, if you're
+searching for occurrences of `"printer"` in the `rdfs:label` of a bunch
+of products:
 
     PREFIX   ex: <http://www.example.org/resources#>
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
@@ -18,15 +23,19 @@ the `rdfs:label` of a bunch of products:
     	FILTER regex(?lbl, "printer", "i")
     }
 
-then the search will need to examine _all_ selected `rdfs:label` statements and apply the regular expression 
-to each label in turn. If there are many such statements and many such uses of `regex`, then it may be appropriate 
-to consider using this extension to take advantage of the performance potential of full text indexing.
-
-Text indexes provide additional information for accessing the RDF graph by allowing the application to have _indexed 
-access_ to the internal structure of string literals rather than treating such literals as opaque items. 
-Assuming appropriate [configuration](#configuration), the above query can use full text search via the 
-[ARQ property function extension](https://jena.apache.org/documentation/query/extension.html#property-functions), 
-`text:query`:
+then the search will need to examine _all_ selected `rdfs:label`
+statements and apply the regular expression to each label in turn. If
+there are many such statements and many such uses of `regex`, then it
+may be appropriate to consider using this extension to take advantage of
+the performance potential of full text indexing.
+
+Text indexes provide additional information for accessing the RDF graph
+by allowing the application to have _indexed access_ to the internal
+structure of string literals rather than treating such literals as
+opaque items.  Unlike `FILTER`, an index can set the values of variables.
+Assuming appropriate [configuration](#configuration), the
+above query can use full text search via the
+[ARQ property function extension](https://jena.apache.org/documentation/query/extension.html#property-functions), `text:query`:
 
     PREFIX   ex: <http://www.example.org/resources#>
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
@@ -39,19 +48,18 @@ Assuming appropriate [configuration](#co
     	   rdfs:label ?lbl
     }
 
-This query makes a text query for `'printer'` on the `rdfs:label` property; and then looks in the RDF data and retrieves 
-the complete label for each match.
+This query makes a text query for `'printer'` on the `rdfs:label`
+property; and then looks in the RDF data and retrieves the complete
+label for each match.
+
+The full text engine can be either [Apache
+Lucene](http://lucene.apache.org/core) hosted with Jena on a single
+machine, or [Elasticsearch](https://www.elastic.co/) for a large scale
+enterprise search application where the full text engine is potentially
+distributed across separate machines.
 
-The full text engine can be either [Apache Lucene](http://lucene.apache.org/core) hosted with Jena on
-a single machine, or [Elasticsearch](https://www.elastic.co/) for a large scale enterprise search application
-where the full text engine is potentially distributed across separate machines.
-
-This [example code](https://github.com/apache/jena/tree/master/jena-text/src/main/java/examples/) illustrates
-creating an in-memory dataset with a Lucene index.
-
-This module was first released with Jena 2.11.0.
-
-This module is not compatible with the much older LARQ module.
+This [example code](https://github.com/apache/jena/tree/master/jena-text/src/main/java/examples/)
+illustrates creating an in-memory dataset with a Lucene index.
 
 ## Table of Contents
 
@@ -72,23 +80,31 @@ This module is not compatible with the m
 
 ## Architecture
 
-In general, a text index engine (Lucene or Elasticsearch) indexes _documents_ where each document is
-a collection of _fields_, the values of which are indexed so that searches matching contents of specified 
-fields can return a reference to the document containing the fields with matching values.
-
-The basic idea of the Jena text extension is to associate a triple with a document and the _property_ 
-of the triple with a _field_ of a document and the _object_ of the triple (which must be a literal) with 
-the value of the field in the document. The _subject_ of the triple then becomes another field of the 
-document that is returned as the result of a search match to identify what was matched. (NB, the
-particular triple that matched is not identified. Only, its subject.)
-
-In this manner, the text index provides an inverted index that maps query string matches to subject URIs.
-
-A text-indexed dataset is configured with a description of which properties are to be indexed. When triples 
-are added, any properties matching the description cause a document to be added to the index 
-by analyzing the literal value of the triple object and mapping to the subject URI. On the other hand, it is
-necessary to specifically configure the text-indexed dataset to [delete index entries](#entity-map-definition)
-when the corresponding triples are dropped from the RDF store.
+In general, a text index engine (Lucene or Elasticsearch) indexes
+_documents_ where each document is a collection of _fields_, the values
+of which are indexed so that searches matching contents of specified
+fields can return a reference to the document containing the fields with
+matching values.
+
+The basic idea of the Jena text extension is to associate a triple with
+a document and the _property_ of the triple with a _field_ of a document
+and the _object_ of the triple (which must be a literal) with the value
+of the field in the document. The _subject_ of the triple then becomes
+another field of the document that is returned as the result of a search
+match to identify what was matched. (NB, the particular triple that
+matched is not identified, only its subject.)
+
+In this manner, the text index provides an inverted index that maps
+query string matches to subject URIs.
+
+A text-indexed dataset is configured with a description of which
+properties are to be indexed. When triples are added, any properties
+matching the description cause a document to be added to the index by
+analyzing the literal value of the triple object and mapping to the
+subject URI. On the other hand, it is necessary to specifically
+configure the text-indexed dataset to [delete index
+entries](#entity-map-definition) when the corresponding triples are
+dropped from the RDF store.
 
 The text index uses the native query language of the index:
 [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
@@ -97,9 +113,10 @@ or
 
 ### External content
 
-It is also possible that the indexed text is content external to the RDF store with only additional triples 
-(about the indexed text) in the RDF store. The subject URI returned as a search result may then be considered 
-to refer via the indexed property to the external content.
+It is also possible that the indexed text is content external to the RDF
+store with only additional triples (about the indexed text) in the RDF
+store. The subject URI returned as a search result may then be
+considered to refer via the indexed property to the external content.
 
 There is no requirement that the text data indexed is present in the RDF
 data.  As long as the index contains the index text documents to match the
@@ -114,22 +131,18 @@ The maintenance of the index is external
 
 ### External applications
 
-By using Elasticsearch, other applications can share the text index with SPARQL search.
+By using Elasticsearch, other applications can share the text index with
+SPARQL search.
 
 ## Query with SPARQL
 
-The URI of the text extenion property function is `http://jena.apache.org/text#query` more
-conveniently written:
+The URI of the text extension property function is
+`http://jena.apache.org/text#query`, more conveniently written:
 
     PREFIX text: <http://jena.apache.org/text#>
 
     ...   text:query ...
 
-| &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
-|-------------------|--------------------------------|
-| property          | (optional) URI (including prefix name form) |
-| query string      | The native query string        |
-| limit             | (optional) `int` limit on the number of results       |
 
 The following forms are all legal:
 
@@ -146,26 +159,49 @@ The most general form is:
 Only the query string is required, and if it is the only argument the
 surrounding `( )` can be omitted.
 
-The `property` URI is only necessary if multiple properties have been indexed and the property 
-being searched over is not the [default field of the index](#entity-map-definition).
-Also the `property` URI **must not** be used when the `query string` refers explicitly to one or more 
+Input arguments:
+
+| &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
+|-------------------|--------------------------------|
+| property          | (optional) URI (including prefix name form) |
+| query string      | The native query string        |
+| limit             | (optional) `int` limit on the number of results       |
+
+Output arguments:
+
+| &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
+|-------------------|--------------------------------|
+| indexed term      | The indexed RDF term.          |
+| score             | (optional) The score for the match. |
+| hit               | (optional) The literal matched. |
+
+The `property` URI is only necessary if multiple properties have been
+indexed and the property being searched over is not the [default field
+of the index](#entity-map-definition).  Also the `property` URI **must
+not** be used when the `query string` refers explicitly to one or more
 fields.
 
-The results include the subject URI, `?s`; the `?score` assigned by the text search engine;
-and the entire matched `?literal` 
-(if the index has been [configured to store literal values](#text-dataset-assembler)).
+The results include the subject URI, `?s`; the `?score` assigned by the
+text search engine; and the entire matched `?literal` (if the index has
+been [configured to store literal values](#text-dataset-assembler)).
 
 If the `query string` refers to more than one field, e.g.,
 
     "label: printer AND description: \"large capacity cartridge\""
 
-then the `?literal` in the results will not be bound since there is no single field that contains
-the match &ndash; the match is separated over two fields.
+then the `?literal` in the results will not be bound since there is no
+single field that contains the match &ndash; the match is separated over
+two fields.
+
+If an output indexed term is already a known value, either as a constant
+in the query or as a variable already set, then the index lookup becomes a
+check that this is a match for the input arguments.
 
 ### Good practice
 
-The query engine does not have information about the selectivity of the text index and so effective
-query plans cannot be determined programmatically.  It is helpful to be aware of the following two
+The query engine does not have information about the selectivity of the
+text index and so effective query plans cannot be determined
+programmatically.  It is helpful to be aware of the following two
 general query patterns.
 
 #### Query pattern 1 &ndash; Find in the text index and refine results
@@ -300,26 +336,31 @@ used to analyze the query string. If not
 If using Elasticsearch then an index would be configured as follows:
 
     <#indexES> a text:TextIndexES ;
-        text:serverList "127.0.0.1:9300" ; # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
-        text:clusterName "elasticsearch" ; # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
-        text:shards "1" ;                  # The number of shards for the index. Defaults to 1
-        text:replicas "1" ;                # The number of replicas for the index. Defaults to 1
-        text:indexName "jena-text" ;       # Name of the Index. defaults to jena-text
+          # A comma-separated list of Host:Port values of the ElasticSearch Cluster nodes.
+        text:serverList "127.0.0.1:9300" ; 
+          # Name of the ElasticSearch Cluster. If not specified defaults to 'elasticsearch'
+        text:clusterName "elasticsearch" ; 
+          # The number of shards for the index. Defaults to 1
+        text:shards "1" ;
+          # The number of replicas for the index. Defaults to 1
+        text:replicas "1" ;         
+          # Name of the index. Defaults to jena-text
+        text:indexName "jena-text" ;
         text:entityMap <#entMap> ;
         .
 
 and `text:index  <#indexES> ;` would be used in the configuration of `:text_dataset`.
 
-To use a text index assembler configuration in Java code is it necessary to identify the dataset URI to 
-be assembled, such as in:
+To use a text index assembler configuration in Java code it is necessary
+to identify the dataset URI to be assembled, such as in:
 
     Dataset ds = DatasetFactory.assemble(
         "text-config.ttl", 
         "http://localhost/jena_example/#text_dataset") ;
 
-since the assembler contains two dataset definitions, one for
-the text dataset, one for the base data.  Therefore, the application
-needs to identify the text dataset by it's URI
+since the assembler contains two dataset definitions, one for the text
+dataset, one for the base data.  Therefore, the application needs to
+identify the text dataset by its URI
 `http://localhost/jena_example/#text_dataset`.
 
 ### Entity Map definition
@@ -474,10 +515,10 @@ the optional `text:filters` property spe
 
 New in Jena 2.13.0.
 
-There is an ability to specify an analyzer to be used for the
-query string itself.  It will find terms in the query text.  If not set, then the
-analyzer used for the document will be used.  The query analyzer is specified on
-the `TextIndexLucene` resource:
+There is an ability to specify an analyzer to be used for the query
+string itself.  It will find terms in the query text.  If not set, then
+the analyzer used for the document will be used.  The query analyzer is
+specified on the `TextIndexLucene` resource:
 
     <#indexLucene> a text:TextIndexLucene ;
         text:directory <file:Lucene> ;
@@ -495,10 +536,12 @@ It is possible to select a query parser
 
 The available `QueryParser` implementations are:
 
-* `AnalyzingQueryParser`: Performs analysis for wildcard queries . This is useful in combination
-with accent-insensitive wildcard queries.
-* `ComplexPhraseQueryParser`: Permits complex phrase query syntax. Eg: "(john jon jonathan~) peters*".
-This is useful for performing wildcard or fuzzy queries on individual terms in a phrase.
+* `AnalyzingQueryParser`: Performs analysis for wildcard queries. This
+is useful in combination with accent-insensitive wildcard queries.
+
+* `ComplexPhraseQueryParser`: Permits complex phrase query syntax, e.g.,
+"(john jon jonathan~) peters*".  This is useful for performing wildcard
+or fuzzy queries on individual terms in a phrase.
 
 The query parser is specified on
 the `TextIndexLucene` resource:
@@ -508,7 +551,6 @@ the `TextIndexLucene` resource:
         text:entityMap <#entMap> ;
         text:queryParser text:AnalyzingQueryParser .
 
-
 Elasticsearch currently doesn't support Analyzers beyond Standard Analyzer. 
 
 ### Configuration by Code
@@ -531,11 +573,11 @@ purely in-memory setup:
 
 ### Graph-specific Indexing
 
-Starting with version 1.0.1, jena-text supports
-storing information about the source graph into the text index. This allows
-for more efficient text queries when the query targets only a single named
-graph. Without graph-specific indexing, text queries do not distinguish named
-graphs and will always return results from all graphs.
+jena-text supports storing information about the source graph into the
+text index. This allows for more efficient text queries when the query
+targets only a single named graph. Without graph-specific indexing, text
+queries do not distinguish named graphs and will always return results
+from all graphs.
 
 Support for graph-specific indexing is enabled by defining the name of the
 index field to use for storing the graph identifier.
@@ -609,7 +651,8 @@ You can specify a LocalizedAnalyzer in o
 specific analyzers (stemming, stop words,...). Like any other analyzers, it can 
 be done for default text indexing, for each different field or for query.
 
-With an assembler configuration, the `text:language` property needs to be provided, e.g :
+With an assembler configuration, the `text:language` property needs to
+be provided, e.g.:
 
     <#indexLucene> a text:TextIndexLucene ;
         text:directory <file:Lucene> ;
@@ -620,7 +663,8 @@ With an assembler configuration, the `te
         ]
         .
 
-will configure the index to analyze values of the 'text' field using a FrenchAnalyzer.
+will configure the index to analyze values of the 'text' field using a
+FrenchAnalyzer.
 
 To configure the same example via Java code, you need to provide the analyzer to the
 index configuration object:
@@ -638,10 +682,11 @@ localized analyzer.
 
 #### Multilingual Support
 
-Let us suppose that we have many triples with many localized literals in many different 
-languages. It is possible to take all these languages into account for future mixed localized queries.
-Just set the `text:multilingualSupport` property at `true` to automatically enable the localized
-indexing (and also the localized analyzer for query) :
+Let us suppose that we have many triples with many localized literals in
+many different languages. It is possible to take all these languages
+into account for future mixed localized queries.  Just set the
+`text:multilingualSupport` property to `true` to automatically enable
+the localized indexing (and also the localized analyzer for queries):
 
     <#indexLucene> a text:TextIndexLucene ;
         text:directory "mem" ;
@@ -666,32 +711,36 @@ For example, it is possible to refer to
         { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
     }
 
-Hence, the result set of the query will contain "institute" related subjects 
-(institution, institutional,...) in French and in English.
+Hence, the result set of the query will contain "institute" related
+subjects (institution, institutional,...) in French and in English.
 
 **Note**: If the `text:langField` property is not set, the `text:langField` will default to"lang".
 
 ### Generic and Defined Analyzer Support
 
-There are many Analyzers that do not have built-in support, e.g., `BrazilianAnalyzer`; 
-require constructor parameters not otherwise supported, e.g., a stop words `FileReader` or
-a `stemExclusionSet`; or make use of Analyzers not included in the bundled
-Lucene distribution, e.g., a `SanskritIASTAnalyzer`. Two features have been added to enhance
-the utility of jena-text: 1) `text:GenericAnalyzer`; and 2) `text:DefinedAnalyzer`.
+There are many Analyzers that do not have built-in support, e.g.,
+`BrazilianAnalyzer`; require constructor parameters not otherwise
+supported, e.g., a stop words `FileReader` or a `stemExclusionSet`; or
+make use of Analyzers not included in the bundled Lucene distribution,
+e.g., a `SanskritIASTAnalyzer`. Two features have been added to enhance
+the utility of jena-text: 1) `text:GenericAnalyzer`; and 2)
+`text:DefinedAnalyzer`.
 
 #### Generic Analyzer
 
-A `text:GenericAnalyzer` includes a `text:class` which is the fully qualified class name of an
-Analyzer that is accessible on the jena classpath. This is trivial for Analyzer classes that are
-included in the bundled Lucene distribution and for other custom Analyzers a simple matter of
-including a jar containing the custom Analyzer and any associated Tokenizer and Filters on
-the classpath.
-
-In addition to the `text:class` it is generally useful to include an ordered list of `text:params`
-that will be used to select an appropriate constructor of the Analyzer class. If there are no
-`text:params` in the analyzer specification or if the `text:params` is an empty list then the 
-nullary constructor is used to instantiate the analyzer. Each element of the list of `text:params` 
-includes:
+A `text:GenericAnalyzer` includes a `text:class` which is the fully
+qualified class name of an Analyzer that is accessible on the jena
+classpath. This is trivial for Analyzer classes that are included in the
+bundled Lucene distribution and for other custom Analyzers a simple
+matter of including a jar containing the custom Analyzer and any
+associated Tokenizer and Filters on the classpath.
+
+In addition to the `text:class` it is generally useful to include an
+ordered list of `text:params` that will be used to select an appropriate
+constructor of the Analyzer class. If there are no `text:params` in the
+analyzer specification or if the `text:params` is an empty list then the
+nullary constructor is used to instantiate the analyzer. Each element of
+the list of `text:params` includes:
 
 * an optional `text:paramName` of type `Literal` that is useful to identify the purpose of a 
 parameter in the assembler configuration
@@ -1002,4 +1051,4 @@ For Elasticsearch implementation, you ca
       <version>X.Y.Z</version>
     </dependency>
 
-adjusting the version <code>X.Y.Z</code> as necessary.
\ No newline at end of file
+adjusting the version <code>X.Y.Z</code> as necessary.
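
For reference, the input and output arguments tabulated in the revised "Query with SPARQL" section can be combined in a single call. The following is a minimal sketch, not part of the commit. It assumes an index configured for `rdfs:label` that stores literal values, and the limit of 10 is an arbitrary illustrative value:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?score ?literal
{
    # Input arguments (right-hand list): property, query string, limit.
    # Output arguments (left-hand list): indexed term, score, matched literal.
    (?s ?score ?literal) text:query (rdfs:label 'printer' 10) .
}
```

As the revised text notes, `?literal` is bound only if the index has been configured to store literal values, and it is left unbound when the query string spans more than one field.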