You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jena.apache.org by Chris Tomlinson <an...@apache.org> on 2017/11/20 01:51:08 UTC

CMS diff: Jena Full Text Search

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===================================================================
--- trunk/content/documentation/query/text-query.mdtext	(revision 1815762)
+++ trunk/content/documentation/query/text-query.mdtext	(working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,20 @@
 ## Table of Contents
 
 -   [Architecture](#architecture)
+    -   [External content](#external-content)
+    -   [External applications](#external-applications)
+    -   [Document structure](#document-structure)
 -   [Query with SPARQL](#query-with-sparql)
+    -   [Syntax](#syntax)
+        -   [Input arguments](#input-arguments)
+        -   [Output arguments](#output-arguments)
+    -   [Query strings](#query-strings)
+        -   [Simple queries](#simple-queries)
+        -   [Queries with language tags](#queries-with-language-tags)
+        -   [Queries that retrieve literals](#queries-that-retrieve-literals)
+        -   [Queries across multiple `Field`s](#queries-across-multiple-fields)
+        -   [Queries within a `Field`](#queries-within-a-field)
+    -   [Good practice](#good-practice)
 -   [Configuration](#configuration)
     -   [Text Dataset Assembler](#text-dataset-assembler)
     -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -134,6 +149,69 @@
 By using Elasticsearch, other applications can share the text index with
 SPARQL search.
 
+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed 
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the 
+_property_ of the triple must be 
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration, 
+that is the field to search if not otherwise named in the query. In jena-text 
+this field is configured via the `text:defaultField` property which is then mapped 
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition) 
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally 
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+    select ?s
+    where {
+        ?s text:query "some text"
+    }
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+    ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following illustrates a Lucene document that Jena will create and
+request Lucene to index:
+
+    Document<
+        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne> 
+        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode> 
+        stored, indexed, tokenized <label:zorn protégé a prés> 
+        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr> 
+        stored, indexed, tokenized <label_fr:zorn protégé a prés> 
+        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00> 
+        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode> 
+        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr> 
+        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode> 
+        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
+        >
+
+It may be instructive to refer back to this example when considering the various
+points below.
+
 ## Query with SPARQL
 
 The URI of the text extension property function is
@@ -143,63 +221,248 @@
 
     ...   text:query ...
 
+### Syntax
 
 The following forms are all legal:
 
-    ?s text:query 'word'                   # query
-    ?s text:query (rdfs:label 'word')      # query specific property if multiple
-    ?s text:query ('word' 10)              # with limit on results
-    (?s ?score) text:query 'word'          # query capturing also the score
-    (?s ?score ?literal) text:query 'word' # ... and original literal value
+    ?s text:query 'word'                              # query
+    ?s text:query ('word' 10)                         # with limit on results
+    ?s text:query (rdfs:label 'word')                 # query specific property if multiple
+    ?s text:query (rdfs:label 'protégé' 'lang:fr')    # restrict search to French
+    (?s ?score) text:query 'word'                     # query capturing also the score
+    (?s ?score ?literal) text:query 'word'            # ... and original literal value
     
 The most general form is:
    
-     (?s ?score ?literal) text:query (property 'query string' limit)
+     (?s ?score ?literal) text:query (property 'query string' limit 'lang:xx')
 
-Only the query string is required, and if it is the only argument the
-surrounding `( )` can be omitted.
+#### Input arguments:
 
-Input arguments:
-
 | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
 |-------------------|--------------------------------|
 | property          | (optional) URI (including prefix name form) |
-| query string      | The native query string        |
+| query string      | Lucene query string fragment       |
 | limit             | (optional) `int` limit on the number of results       |
+| lang:xx           | (optional) language tag spec       |
 
-Output arguments:
+The `property` URI is only necessary if multiple properties have been
+indexed and the property being searched over is not the [default field
+of the index](#entity-map-definition).  Also the `property` URI **must
+not** be used when the `query string` refers explicitly to one or more
+fields. The optional `limit` indicates the maximum hits to be returned by Lucene.
 
+The `lang:xx` specification is an optional string, where _xx_ is 
+a BCP-47 language tag. This restricts searches to field values that were originally 
+indexed with the tag _xx_. Searches may be restricted to field values with no 
+language tag via `"lang:none"`. The use of the `lang:xx` is only effective if 
+[multilingual support](#linguistic-support-with-lucene-index) has been configured.
+Further, if the `lang:xx` is used then the `property` URI must be supplied
+in order for searches to work.
+
+If both `limit` and `lang:xx` are present, then `limit` must precede
+`lang:xx`.
+
+If only the query string is required, the surrounding `( )` can be omitted.
+
+#### Output arguments:
+
 | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
 |-------------------|--------------------------------|
-| indexed term      | The indexed RDF term.          |
+| subject URI       | The subject of the indexed RDF triple.          |
 | score             | (optional) The score for the match. |
-| hit               | (optional) The literal matched. |
+| literal           | (optional) The matched object literal. |
 
-The `property` URI is only necessary if multiple properties have been
-indexed and the property being searched over is not the [default field
-of the index](#entity-map-definition).  Also the `property` URI **must
-not** be used when the `query string` refers explicitly to one or more
+The results include the _subject URI_; the _score_ assigned by the
+text search engine; and the entire matched _literal_ (if the index has
+been [configured to store literal values](#text-dataset-assembler)).
+The _subject URI_ may be a variable, e.g., `?s`, or a _URI_. In the
+latter case the search is restricted to triples with the specified
+subject. The _score_ and the _literal_ **must** be variables.
+
+If only the _subject_ variable, `?s`, or specific _`URI`_ is needed 
+then it must be written without surrounding `( )`; otherwise, an error 
+is signalled.
+
+### Query strings
+
+There are several points that need to be considered when formulating
+SPARQL queries using the Lucene interface.
+
+#### Simple queries
+
+The simplest use of the jena-text Lucene integration is:
+
+    ?s text:query "some phrase"
+
+This will bind `?s` to each entity URI that is the subject of a triple
+that has the default property and an object literal that matches
+the argument string, e.g.:
+
+    ex:AnEntity skos:prefLabel "this is some phrase to match"
+
+This query form will indicate the subjects that have literals that match
+without providing any information about the specific literals that matched.
+If this use case is sufficient for your needs you can skip on to the 
+[sections on configuration](#configuration).
+
+#### Queries with language tags
+
+When working with `rdf:langString`s It may be tempting to write:
+
+    ?s text:query "protégé"@fr
+
+However, the above will silently fail to return results since the
+`query string` must be a simple `xsd:string` not an `rdf:langString`.
+
+The effective form of the above query is expressed:
+
+    ?s text:query (skos:prefLabel "protégé" 'lang:fr')
+
+if the intent is to search only labels with French content.
+
+Even if the default _property_ is `skos:prefLabel` it is necessary
+to use the above form rather than omitting the `property` argument
+when restricting the Lucene search to a specific `lang:xx`; otherwise,
+again there will be no results.
+
+If all one is interested in are _subjects_ with `skos:prefLabel` where
+that is the `text:predicate` of the `text:defaultField` and without regard 
+for specified `lang:xx`s then:
+
+    ?s text:query "protégé"
+
+will do the job.
+
+For a non-default `Field` with no language restriction, the patterns:
+
+    ?s text:query (rdfs:label "protégé")
+
+or
+
+    ?s text:query "rdfsLabel:protégé"
+
+may be used (see [below](#entity-map-definition) for how RDF _property_ names 
+are mapped to Lucene `Field` names). However, as mentioned earlier,
+
+    ?s text:query ("rdfsLabel:protégé" "lang:fr")
+
+will result in an error owing to the way in which the jena-text composes the
+query string to Lucene in the presence of the `"lang:fr"` argument.
+
+#### Queries that retrieve literals
+
+It is possible to retrieve the *literal*s that Lucene finds matches for
+assuming that
+
+    <#TextIndex#> text:storeValues true ;
+
+has been specified in the `TextIndex` configuration. So
+
+    (?s ?sc ?lit) text:query (rdfs:label "protégé")
+
+will bind the matching literals to `?lit`, e.g.,
+
+    "zorn protégé a prés"@fr
+
+However, it is important to note that the apparently equivalent form:
+
+    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
+
+will fail to produce a binding for `?lit` even though `?s` and `?sc` are
+bound as expected.
+
+So if the _literal_ matches are needed you **must use** the query arguments that
+list the _property_ explicitly, except in the simple case of a query against
+the default `Field`/_property_.
+
+#### Queries across multiple `Field`s
+
+It has been mentioned earlier that the text index uses the
+[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
+however, there are important constraints on how the Lucene query language is used within jena-text.
+This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
+features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
+reflect the fact that each triple is a separate document.
+
+This latter observation is important when considering queries that are intended to involve several
 fields.
 
-The results include the subject URI, `?s`; the `?score` assigned by the
-text search engine; and the entire matched `?literal` (if the index has
-been [configured to store literal values](#text-dataset-assembler)).
+For example, consider the following triples:
 
-If the `query string` refers to more than one field, e.g.,
+    ex:SomePrinter 
+        rdfs:label     "laser printer" ;
+        ex:description "includes a large capacity cartridge" .
 
-    "label: printer AND description: \"large capacity cartridge\""
+ assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
+ with the following query:
 
-then the `?literal` in the results will not be bound since there is no
-single field that contains the match &ndash; the match is separated over
-two fields.
+    ?s text:query "label:printer AND description:\"large capacity cartridge\""
 
-If an output indexed term is already a known value, either as a constant
-in the query or variable already set, then the index lookup becomes a
-check that this is a match for the input arguments.
+However, this query will fail to find the expected results since the `AND` is interpreted
+by Lucene to indicate that all documents that contain a matching `label` field _and_
+a matching `description` field are to be returned. Yet from the discussion above
+regarding the [structure of Lucene documents in jena-text](#document-structure) it
+is evident that there is not one but rather in fact two separate documents one with a 
+`label` field and one with a `description` field so an effective query is:
 
+    ?s text:query "label:printer" .
+    ?s text:query "description:\"large capacity cartridge\"" .
+
+which leads to `?s` being bound to `ex:SomePrinter`.
+
+In other words when a query is to involve two or more _properties_/`Field`s then it
+expressed at the SPARQL level, as it were, versus in Lucene's query language.
+
+It is worth noting that:
+
+    ?s text:query "label:printer OR description:\"large capacity cartridge\""
+
+works simply because Lucene is required to do nothing more than a _union_ of
+matching documents the same as if written:
+
+    { ?s text:query "label:printer" . }
+    union
+    { ?s text:query "description:\"large capacity cartridge\"" . }
+
+Suppose the matching literals are required for the above then it should be clear
+from the above that:
+
+    (?s ?sc1 ?lit1) text:query (skos:prefLabel "printer") .
+    (?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") .
+
+will be the appropriate form to retrieve the _subject_ and the associated literals.
+
+There is no loss of expressiveness of the Lucene query language versus the jena-text
+integration of Lucene. Any cross-field `AND`s are replaced by concurrent SPARQL calls to
+text:query as illustrated above and uses of Lucene `OR` can be converted to SPARQL 
+`union`s. Uses of Lucene `NOT` are converted to appropriate SPARQL `filter`s.
+
+#### Queries within a `Field`
+
+On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+are all available to be used for searches within a `Field`. For example:
+
+    ?s text:query "description:(large AND cartridge)"
+
+and
+
+    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
+
+will work as expected.
+
+The key is to always surround the field query with `( )`s.
+
+
 ### Good practice
 
-The query engine does not have information about the selectivity of the
+From the above it should be clear that best practice, except in the simplest cases
+is to use explicit `text:query` forms such as:
+
+    (?s ?sc ?lit) text:query (ex:someProperty "a single Field query")
+
+possibly with _limit_ and `lang:xx` arguments.
+
+Further, the query engine does not have information about the selectivity of the
 text index and so effective query plans cannot be determined
 programmatically.  It is helpful to be aware of the following two
 general query patterns.
@@ -394,7 +657,7 @@
 is returned on a match. The value of the property is arbitrary so long as it is unique among the
 defined names.
 
-#### Automatic document deletion
+#### UID Field and automatic document deletion
 
 When the `text:uidField` is defined in the `EntityMap` then dropping a triple will result in the 
 corresponding document, if any, being deleted from the text index. The value, `"uid"`, is arbitrary 
@@ -632,7 +895,7 @@
  
 #### SPARQL Linguistic Clause Forms
 
-Once the `langField` is set, you can use it directly inside SPARQL queries, for that the `'lang:xx'`
+Once the `langField` is set, you can use it directly inside SPARQL queries, for that the `lang:xx`
 argument allows you to target specific localized values. For example:
 
     //target english literals
@@ -714,7 +977,7 @@
 Hence, the result set of the query will contain "institute" related
 subjects (institution, institutional,...) in French and in English.
 
-**Note**: If the `text:langField` property is not set, the `text:langField` will default to"lang".
+**Note**: If the `text:langField` property is not set, the `text:langField` will default to "lang".
 
 ### Generic and Defined Analyzer Support
 


Re: CMS diff: Jena Full Text Search

Posted by Chris Tomlinson <ch...@gmail.com>.
Hi All,

I’ve completed my proposed updates to the Jena text-query documentation. The documentation corresponds to 3.6.0-SNAPSHOT. I’ve noted several instances where the current behavior may be considered an issue that will be corrected in a future release. I’ve separately created issues for these: JENA-1437 <https://issues.apache.org/jira/browse/JENA-1437>, JENA-1438 <https://issues.apache.org/jira/browse/JENA-1438>, and JENA-1439 <https://issues.apache.org/jira/browse/JENA-1439>.

Thank you,
Chris


> On Nov 22, 2017, at 8:25 AM, Chris Tomlinson <ch...@gmail.com> wrote:
> 
> Hi Andy and Osma,
> 
> I posted JENA-1426 <https://issues.apache.org/jira/browse/JENA-1426> since the “improve this page” facility didn’t seem to offer any way to add a commit message or more extensive explanation of the reasons for the proposed edits and they were somewhat extensive. So raising an issue seemed a way to proceed; however, after several days with no comments I thought perhaps I should follow the published protocol and I made the update as guest on the CMS.
> 
> I had several motivations regarding updating the documentation: 1) I wanted to present how the current implementation functions in a way that might be more useful to users - for example clarifying what can be expected to work and what not in terms of using the native Lucene query language, e.g., JENA-1388 <https://issues.apache.org/jira/browse/JENA-1388>; 2) identify areas that might indicate perhaps unintended aspects of the current implementation; and 3) understand the code in preparation for developing a proposal for adding jena-text highlighting support <http://apache.markmail.org/message/rlzzdd3yw7a7aoqc?q=jena-text+highlighting+support>.
> 
> Based on Osma’s feedback I will be opening a few issues on JIRA and making corrections to the original submission. I assume that updates should just be made as further commits.
> 
> Thanks,
> Chris
> 
> 
> 
>> On Nov 22, 2017, at 6:41 AM, Andy Seaborne <andy@apache.org <ma...@apache.org>> wrote:
>> 
>> How is this related to JENA-1426?
>> 
>>    Andy
>> 
>> On 21/11/17 14:48, Osma Suominen wrote:
>>> ajs6f kirjoitti 20.11.2017 klo 18:36:
>>>> Osma (or anyone else who knows text indexing better than do I, which wouldn't take much)-- could you review this? It's got some great useful detail about how the indexing works and can be used.
>>> Sure, will do.
>>> Comments about specific sections below. Generally this is a very good contribution to the jena-text documentation, which has stagnated a bit.
>>>>> +The following illustrates a Lucene document that Jena will create and
>>>>> +request Lucene to index:
>>>>> +
>>>>> +    Document<
>>>>> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne <http://example.org/SomeOne>>
>>>>> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
>>>>> +        stored, indexed, tokenized <label:zorn protégé a prés>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>>> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>>>> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>>> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>>>>> +        >
>>>>> +
>>>>> +It may be instructive to refer back to this example when considering the various
>>>>> +points below.
>>> Not sure if this is a perfect illustration. The level of detail is rather excessive. I know Lucene quite well and I still struggle to understand what's going on here. Is there another way of presenting this information, for example just a key-value list that shows the field values that get stored in the document? I think the field options stored, indexed, tokenized, omitNorms etc. are unnecessary here or at least should not be so prominent.
>>>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>>>> +a BCP-47 language tag. This restricts searches to field values that were originally
>>>>> +indexed with the tag _xx_. Searches may be restricted to field values with no
>>>>> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
>>>>> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.
>>> The last sentence is not true. You can restrict by language even without enabling multilingual support, as long as langField has been set.
>>>>> +Further, if the `lang:xx` is used then the `property` URI must be supplied
>>>>> +in order for searches to work.
>>> Not true. The default property should be used if no property was specified.
>>>>> +When working with `rdf:langString`s It may be tempting to write:
>>>>> +
>>>>> +    ?s text:query "protégé"@fr
>>>>> +
>>>>> +However, the above will silently fail to return results since the
>>>>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
>>> This could be considered a bug - at least it shouldn't fail silently.
>>>>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>>>>> +to use the above form rather than omitting the `property` argument
>>>>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>>>>> +again there will be no results.
>>> Again, not true. I just tested this query against YSO:
>>>  ?s text:query ("cat" "lang:en")
>>> and it gave a single result, as expected.
>>>>> +For a non-default `Field` with no language restriction, the patterns:
>>>>> +
>>>>> +    ?s text:query (rdfs:label "protégé")
>>>>> +
>>>>> +or
>>>>> +
>>>>> +    ?s text:query "rdfsLabel:protégé"
>>>>> +
>>>>> +may be used (see [below](#entity-map-definition) for how RDF _property_ names
>>>>> +are mapped to Lucene `Field` names). 
>>> I wouldn't recommend using a query form like "rdfsLabel:protégé" in the documentation at all. It violates the layered architecture of jena-text - the query should not be targeting named fields. If you want to target rdfs:label, use the first form.
>>>>> However, as mentioned earlier,
>>>>> +
>>>>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>>>>> +
>>>>> +will result in an error owing to the way in which the jena-text composes the
>>>>> +query string to Lucene in the presence of the `"lang:fr"` argument.
>>> Don't do that then. Remove this section. (see previous comment)
>>>>> +However, it is important to note that the apparently equivalent form:
>>>>> +
>>>>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>>>>> +
>>>>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
>>>>> +bound as expected.
>>> Again, don't do that. Use (rdfs:label "protégé") instead and let jena-text handle the translation from property to Lucene field.
>>>>> +So if the _literal_ matches are needed you **must use** the query arguments that
>>>>> +list the _property_ explicitly, except in the simple case of a query against
>>>>> +the default `Field`/_property_.
>>> Exactly. And those are the only supported query forms anyway.
>>>>> +#### Queries across multiple `Field`s
>>>>> +
>>>>> +It has been mentioned earlier that the text index uses the
>>>>> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description <http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>); 
>>>>> +however, there are important constraints on how the Lucene query language is used within jena-text.
>>>>> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
>>>>> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
>>>>> +reflect the fact that each triple is a separate document.
>>>>> +
>>>>> +This latter observation is important when considering queries that are intended to involve several
>>>>> fields.
>>>>> 
>>>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>>>> -text search engine; and the entire matched `?literal` (if the index has
>>>>> -been [configured to store literal values](#text-dataset-assembler)).
>>>>> +For example, consider the following triples:
>>>>> 
>>>>> -If the `query string` refers to more than one field, e.g.,
>>>>> +    ex:SomePrinter
>>>>> +        rdfs:label     "laser printer" ;
>>>>> +        ex:description "includes a large capacity cartridge" .
>>>>> 
>>>>> -    "label: printer AND description: \"large capacity cartridge\""
>>>>> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
>>>>> + with the following query:
>>>>> 
>>>>> -then the `?literal` in the results will not be bound since there is no
>>>>> -single field that contains the match &ndash; the match is separated over
>>>>> -two fields.
>>>>> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>>>>> 
>>>>> -If an output indexed term is already a known value, either as a constant
>>>>> -in the query or variable already set, then the index lookup becomes a
>>>>> -check that this is a match for the input arguments.
>>>>> +However, this query will fail to find the expected results since the `AND` is interpreted
>>>>> +by Lucene to indicate that all documents that contain a matching `label` field _and_
>>>>> +a matching `description` field are to be returned. Yet from the discussion above
>>>>> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
>>>>> +is evident that there is not one but rather in fact two separate documents one with a
>>>>> +`label` field and one with a `description` field so an effective query is:
>>>>> 
>>>>> +    ?s text:query "label:printer" .
>>>>> +    ?s text:query "description:\"large capacity cartridge\"" .
>>>>> +
>>>>> +which leads to `?s` being bound to `ex:SomePrinter`.
>>> Again this should instead be written as
>>>     ?s text:query (rdfs:label "printer") .
>>>     ?s text:query (ex:description "large capacity cartridge") .
>>>>> +In other words when a query is to involve two or more _properties_/`Field`s then it
>>>>> +expressed at the SPARQL level, as it were, versus in Lucene's query language.
>>>>> +
>>>>> +It is worth noting that:
>>>>> +
>>>>> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
>>>>> +
>>>>> +works simply because Lucene is required to do nothing more than a _union_ of
>>>>> +matching documents the same as if written:
>>>>> +
>>>>> +    { ?s text:query "label:printer" . }
>>>>> +    union
>>>>> +    { ?s text:query "description:\"large capacity cartridge\"" . }
>>> Don't do that, even if it happens to work.
>>>>> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description <http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>) 
>>>>> +are all available to be used for searches within a `Field`. For example:
>>>>> +
>>>>> +    ?s text:query "description:(large AND cartridge)"
>>>>> +
>>>>> +and
>>>>> +
>>>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
>>>>> +
>>>>> +will work as expected.
>>> The first one is better written as
>>>     ?s text:query (ex:description "(large AND cartridge)")
>>> -Osma
> 


Re: CMS diff: Jena Full Text Search

Posted by Chris Tomlinson <ch...@gmail.com>.
Hi Andy and Osma,

I posted JENA-1426 <https://issues.apache.org/jira/browse/JENA-1426> since the “improve this page” facility didn’t seem to offer any way to add a commit message or more extensive explanation of the reasons for the proposed edits and they were somewhat extensive. So raising an issue seemed a way to proceed; however, after several days with no comments I thought perhaps I should follow the published protocol and I made the update as guest on the CMS.

I had several motivations regarding updating the documentation: 1) I wanted to present how the current implementation functions in a way that might be more useful to users - for example clarifying what can be expected to work and what not in terms of using the native Lucene query language, e.g., JENA-1388 <https://issues.apache.org/jira/browse/JENA-1388>; 2) identify areas that might indicate perhaps unintended aspects of the current implementation; and 3) understand the code in preparation for developing a proposal for adding jena-text highlighting support <http://apache.markmail.org/message/rlzzdd3yw7a7aoqc?q=jena-text+highlighting+support>.

Based on Osma’s feedback I will be opening a few issues on JIRA and making corrections to the original submission. I assume that updates should just be made as further commits.

Thanks,
Chris



> On Nov 22, 2017, at 6:41 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> How is this related to JENA-1426?
> 
>    Andy
> 
> On 21/11/17 14:48, Osma Suominen wrote:
>> ajs6f kirjoitti 20.11.2017 klo 18:36:
>>> Osma (or anyone else who knows text indexing better than do I, which wouldn't take much)-- could you review this? It's got some great useful detail about how the indexing works and can be used.
>> Sure, will do.
>> Comments about specific sections below. Generally this is a very good contribution to the jena-text documentation, which has stagnated a bit.
>>>> +The following illustrates a Lucene document that Jena will create and
>>>> +request Lucene to index:
>>>> +
>>>> +    Document<
>>>> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
>>>> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
>>>> +        stored, indexed, tokenized <label:zorn protégé a prés>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>>> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>>>> +        >
>>>> +
>>>> +It may be instructive to refer back to this example when considering the various
>>>> +points below.
>> Not sure if this is a perfect illustration. The level of detail is rather excessive. I know Lucene quite well and I still struggle to understand what's going on here. Is there another way of presenting this information, for example just a key-value list that shows the field values that get stored in the document? I think the field options stored, indexed, tokenized, omitNorms etc. are unnecessary here or at least should not be so prominent.
>>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>>> +a BCP-47 language tag. This restricts searches to field values that were originally
>>>> +indexed with the tag _xx_. Searches may be restricted to field values with no
>>>> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
>>>> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.
>> The last sentence is not true. You can restrict by language even without enabling multilingual support, as long as langField has been set.
>>>> +Further, if the `lang:xx` is used then the `property` URI must be supplied
>>>> +in order for searches to work.
>> Not true. The default property should be used if no property was specified.
>>>> +When working with `rdf:langString`s It may be tempting to write:
>>>> +
>>>> +    ?s text:query "protégé"@fr
>>>> +
>>>> +However, the above will silently fail to return results since the
>>>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
>> This could be considered a bug - at least it shouldn't fail silently.
>>>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>>>> +to use the above form rather than omitting the `property` argument
>>>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>>>> +again there will be no results.
>> Again, not true. I just tested this query against YSO:
>>  ?s text:query ("cat" "lang:en")
>> and it gave a single result, as expected.
>>>> +For a non-default `Field` with no language restriction, the patterns:
>>>> +
>>>> +    ?s text:query (rdfs:label "protégé")
>>>> +
>>>> +or
>>>> +
>>>> +    ?s text:query "rdfsLabel:protégé"
>>>> +
>>>> +may be used (see [below](#entity-map-definition) for how RDF _property_ names
>>>> +are mapped to Lucene `Field` names). 
>> I wouldn't recommend using a query form like "rdfsLabel:protégé" in the documentation at all. It violates the layered architecture of jena-text - the query should not be targeting named fields. If you want to target rdfs:label, use the first form.
>>>> However, as mentioned earlier,
>>>> +
>>>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>>>> +
>>>> +will result in an error owing to the way in which the jena-text composes the
>>>> +query string to Lucene in the presence of the `"lang:fr"` argument.
>> Don't do that then. Remove this section. (see previous comment)
>>>> +However, it is important to note that the apparently equivalent form:
>>>> +
>>>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>>>> +
>>>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
>>>> +bound as expected.
>> Again, don't do that. Use (rdfs:label "protégé") instead and let jena-text handle the translation from property to Lucene field.
>>>> +So if the _literal_ matches are needed you **must use** the query arguments that
>>>> +list the _property_ explicitly, except in the simple case of a query against
>>>> +the default `Field`/_property_.
>> Exactly. And those are the only supported query forms anyway.
>>>> +#### Queries across multiple `Field`s
>>>> +
>>>> +It has been mentioned earlier that the text index uses the
>>>> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description); 
>>>> +however, there are important constraints on how the Lucene query language is used within jena-text.
>>>> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
>>>> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
>>>> +reflect the fact that each triple is a separate document.
>>>> +
>>>> +This latter observation is important when considering queries that are intended to involve several
>>>> fields.
>>>> 
>>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>>> -text search engine; and the entire matched `?literal` (if the index has
>>>> -been [configured to store literal values](#text-dataset-assembler)).
>>>> +For example, consider the following triples:
>>>> 
>>>> -If the `query string` refers to more than one field, e.g.,
>>>> +    ex:SomePrinter
>>>> +        rdfs:label     "laser printer" ;
>>>> +        ex:description "includes a large capacity cartridge" .
>>>> 
>>>> -    "label: printer AND description: \"large capacity cartridge\""
>>>> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
>>>> + with the following query:
>>>> 
>>>> -then the `?literal` in the results will not be bound since there is no
>>>> -single field that contains the match &ndash; the match is separated over
>>>> -two fields.
>>>> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>>>> 
>>>> -If an output indexed term is already a known value, either as a constant
>>>> -in the query or variable already set, then the index lookup becomes a
>>>> -check that this is a match for the input arguments.
>>>> +However, this query will fail to find the expected results since the `AND` is interpreted
>>>> +by Lucene to indicate that all documents that contain a matching `label` field _and_
>>>> +a matching `description` field are to be returned. Yet from the discussion above
>>>> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
>>>> +is evident that there is not one but rather in fact two separate documents one with a
>>>> +`label` field and one with a `description` field so an effective query is:
>>>> 
>>>> +    ?s text:query "label:printer" .
>>>> +    ?s text:query "description:\"large capacity cartridge\"" .
>>>> +
>>>> +which leads to `?s` being bound to `ex:SomePrinter`.
>> Again this should instead be written as
>>     ?s text:query (rdfs:label "printer") .
>>     ?s text:query (ex:description "large capacity cartridge") .
>>>> +In other words when a query is to involve two or more _properties_/`Field`s then it
>>>> +expressed at the SPARQL level, as it were, versus in Lucene's query language.
>>>> +
>>>> +It is worth noting that:
>>>> +
>>>> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
>>>> +
>>>> +works simply because Lucene is required to do nothing more than a _union_ of
>>>> +matching documents the same as if written:
>>>> +
>>>> +    { ?s text:query "label:printer" . }
>>>> +    union
>>>> +    { ?s text:query "description:\"large capacity cartridge\"" . }
>> Don't do that, even if it happens to work.
>>>> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description) 
>>>> +are all available to be used for searches within a `Field`. For example:
>>>> +
>>>> +    ?s text:query "description:(large AND cartridge)"
>>>> +
>>>> +and
>>>> +
>>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
>>>> +
>>>> +will work as expected.
>> The first one is better written as
>>     ?s text:query (ex:description "(large AND cartridge)")
>> -Osma


Re: CMS diff: Jena Full Text Search

Posted by Andy Seaborne <an...@apache.org>.
How is this related to JENA-1426?

     Andy

On 21/11/17 14:48, Osma Suominen wrote:
> ajs6f kirjoitti 20.11.2017 klo 18:36:
>> Osma (or anyone else who knows text indexing better than do I, which 
>> wouldn't take much)-- could you review this? It's got some great 
>> useful detail about how the indexing works and can be used.
> 
> Sure, will do.
> 
> Comments about specific sections below. Generally this is a very good 
> contribution to the jena-text documentation, which has stagnated a bit.
> 
> 
>>> +The following illustrates a Lucene document that Jena will create and
>>> +request Lucene to index:
>>> +
>>> +    Document<
>>> +        stored, indexed, indexOptions=DOCS 
>>> <uri:http://example.org/SomeOne>
>>> +        indexed, omitNorms, indexOptions=DOCS 
>>> <graph:urn:x-arq:DefaultGraphNode>
>>> +        stored, indexed, tokenized <label:zorn protégé a prés>
>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
>>> +        stored, indexed, omitNorms, indexOptions=DOCS 
>>> <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>> +        stored, indexed, tokenized 
>>> <graph_fr:urn:x-arq:DefaultGraphNode>
>>> +        stored, indexed, omitNorms, indexOptions=DOCS 
>>> <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>>> +        >
>>> +
>>> +It may be instructive to refer back to this example when considering 
>>> the various
>>> +points below.
> 
> Not sure if this is a perfect illustration. The level of detail is 
> rather excessive. I know Lucene quite well and I still struggle to 
> understand what's going on here. Is there another way of presenting this 
> information, for example just a key-value list that shows the field 
> values that get stored in the document? I think the field options 
> stored, indexed, tokenized, omitNorms etc. are unnecessary here or at 
> least should not be so prominent.
> 
> 
>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>> +a BCP-47 language tag. This restricts searches to field values that 
>>> were originally
>>> +indexed with the tag _xx_. Searches may be restricted to field 
>>> values with no
>>> +language tag via `"lang:none"`. The use of the `lang:xx` is only 
>>> effective if
>>> +[multilingual support](#linguistic-support-with-lucene-index) has 
>>> been configured.
> 
> The last sentence is not true. You can restrict by language even without 
> enabling multilingual support, as long as langField has been set.
> 
>>> +Further, if the `lang:xx` is used then the `property` URI must be 
>>> supplied
>>> +in order for searches to work.
> 
> Not true. The default property should be used if no property was specified.
> 
> 
>>> +When working with `rdf:langString`s It may be tempting to write:
>>> +
>>> +    ?s text:query "protégé"@fr
>>> +
>>> +However, the above will silently fail to return results since the
>>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
> 
> This could be considered a bug - at least it shouldn't fail silently.
> 
>>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>>> +to use the above form rather than omitting the `property` argument
>>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>>> +again there will be no results.
> 
> Again, not true. I just tested this query against YSO:
>   ?s text:query ("cat" "lang:en")
> and it gave a single result, as expected.
> 
>>> +For a non-default `Field` with no language restriction, the patterns:
>>> +
>>> +    ?s text:query (rdfs:label "protégé")
>>> +
>>> +or
>>> +
>>> +    ?s text:query "rdfsLabel:protégé"
>>> +
>>> +may be used (see [below](#entity-map-definition) for how RDF 
>>> _property_ names
>>> +are mapped to Lucene `Field` names). 
> 
> I wouldn't recommend using a query form like "rdfsLabel:protégé" in the 
> documentation at all. It violates the layered architecture of jena-text 
> - the query should not be targeting named fields. If you want to target 
> rdfs:label, use the first form.
> 
>>> However, as mentioned earlier,
>>> +
>>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>>> +
>>> +will result in an error owing to the way in which the jena-text 
>>> composes the
>>> +query string to Lucene in the presence of the `"lang:fr"` argument.
> 
> Don't do that then. Remove this section. (see previous comment)
> 
>>> +However, it is important to note that the apparently equivalent form:
>>> +
>>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>>> +
>>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` 
>>> are
>>> +bound as expected.
> 
> Again, don't do that. Use (rdfs:label "protégé") instead and let 
> jena-text handle the translation from property to Lucene field.
> 
>>> +So if the _literal_ matches are needed you **must use** the query 
>>> arguments that
>>> +list the _property_ explicitly, except in the simple case of a query 
>>> against
>>> +the default `Field`/_property_.
> 
> Exactly. And those are the only supported query forms anyway.
> 
>>> +#### Queries across multiple `Field`s
>>> +
>>> +It has been mentioned earlier that the text index uses the
>>> +[native Lucene query 
>>> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description); 
>>>
>>> +however, there are important constraints on how the Lucene query 
>>> language is used within jena-text.
>>> +This is owing to the fact that jena-text composes the query string 
>>> that is sent to Lucene so that
>>> +features such as `lang:xx` may be implemented. Other aspects of 
>>> using the Lucene query language
>>> +reflect the fact that each triple is a separate document.
>>> +
>>> +This latter observation is important when considering queries that 
>>> are intended to involve several
>>> fields.
>>>
>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>> -text search engine; and the entire matched `?literal` (if the index has
>>> -been [configured to store literal values](#text-dataset-assembler)).
>>> +For example, consider the following triples:
>>>
>>> -If the `query string` refers to more than one field, e.g.,
>>> +    ex:SomePrinter
>>> +        rdfs:label     "laser printer" ;
>>> +        ex:description "includes a large capacity cartridge" .
>>>
>>> -    "label: printer AND description: \"large capacity cartridge\""
>>> + assuming an appropriate configuration we might expect to retrieve 
>>> `ex:SomePrinter`
>>> + with the following query:
>>>
>>> -then the `?literal` in the results will not be bound since there is no
>>> -single field that contains the match &ndash; the match is separated 
>>> over
>>> -two fields.
>>> +    ?s text:query "label:printer AND description:\"large capacity 
>>> cartridge\""
>>>
>>> -If an output indexed term is already a known value, either as a 
>>> constant
>>> -in the query or variable already set, then the index lookup becomes a
>>> -check that this is a match for the input arguments.
>>> +However, this query will fail to find the expected results since the 
>>> `AND` is interpreted
>>> +by Lucene to indicate that all documents that contain a matching 
>>> `label` field _and_
>>> +a matching `description` field are to be returned. Yet from the 
>>> discussion above
>>> +regarding the [structure of Lucene documents in 
>>> jena-text](#document-structure) it
>>> +is evident that there is not one but rather in fact two separate 
>>> documents one with a
>>> +`label` field and one with a `description` field so an effective 
>>> query is:
>>>
>>> +    ?s text:query "label:printer" .
>>> +    ?s text:query "description:\"large capacity cartridge\"" .
>>> +
>>> +which leads to `?s` being bound to `ex:SomePrinter`.
> 
> Again this should instead be written as
> 
>      ?s text:query (rdfs:label "printer") .
>      ?s text:query (ex:description "large capacity cartridge") .
> 
>>> +In other words when a query is to involve two or more 
>>> _properties_/`Field`s then it
>>> +expressed at the SPARQL level, as it were, versus in Lucene's query 
>>> language.
>>> +
>>> +It is worth noting that:
>>> +
>>> +    ?s text:query "label:printer OR description:\"large capacity 
>>> cartridge\""
>>> +
>>> +works simply because Lucene is required to do nothing more than a 
>>> _union_ of
>>> +matching documents the same as if written:
>>> +
>>> +    { ?s text:query "label:printer" . }
>>> +    union
>>> +    { ?s text:query "description:\"large capacity cartridge\"" . }
> 
> Don't do that, even if it happens to work.
> 
> 
>>> +On the other hand the various features of the [Lucene query 
>>> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description) 
>>>
>>> +are all available to be used for searches within a `Field`. For 
>>> example:
>>> +
>>> +    ?s text:query "description:(large AND cartridge)"
>>> +
>>> +and
>>> +
>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large 
>>> OR capacity))")
>>> +
>>> +will work as expected.
> 
> The first one is better written as
> 
>      ?s text:query (ex:description "(large AND cartridge)")
> 
> 
> -Osma
> 

Re: CMS diff: Jena Full Text Search

Posted by Osma Suominen <os...@helsinki.fi>.
ajs6f kirjoitti 20.11.2017 klo 18:36:
> Osma (or anyone else who knows text indexing better than do I, which wouldn't take much)-- could you review this? It's got some great useful detail about how the indexing works and can be used.

Sure, will do.

Comments about specific sections below. Generally this is a very good 
contribution to the jena-text documentation, which has stagnated a bit.


>> +The following illustrates a Lucene document that Jena will create and
>> +request Lucene to index:
>> +
>> +    Document<
>> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
>> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
>> +        stored, indexed, tokenized <label:zorn protégé a prés>
>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>> +        >
>> +
>> +It may be instructive to refer back to this example when considering the various
>> +points below.

Not sure if this is a perfect illustration. The level of detail is 
rather excessive. I know Lucene quite well and I still struggle to 
understand what's going on here. Is there another way of presenting this 
information, for example just a key-value list that shows the field 
values that get stored in the document? I think the field options 
stored, indexed, tokenized, omitNorms etc. are unnecessary here or at 
least should not be so prominent.


>> +The `lang:xx` specification is an optional string, where _xx_ is
>> +a BCP-47 language tag. This restricts searches to field values that were originally
>> +indexed with the tag _xx_. Searches may be restricted to field values with no
>> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
>> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.

The last sentence is not true. You can restrict by language even without 
enabling multilingual support, as long as langField has been set.

>> +Further, if the `lang:xx` is used then the `property` URI must be supplied
>> +in order for searches to work.

Not true. The default property should be used if no property was specified.


>> +When working with `rdf:langString`s It may be tempting to write:
>> +
>> +    ?s text:query "protégé"@fr
>> +
>> +However, the above will silently fail to return results since the
>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.

This could be considered a bug - at least it shouldn't fail silently.

>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>> +to use the above form rather than omitting the `property` argument
>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>> +again there will be no results.

Again, not true. I just tested this query against YSO:
  ?s text:query ("cat" "lang:en")
and it gave a single result, as expected.

>> +For a non-default `Field` with no language restriction, the patterns:
>> +
>> +    ?s text:query (rdfs:label "protégé")
>> +
>> +or
>> +
>> +    ?s text:query "rdfsLabel:protégé"
>> +
>> +may be used (see [below](#entity-map-definition) for how RDF _property_ names
>> +are mapped to Lucene `Field` names). 

I wouldn't recommend using a query form like "rdfsLabel:protégé" in the 
documentation at all. It violates the layered architecture of jena-text 
- the query should not be targeting named fields. If you want to target 
rdfs:label, use the first form.

>> However, as mentioned earlier,
>> +
>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>> +
>> +will result in an error owing to the way in which the jena-text composes the
>> +query string to Lucene in the presence of the `"lang:fr"` argument.

Don't do that then. Remove this section. (see previous comment)

>> +However, it is important to note that the apparently equivalent form:
>> +
>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>> +
>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
>> +bound as expected.

Again, don't do that. Use (rdfs:label "protégé") instead and let 
jena-text handle the translation from property to Lucene field.

>> +So if the _literal_ matches are needed you **must use** the query arguments that
>> +list the _property_ explicitly, except in the simple case of a query against
>> +the default `Field`/_property_.

Exactly. And those are the only supported query forms anyway.

>> +#### Queries across multiple `Field`s
>> +
>> +It has been mentioned earlier that the text index uses the
>> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
>> +however, there are important constraints on how the Lucene query language is used within jena-text.
>> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
>> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
>> +reflect the fact that each triple is a separate document.
>> +
>> +This latter observation is important when considering queries that are intended to involve several
>> fields.
>>
>> -The results include the subject URI, `?s`; the `?score` assigned by the
>> -text search engine; and the entire matched `?literal` (if the index has
>> -been [configured to store literal values](#text-dataset-assembler)).
>> +For example, consider the following triples:
>>
>> -If the `query string` refers to more than one field, e.g.,
>> +    ex:SomePrinter
>> +        rdfs:label     "laser printer" ;
>> +        ex:description "includes a large capacity cartridge" .
>>
>> -    "label: printer AND description: \"large capacity cartridge\""
>> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
>> + with the following query:
>>
>> -then the `?literal` in the results will not be bound since there is no
>> -single field that contains the match &ndash; the match is separated over
>> -two fields.
>> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>>
>> -If an output indexed term is already a known value, either as a constant
>> -in the query or variable already set, then the index lookup becomes a
>> -check that this is a match for the input arguments.
>> +However, this query will fail to find the expected results since the `AND` is interpreted
>> +by Lucene to indicate that all documents that contain a matching `label` field _and_
>> +a matching `description` field are to be returned. Yet from the discussion above
>> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
>> +is evident that there is not one but rather in fact two separate documents one with a
>> +`label` field and one with a `description` field so an effective query is:
>>
>> +    ?s text:query "label:printer" .
>> +    ?s text:query "description:\"large capacity cartridge\"" .
>> +
>> +which leads to `?s` being bound to `ex:SomePrinter`.

Again this should instead be written as

     ?s text:query (rdfs:label "printer") .
     ?s text:query (ex:description "large capacity cartridge") .

>> +In other words when a query is to involve two or more _properties_/`Field`s then it
>> +expressed at the SPARQL level, as it were, versus in Lucene's query language.
>> +
>> +It is worth noting that:
>> +
>> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
>> +
>> +works simply because Lucene is required to do nothing more than a _union_ of
>> +matching documents the same as if written:
>> +
>> +    { ?s text:query "label:printer" . }
>> +    union
>> +    { ?s text:query "description:\"large capacity cartridge\"" . }

Don't do that, even if it happens to work.


>> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>> +are all available to be used for searches within a `Field`. For example:
>> +
>> +    ?s text:query "description:(large AND cartridge)"
>> +
>> +and
>> +
>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
>> +
>> +will work as expected.

The first one is better written as

     ?s text:query (ex:description "(large AND cartridge)")


-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: CMS diff: Jena Full Text Search

Posted by ajs6f <aj...@apache.org>.
I went to review this diff and rediscovered (to my chagrin) that I really know very little about Jena's text indexing.

Osma (or anyone else who knows text indexing better than do I, which wouldn't take much)-- could you review this? It's got some great useful detail about how the indexing works and can be used.

ajs6f

> On Nov 20, 2017, at 1:51 AM, Chris Tomlinson <an...@apache.org> wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> Chris Tomlinson
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===================================================================
> --- trunk/content/documentation/query/text-query.mdtext	(revision 1815762)
> +++ trunk/content/documentation/query/text-query.mdtext	(working copy)
> @@ -1,5 +1,7 @@
> Title: Jena Full Text Search
> 
> +Title: Jena Full Text Search
> +
> This extension to ARQ combines SPARQL and full text search via
> [Lucene](https://lucene.apache.org) 6.4.1 or
> [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
> @@ -64,7 +66,20 @@
> ## Table of Contents
> 
> -   [Architecture](#architecture)
> +    -   [External content](#external-content)
> +    -   [External applications](#external-applications)
> +    -   [Document structure](#document-structure)
> -   [Query with SPARQL](#query-with-sparql)
> +    -   [Syntax](#syntax)
> +        -   [Input arguments](#input-arguments)
> +        -   [Output arguments](#output-arguments)
> +    -   [Query strings](#query-strings)
> +        -   [Simple queries](#simple-queries)
> +        -   [Queries with language tags](#queries-with-language-tags)
> +        -   [Queries that retrieve literals](#queries-that-retrieve-literals)
> +        -   [Queries across multiple `Field`s](#queries-across-multiple-fields)
> +        -   [Queries within a `Field`](#queries-within-a-field)
> +    -   [Good practice](#good-practice)
> -   [Configuration](#configuration)
>     -   [Text Dataset Assembler](#text-dataset-assembler)
>     -   [Configuring an analyzer](#configuring-an-analyzer)
> @@ -134,6 +149,69 @@
> By using Elasticsearch, other applications can share the text index with
> SPARQL search.
> 
> +### Document structure
> +
> +As mentioned above, text indexing of a triple involves associating a Lucene
> +document with the triple. How is this done?
> +
> +Lucene documents are composed of `Field`s. Indexing and searching are performed 
> +over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the 
> +_property_ of the triple must be 
> +[configured in the entity map of a TextIndex](#entity-map-definition).
> +This associates a Lucene analyzer with the _`property`_ which will be used
> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
> +`Field` in the resulting document.
> +
> +A Lucene index includes a _default_ `Field`, which is specified in the configuration, 
> +that is the field to search if not otherwise named in the query. In jena-text 
> +this field is configured via the `text:defaultField` property which is then mapped 
> +to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition) 
> +below).
> +
> +There are several additional `Field`s that will be included in the
> +document that is passed to the Lucene `IndexWriter` depending on the
> +configuration options that are used. These additional fields are used to
> +manage the interface between Jena and Lucene and are not generally 
> +searchable per se.
> +
> +The most important of these additional `Field`s is the `text:entityField`.
> +This configuration property defines the name of the `Field` that will contain
> +the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
> +not have a default and must be specified for most uses of `jena-text`. This
> +`Field` is often given the name, `uri`, in examples. It is via this `Field`
> +that `?s` is bound in a typical use such as:
> +
> +    select ?s
> +    where {
> +        ?s text:query "some text"
> +    }
> +
> +Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
> +and so on are discussed below.
> +
> +Given the triple:
> +
> +    ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
> +
> +The following illustrates a Lucene document that Jena will create and
> +request Lucene to index:
> +
> +    Document<
> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne> 
> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode> 
> +        stored, indexed, tokenized <label:zorn protégé a prés> 
> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr> 
> +        stored, indexed, tokenized <label_fr:zorn protégé a prés> 
> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00> 
> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode> 
> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr> 
> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode> 
> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
> +        >
> +
> +It may be instructive to refer back to this example when considering the various
> +points below.
> +
> ## Query with SPARQL
> 
> The URI of the text extension property function is
> @@ -143,63 +221,248 @@
> 
>     ...   text:query ...
> 
> +### Syntax
> 
> The following forms are all legal:
> 
> -    ?s text:query 'word'                   # query
> -    ?s text:query (rdfs:label 'word')      # query specific property if multiple
> -    ?s text:query ('word' 10)              # with limit on results
> -    (?s ?score) text:query 'word'          # query capturing also the score
> -    (?s ?score ?literal) text:query 'word' # ... and original literal value
> +    ?s text:query 'word'                              # query
> +    ?s text:query ('word' 10)                         # with limit on results
> +    ?s text:query (rdfs:label 'word')                 # query specific property if multiple
> +    ?s text:query (rdfs:label 'protégé' 'lang:fr')    # restrict search to French
> +    (?s ?score) text:query 'word'                     # query capturing also the score
> +    (?s ?score ?literal) text:query 'word'            # ... and original literal value
> 
> The most general form is:
> 
> -     (?s ?score ?literal) text:query (property 'query string' limit)
> +     (?s ?score ?literal) text:query (property 'query string' limit 'lang:xx')
> 
> -Only the query string is required, and if it is the only argument the
> -surrounding `( )` can be omitted.
> +#### Input arguments:
> 
> -Input arguments:
> -
> | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
> |-------------------|--------------------------------|
> | property          | (optional) URI (including prefix name form) |
> -| query string      | The native query string        |
> +| query string      | Lucene query string fragment       |
> | limit             | (optional) `int` limit on the number of results       |
> +| lang:xx           | (optional) language tag spec       |
> 
> -Output arguments:
> +The `property` URI is only necessary if multiple properties have been
> +indexed and the property being searched over is not the [default field
> +of the index](#entity-map-definition).  Also the `property` URI **must
> +not** be used when the `query string` refers explicitly to one or more
> +fields. The optional `limit` indicates the maximum hits to be returned by Lucene.
> 
> +The `lang:xx` specification is an optional string, where _xx_ is 
> +a BCP-47 language tag. This restricts searches to field values that were originally 
> +indexed with the tag _xx_. Searches may be restricted to field values with no 
> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if 
> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.
> +Further, if the `lang:xx` is used then the `property` URI must be supplied
> +in order for searches to work.
> +
> +If both `limit` and `lang:xx` are present, then `limit` must precede
> +`lang:xx`.
> +
> +If only the query string is required, the surrounding `( )` can be omitted.
> +
> +#### Output arguments:
> +
> | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
> |-------------------|--------------------------------|
> -| indexed term      | The indexed RDF term.          |
> +| subject URI       | The subject of the indexed RDF triple.          |
> | score             | (optional) The score for the match. |
> -| hit               | (optional) The literal matched. |
> +| literal           | (optional) The matched object literal. |
> 
> -The `property` URI is only necessary if multiple properties have been
> -indexed and the property being searched over is not the [default field
> -of the index](#entity-map-definition).  Also the `property` URI **must
> -not** be used when the `query string` refers explicitly to one or more
> +The results include the _subject URI_; the _score_ assigned by the
> +text search engine; and the entire matched _literal_ (if the index has
> +been [configured to store literal values](#text-dataset-assembler)).
> +The _subject URI_ may be a variable, e.g., `?s`, or a _URI_. In the
> +latter case the search is restricted to triples with the specified
> +subject. The _score_ and the _literal_ **must** be variables.
> +
> +If only the _subject_ variable, `?s`, or specific _`URI`_ is needed 
> +then it must be written without surrounding `( )`; otherwise, an error 
> +is signalled.
> +
> +### Query strings
> +
> +There are several points that need to be considered when formulating
> +SPARQL queries using the Lucene interface.
> +
> +#### Simple queries
> +
> +The simplest use of the jena-text Lucene integration is:
> +
> +    ?s text:query "some phrase"
> +
> +This will bind `?s` to each entity URI that is the subject of a triple
> +that has the default property and an object literal that matches
> +the argument string, e.g.:
> +
> +    ex:AnEntity skos:prefLabel "this is some phrase to match"
> +
> +This query form will indicate the subjects that have literals that match
> +without providing any information about the specific literals that matched.
> +If this use case is sufficient for your needs you can skip on to the 
> +[sections on configuration](#configuration).
> +
> +#### Queries with language tags
> +
> +When working with `rdf:langString`s It may be tempting to write:
> +
> +    ?s text:query "protégé"@fr
> +
> +However, the above will silently fail to return results since the
> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
> +
> +The effective form of the above query is expressed:
> +
> +    ?s text:query (skos:prefLabel "protégé" 'lang:fr')
> +
> +if the intent is to search only labels with French content.
> +
> +Even if the default _property_ is `skos:prefLabel` it is necessary
> +to use the above form rather than omitting the `property` argument
> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
> +again there will be no results.
> +
> +If all one is interested in are _subjects_ with `skos:prefLabel` where
> +that is the `text:predicate` of the `text:defaultField` and without regard 
> +for specified `lang:xx`s then:
> +
> +    ?s text:query "protégé"
> +
> +will do the job.
> +
> +For a non-default `Field` with no language restriction, the patterns:
> +
> +    ?s text:query (rdfs:label "protégé")
> +
> +or
> +
> +    ?s text:query "rdfsLabel:protégé"
> +
> +may be used (see [below](#entity-map-definition) for how RDF _property_ names 
> +are mapped to Lucene `Field` names). However, as mentioned earlier,
> +
> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
> +
> +will result in an error owing to the way in which the jena-text composes the
> +query string to Lucene in the presence of the `"lang:fr"` argument.
> +
> +#### Queries that retrieve literals
> +
> +It is possible to retrieve the *literal*s that Lucene finds matches for
> +assuming that
> +
> +    <#TextIndex#> text:storeValues true ;
> +
> +has been specified in the `TextIndex` configuration. So
> +
> +    (?s ?sc ?lit) text:query (rdfs:label "protégé")
> +
> +will bind the matching literals to `?lit`, e.g.,
> +
> +    "zorn protégé a prés"@fr
> +
> +However, it is important to note that the apparently equivalent form:
> +
> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
> +
> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
> +bound as expected.
> +
> +So if the _literal_ matches are needed you **must use** the query arguments that
> +list the _property_ explicitly, except in the simple case of a query against
> +the default `Field`/_property_.
> +
> +#### Queries across multiple `Field`s
> +
> +It has been mentioned earlier that the text index uses the
> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
> +however, there are important constraints on how the Lucene query language is used within jena-text.
> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
> +reflect the fact that each triple is a separate document.
> +
> +This latter observation is important when considering queries that are intended to involve several
> fields.
> 
> -The results include the subject URI, `?s`; the `?score` assigned by the
> -text search engine; and the entire matched `?literal` (if the index has
> -been [configured to store literal values](#text-dataset-assembler)).
> +For example, consider the following triples:
> 
> -If the `query string` refers to more than one field, e.g.,
> +    ex:SomePrinter 
> +        rdfs:label     "laser printer" ;
> +        ex:description "includes a large capacity cartridge" .
> 
> -    "label: printer AND description: \"large capacity cartridge\""
> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
> + with the following query:
> 
> -then the `?literal` in the results will not be bound since there is no
> -single field that contains the match &ndash; the match is separated over
> -two fields.
> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
> 
> -If an output indexed term is already a known value, either as a constant
> -in the query or variable already set, then the index lookup becomes a
> -check that this is a match for the input arguments.
> +However, this query will fail to find the expected results since the `AND` is interpreted
> +by Lucene to indicate that all documents that contain a matching `label` field _and_
> +a matching `description` field are to be returned. Yet from the discussion above
> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
> +is evident that there is not one but rather in fact two separate documents one with a 
> +`label` field and one with a `description` field so an effective query is:
> 
> +    ?s text:query "label:printer" .
> +    ?s text:query "description:\"large capacity cartridge\"" .
> +
> +which leads to `?s` being bound to `ex:SomePrinter`.
> +
> +In other words when a query is to involve two or more _properties_/`Field`s then it
> +expressed at the SPARQL level, as it were, versus in Lucene's query language.
> +
> +It is worth noting that:
> +
> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
> +
> +works simply because Lucene is required to do nothing more than a _union_ of
> +matching documents the same as if written:
> +
> +    { ?s text:query "label:printer" . }
> +    union
> +    { ?s text:query "description:\"large capacity cartridge\"" . }
> +
> +Suppose the matching literals are required for the above then it should be clear
> +from the above that:
> +
> +    (?s ?sc1 ?lit1) text:query (skos:prefLabel "printer") .
> +    (?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") .
> +
> +will be the appropriate form to retrieve the _subject_ and the associated literals.
> +
> +There is no loss of expressiveness of the Lucene query language versus the jena-text
> +integration of Lucene. Any cross-field `AND`s are replaced by concurrent SPARQL calls to
> +text:query as illustrated above and uses of Lucene `OR` can be converted to SPARQL 
> +`union`s. Uses of Lucene `NOT` are converted to appropriate SPARQL `filter`s.
> +
> +#### Queries within a `Field`
> +
> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
> +are all available to be used for searches within a `Field`. For example:
> +
> +    ?s text:query "description:(large AND cartridge)"
> +
> +and
> +
> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
> +
> +will work as expected.
> +
> +The key is to always surround the field query with `( )`s.
> +
> +
> ### Good practice
> 
> -The query engine does not have information about the selectivity of the
> +From the above it should be clear that best practice, except in the simplest cases
> +is to use explicit `text:query` forms such as:
> +
> +    (?s ?sc ?lit) text:query (ex:someProperty "a single Field query")
> +
> +possibly with _limit_ and `lang:xx` arguments.
> +
> +Further, the query engine does not have information about the selectivity of the
> text index and so effective query plans cannot be determined
> programmatically.  It is helpful to be aware of the following two
> general query patterns.
> @@ -394,7 +657,7 @@
> is returned on a match. The value of the property is arbitrary so long as it is unique among the
> defined names.
> 
> -#### Automatic document deletion
> +#### UID Field and automatic document deletion
> 
> When the `text:uidField` is defined in the `EntityMap` then dropping a triple will result in the 
> corresponding document, if any, being deleted from the text index. The value, `"uid"`, is arbitrary 
> @@ -632,7 +895,7 @@
> 
> #### SPARQL Linguistic Clause Forms
> 
> -Once the `langField` is set, you can use it directly inside SPARQL queries, for that the `'lang:xx'`
> +Once the `langField` is set, you can use it directly inside SPARQL queries, for that the `lang:xx`
> argument allows you to target specific localized values. For example:
> 
>     //target english literals
> @@ -714,7 +977,7 @@
> Hence, the result set of the query will contain "institute" related
> subjects (institution, institutional,...) in French and in English.
> 
> -**Note**: If the `text:langField` property is not set, the `text:langField` will default to"lang".
> +**Note**: If the `text:langField` property is not set, the `text:langField` will default to "lang".
> 
> ### Generic and Defined Analyzer Support
> 
>