Posted to by on 2013/04/10 18:12:39 UTC

svn commit: r1466541 - /jena/site/trunk/content/documentation/query/text-query.mdtext

Author: andy
Date: Wed Apr 10 16:12:39 2013
New Revision: 1466541

Text query documentation


Added: jena/site/trunk/content/documentation/query/text-query.mdtext
--- jena/site/trunk/content/documentation/query/text-query.mdtext (added)
+++ jena/site/trunk/content/documentation/query/text-query.mdtext Wed Apr 10 16:12:39 2013
@@ -0,0 +1,287 @@
+Title: LARQ - free text searches with SPARQL
+LARQ is a combination of SPARQL and text search.
+It gives applications the ability to perform free text searches within
+SPARQL queries. Text indexes are additional information for
+accessing the RDF graph.
+The text index can be either Apache Lucene, for a
+same-machine text index, or Apache Solr,
+for a large-scale enterprise search application.
+Some example code is available here: @@ [examples]().
+LARQ2 uses Lucene4 or Solr4.
+This query makes a text query for 'word' on a specific property
+(the index needs to be correctly configured) and limits the output
+to 10 matches; it then looks in the RDF data for
+the actual label.  More details are given below.
+    PREFIX text: <>
+    PREFIX rdfs: <>
+    SELECT ?s
+    { ?s text:query (rdfs:label 'word' 10) ; 
+         rdfs:label ?label 
+    }
+- [Architecture](#architecture)
+- [Configuration](#configuration)
+- [Loading Data](#loading-data)
+- [Query with SPARQL](#query-with-sparql)
+- [Examples](#examples)
+- [Working with Fuseki](#working-with-fuseki)
+## Architecture
+The text index is used to provide a reverse index mapping query strings to URIs.  
+The text indexed can be part of the RDF data, or the text index can be used to index
+external content, with only additional RDF held in the RDF store.
+The LARQ index uses the native query language of the text index:
+the Lucene query format or
+the Solr query format.
+A text-supporting dataset is configured with a description of which
+properties to index.  When data is added, any triple whose property matches the
+description causes an entry to be added to the index, built from the analysed
+text of the triple object and mapped to the subject.
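+As a rough illustration of the reverse-index idea described above, the following
+plain-Java sketch maps analysed terms to subject URIs. It is not how Lucene or
+Solr work internally, and it does not use the LARQ API: the class
+`ReverseIndexSketch`, its methods and the simplistic `analyse` step are all
+hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a text index: analysed term -> subject URIs.
final class ReverseIndexSketch {
    // Properties named by the (assumed) index description.
    private final Set<String> indexedProperties;
    // term -> subjects whose indexed literals contained that term
    private final Map<String, Set<String>> termToSubjects = new HashMap<>();

    public ReverseIndexSketch(Set<String> indexedProperties) {
        this.indexedProperties = indexedProperties;
    }

    // Very rough "analysis": lower-case, then split on anything that is
    // not a letter or digit. Real analysers do far more than this.
    public static List<String> analyse(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Called when a triple is added; only matching properties are indexed.
    public void addTriple(String subject, String property, String literal) {
        if (!indexedProperties.contains(property)) {
            return;
        }
        for (String term : analyse(literal)) {
            termToSubjects.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(subject);
        }
    }

    // Look up the subjects recorded for a single word.
    public Set<String> search(String word) {
        Set<String> hits = termToSubjects.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : hits;
    }

    public static void main(String[] args) {
        Set<String> props = new HashSet<>();
        props.add("rdfs:label");
        ReverseIndexSketch index = new ReverseIndexSketch(props);
        index.addTriple("http://example/book1", "rdfs:label", "A word about RDF");
        index.addTriple("http://example/book2", "dc:title", "Another word");
        System.out.println(index.search("word"));
    }
}
```

+Only the `rdfs:label` triple contributes to the index here, because it is the
+only property listed in the assumed index description.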
+### Pattern A: RDF data
+In this pattern, the data in the text index is indexing literals in the RDF data.  
+Additions to the RDF data are reflected in additions to the index.
+(Deletes do not remove text index entries - [see below](#deletion).)
+### Pattern B: External content
+There is no requirement that the text data indexed is present in the RDF
+data.  As long as the index contains text documents that match the
+index description, text search can be performed.
+For example, if the content of a collection of documents is indexed and the
+URI naming the document is the result of the text search, then an RDF
+dataset with the document metadata can be combined with accessing the
+content by URI.
+The maintenance of the index is external to the RDF data store.
+### External applications
+By using Solr, in either pattern A (RDF data indexed) or pattern B
+(external content indexed), other applications can share the
+text index with SPARQL search.
+## Query
+## Deletion
+If the text index is being maintained by changes to the RDF data, then deletion of
+RDF triples or quads does not cause entries in the index to be removed.  The
+index does not store the literal indexed, nor does it store a reference
+count of how many triples refer to an index entry, so the information needed to delete
+entries is not available. 
+In situations where this matters, the SPARQL query should look up in the
+text index, then check in the RDF data.  Indeed, this may be necessary
+anyway because a text search does not necessarily give only exact matches.
+In the initial example:
+    SELECT ?s ?label
+    { ?s text:query (rdfs:label 'word' 10) ; 
+         rdfs:label ?label 
+    }
+the SPARQL query is checking that the `rdfs:label` triple exists, and if it
+does, returning the whole label.
+By only indexing, and not storing, literals, the index is kept smaller.  It
+may be necessary to periodically rebuild the index if a large proportion
+of the RDF data changes.
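+The "look up in the index, then check in the RDF data" approach can be sketched
+in plain Java as follows. This is only an illustration of the idea, not LARQ
+code: `IndexThenCheckSketch`, `confirm` and the map standing in for the
+`rdfs:label` triples are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: index hits may be stale, so each hit is
// confirmed against the current data before it is returned.
final class IndexThenCheckSketch {

    // indexHits:    subjects returned by the text index (possibly stale)
    // labelsInData: subject -> label, standing in for rdfs:label triples
    // Returns only the hits that still have a label, with that label.
    public static Map<String, String> confirm(List<String> indexHits,
                                              Map<String, String> labelsInData) {
        Map<String, String> confirmed = new LinkedHashMap<>();
        for (String subject : indexHits) {
            String label = labelsInData.get(subject);
            if (label != null) {
                confirmed.put(subject, label);
            }
        }
        return confirmed;
    }

    public static void main(String[] args) {
        // book2 was deleted from the data, but its index entry remains.
        Map<String, String> labels = new HashMap<>();
        labels.put("http://example/book1", "A word about RDF");
        List<String> hits = Arrays.asList("http://example/book1", "http://example/book2");
        System.out.println(confirm(hits, labels));
    }
}
```

+The stale hit for the deleted subject is dropped because no matching label
+remains in the data, which mirrors the effect of the `rdfs:label ?label`
+pattern in the SPARQL query.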
+## Configuration
+The important structure is an "entity map", which defines the properties to
+index, the name of the Lucene/Solr field, and the field used for storing the URI.
+For common RDF use, you'd have one field, mapping a property to a text
+index field.
+More complex setups, with multiple properties per entity (URI), are possible.
+The usual way to describe an index is with a
+[Jena assembler description](../assembler/index.html).
+### Assemblers
+    @prefix :        <http://localhost/jena_example/#> .
+    @prefix rdf:     <> .
+    @prefix rdfs:    <> .
+    @prefix tdb:     <> .
+    @prefix ja:      <> .
+    @prefix text:    <> .
+    ## Example of a TDB dataset and text index
+    ## Initialize TDB
+    [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
+    tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
+    tdb:GraphTDB    rdfs:subClassOf  ja:Model .
+    ## Initialize LARQ
+    [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
+    # A TextDataset is a regular dataset with a text index.
+    text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
+    # Lucene index
+    text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
+    # Solr index
+    text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
+    ## ---------------------------------------------------------------
+    ## This URI must be fixed - it's used to assemble the text dataset.
+    :text_dataset rdf:type     text:TextDataset ;
+        text:dataset   <#dataset> ;
+        text:index     <#indexLucene> ;
+        .
+    # A TDB dataset used for RDF storage
+    <#dataset> rdf:type      tdb:DatasetTDB ;
+        tdb:location "DB" ;
+        tdb:unionDefaultGraph true ; # Optional
+        .
+    # Text index description
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        ##text:directory "mem" ;
+        text:entityMap <#entMap> ;
+        .
+    # Mapping in the index
+    # URI stored in field "uri"
+    # rdfs:label is mapped to field "text"
+    <#entMap> a text:EntityMap ;
+        text:entityField      "uri" ;
+        text:defaultField     "text" ;
+        text:map (
+             [ text:field "text" ; text:predicate rdfs:label ]
+             ) .
+then use code such as:
+    Dataset ds = DatasetFactory.assemble(
+        "text-config.ttl", 
+        "http://localhost/jena_example/#text_dataset") ;
+Key here is that the assembler contains two dataset definitions, one for
+the text dataset, one for the base data.  Therefore, the application
+needs to identify the text dataset by its URI.
+### Build with code
+        // Example of building a text dataset with code.
+        // Example is in-memory.
+        // Base data
+        Dataset ds1 = DatasetFactory.createMem() ; 
+        EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label.asNode()) ;
+        // Lucene, in memory.
+        Directory dir =  new RAMDirectory();
+        // Join together into a dataset
+        Dataset ds = TextDatasetFactory.createLucene(ds1, dir, entDef) ;
+### Fuseki
+The Fuseki configuration simply points to the text dataset as the
+`fuseki:dataset` of the service.
+    <#service_text_tdb> rdf:type fuseki:Service ;
+        rdfs:label                      "TDB/text service" ;
+        fuseki:name                     "ds" ;
+        fuseki:serviceQuery             "query" ;
+        fuseki:serviceQuery             "sparql" ;
+        fuseki:serviceUpdate            "update" ;
+        fuseki:serviceUpload            "upload" ;
+        fuseki:serviceReadGraphStore    "get" ;
+        fuseki:serviceReadWriteGraphStore    "data" ;
+        fuseki:dataset                  :text_dataset ;
+        .
+## Query with SPARQL
+The property function is `` more
+conveniently written:
+    PREFIX text: <>
+    ...   text:query ...
+This is different to LARQ v1.
+The following forms are all legal:
+    ?s text:query 'word'              # query
+    ?s text:query (rdfs:label 'word') # query specific property if multiple
+    ?s text:query ('word' 10)         # with limit on results
+The most general form is:
+    ?s text:query (<i>property</i> '<i>query string</i>' <i>limit</i>)
+Only the query string is required, and if it is the only argument the
+surrounding `( )` can be omitted.
+The property URI is only necessary if multiple properties have been indexed.
+| &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
+|-------------------|--------------------------------|
+| property          | The property URI (including prefixed name form) |
+| query string      | The native query string        |
+| limit             | The limit on the results       |
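+To make the argument forms concrete, here is a small plain-Java sketch of how
+such an argument list could be interpreted: an optional property, a required
+query string, and an optional limit. It is an assumption-laden illustration and
+not the LARQ implementation; `TextQueryArgsSketch`, its fields, and the use of
+`java.net.URI`, `String` and `Integer` to stand for the three argument kinds
+are all hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the (property 'query string' limit) argument list.
final class TextQueryArgsSketch {
    public final URI property;       // null: no property given
    public final String queryString; // always present
    public final int limit;          // -1: no limit given

    private TextQueryArgsSketch(URI property, String queryString, int limit) {
        this.property = property;
        this.queryString = queryString;
        this.limit = limit;
    }

    // Accepts: [query], [property, query], [query, limit],
    // or [property, query, limit]. Anything else is rejected.
    public static TextQueryArgsSketch parse(List<Object> args) {
        int i = 0;
        URI property = null;
        if (i < args.size() && args.get(i) instanceof URI) {
            property = (URI) args.get(i);
            i++;
        }
        if (i >= args.size() || !(args.get(i) instanceof String)) {
            throw new IllegalArgumentException("query string is required");
        }
        String queryString = (String) args.get(i);
        i++;
        int limit = -1;
        if (i < args.size() && args.get(i) instanceof Integer) {
            limit = (Integer) args.get(i);
            i++;
        }
        if (i != args.size()) {
            throw new IllegalArgumentException("unexpected extra arguments");
        }
        return new TextQueryArgsSketch(property, queryString, limit);
    }

    public static void main(String[] args) {
        TextQueryArgsSketch a = parse(Arrays.<Object>asList("word", 10));
        System.out.println(a.property + " / " + a.queryString + " / " + a.limit);
    }
}
```

+The single-argument form, where the surrounding `( )` is omitted, corresponds
+to a one-element list in this sketch.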
+## Good practice
+The query execution does not know the selectivity of the text index.  It is
+better to use one of two styles.
+### Query pattern 1 : 
+Access to the index is first in the query and used to find a number of
+items of interest; further information is obtained about these items from
+the RDF data.
+    SELECT ?s
+    { ?s text:query (rdfs:label 'word' 10) ; 
+         rdfs:label ?label ;
+         rdf:type   ?type 
+    }
+Limit is useful here when working with large indexes to limit results to the
+higher-scoring results.
+### Query pattern 2 : Filter 
+By finding items of interest first in the RDF data, the text search can be
+used to restrict the items found still further.
+    SELECT ?s
+    { ?s rdf:type     :book ;
+         dc:creator   "John" .
+      ?s text:query   (dc:title 'word') .
+    }