Posted to commits@jena.apache.org by an...@apache.org on 2013/04/10 18:12:39 UTC
svn commit: r1466541 -
/jena/site/trunk/content/documentation/query/text-query.mdtext
Author: andy
Date: Wed Apr 10 16:12:39 2013
New Revision: 1466541
URL: http://svn.apache.org/r1466541
Log:
Text query documentation
Added:
jena/site/trunk/content/documentation/query/text-query.mdtext
Added: jena/site/trunk/content/documentation/query/text-query.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/query/text-query.mdtext?rev=1466541&view=auto
==============================================================================
--- jena/site/trunk/content/documentation/query/text-query.mdtext (added)
+++ jena/site/trunk/content/documentation/query/text-query.mdtext Wed Apr 10 16:12:39 2013
@@ -0,0 +1,287 @@
+Title: LARQ - free text searches with SPARQL
+
+LARQ is a combination of SPARQL and text search.
+
+It gives applications the ability to perform free text searches within
+SPARQL queries. Text indexes are additional information for
+accessing the RDF graph.
+
+The text index can be either [Apache Lucene](http://lucene.apache.org/core) for a
+same-machine text index, or [Apache Solr](http://lucene.apache.org/solr/)
+for a large scale enterprise search application.
+
+Some example code is available here: @@ [examples]().
+
+LARQ2 uses Lucene4 or Solr4.
+
+*Illustration*
+
+This query makes a text query for 'word' on a specific property
+(the index needs to be correctly configured) and limits the output
+to 10 matches; it then looks in the RDF data for
+the actual label. More details are given below.
+
+    PREFIX text: <http://jena.apache.org/text#>
+    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+    SELECT ?s ?label
+    { ?s text:query (rdfs:label 'word' 10) ;
+         rdfs:label ?label
+    }
+
+@@TOC
+- [Architecture](#architecture)
+- [Configuration](#configuration)
+- [Loading Data](#loading-data)
+- [Query with SPARQL](#query-with-sparql)
+- [Examples](#examples)
+- [Working with Fuseki](#working-with-fuseki)
+
+## Architecture
+
+The text index is used to provide a reverse index mapping query strings to URIs.
+The indexed text can be part of the RDF data, or the text index can be used to index
+external content, with only additional RDF held in the RDF store.
+
+The LARQ index uses the native query language of the text index:
+[Lucene query format](http://lucene.apache.org/core/4_1_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+or
+[Solr query format](http://wiki.apache.org/solr/SolrQuerySyntax).
+
+A text-supporting dataset is configured with a description of which
+properties to index. When data is added, any triple whose property matches the
+description causes an entry to be added to the index, built from the analysed
+text of the triple's object and mapping to the triple's subject.
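+
+As a small illustration, assume `rdfs:label` is the indexed property and the
+update below is sent to the text dataset (the subject URI and label text are
+made up for this sketch). The label would be analysed and an index entry
+created that maps back to `<http://example/book1>`:
+
+    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+    # Hypothetical data: the literal is analysed and indexed,
+    # and the index entry maps back to the subject URI.
+    INSERT DATA {
+        <http://example/book1> rdfs:label "A word about text indexing"
+    }
+
+A later `?s text:query 'word'` should then be able to bind `?s` to that subject.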
+
+### Pattern A: RDF data
+
+In this pattern, the text index indexes literals in the RDF data.
+Additions to the RDF data are reflected in additions to the index.
+
+(Deletes do not remove text index entries - [see below](#deletion).)
+
+### Pattern B: External content
+
+There is no requirement that the text data indexed is present in the RDF
+data. As long as the index contains text documents that match the
+index description, text search can be performed.
+
+For example, if the content of a collection of documents is indexed and the
+URI naming the document is the result of the text search, then an RDF
+dataset with the document metadata can be combined with accessing the
+content by URI.
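+
+A minimal sketch of such a query follows. It assumes the document content was
+indexed externally under the default field and that the RDF store holds
+Dublin Core metadata about each document; the `dc:title` property and the
+variable names are illustrative assumptions, not part of any fixed setup.
+
+    PREFIX text: <http://jena.apache.org/text#>
+    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
+
+    SELECT ?doc ?title
+    { ?doc text:query 'word' ;    # match against the externally indexed content
+           dc:title   ?title      # metadata held in the RDF store
+    }
+
+The application can then fetch the document content itself using the URI bound to `?doc`.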
+
+The maintenance of the index is external to the RDF data store.
+
+### External applications
+
+By using Solr, in either pattern A (RDF data indexed) or pattern B
+(external content indexed), other applications can share the
+text index with SPARQL search.
+
+## Query
+
+@@
+
+## Deletion
+
+If the text index is being maintained by changes to the RDF data, then deletion of
+RDF triples or quads does not cause entries in the index to be removed. The
+index does not store the literal indexed, nor does it store a reference
+count of how many triples refer to an index entry, so the information needed
+to delete entries is not available.
+
+In situations where this matters, the SPARQL query should look up in the
+text index, then check in the RDF data. Indeed, this may be necessary
+anyway because a text search does not necessarily give only exact matches.
+
+In the initial example:
+
+    SELECT ?s ?label
+    { ?s text:query (rdfs:label 'word' 10) ;
+         rdfs:label ?label
+    }
+
+the SPARQL query is checking that the `rdfs:label` triple exists, and if it
+does, returning the whole label.
+
+By only indexing, and not storing, literals, the index is kept smaller. It
+may be necessary to periodically rebuild the index if a large proportion
+of the RDF data changes.
+
+## Configuration
+
+The important structure is an "entity map" which defines the properties to
+index, the name of the Lucene/Solr field, and the field used for storing the URI
+itself.
+
+For common RDF use, you'd have one field, mapping a property to a text
+index field.
+
+More complex setups, with multiple properties per entity (URI), are possible.
+
+The usual way to describe an index is with a
+[Jena assembler description](../assembler/index.html).
+
+### Assemblers
+
+    @prefix :        <http://localhost/jena_example/#> .
+    @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
+    @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
+    @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
+    @prefix text:    <http://jena.apache.org/text#> .
+
+    ## Example of a TDB dataset and text index
+    ## Initialize TDB
+    [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
+    tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
+    tdb:GraphTDB    rdfs:subClassOf  ja:Model .
+
+    ## Initialize LARQ
+    [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
+    # A TextDataset is a regular dataset with a text index.
+    text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
+    # Lucene index
+    text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
+    # Solr index
+    text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
+
+    ## ---------------------------------------------------------------
+    ## This URI must be fixed - it's used to assemble the text dataset.
+
+    :text_dataset rdf:type     text:TextDataset ;
+        text:dataset   <#dataset> ;
+        text:index     <#indexLucene> ;
+        .
+
+    # A TDB dataset used for RDF storage
+    <#dataset> rdf:type      tdb:DatasetTDB ;
+        tdb:location "DB" ;
+        tdb:unionDefaultGraph true ; # Optional
+        .
+
+    # Text index description
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        ##text:directory "mem" ;
+        text:entityMap <#entMap> ;
+        .
+
+    # Mapping in the index
+    # URI stored in field "uri"
+    # rdfs:label is mapped to field "text"
+    <#entMap> a text:EntityMap ;
+        text:entityField      "uri" ;
+        text:defaultField     "text" ;
+        text:map (
+             [ text:field "text" ; text:predicate rdfs:label ]
+             ) .
+
+then use code such as:
+
+    Dataset ds = DatasetFactory.assemble(
+        "text-config.ttl",
+        "http://localhost/jena_example/#text_dataset") ;
+
+Key here is that the assembler contains two dataset definitions, one for
+the text dataset, one for the base data. Therefore, the application
+needs to identify the text dataset by its URI
+`http://localhost/jena_example/#text_dataset`.
+
+### Build with code
+
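+    import org.apache.jena.query.text.EntityDefinition;
+    import org.apache.jena.query.text.TextDatasetFactory;
+    import org.apache.lucene.store.Directory;
+    import org.apache.lucene.store.RAMDirectory;
+    import com.hp.hpl.jena.query.Dataset;
+    import com.hp.hpl.jena.query.DatasetFactory;
+    import com.hp.hpl.jena.vocabulary.RDFS;
+
+    // Example of building a text dataset with code.
+    // Example is in-memory.
+    // Base data
+    Dataset ds1 = DatasetFactory.createMem() ;
+
+    // Entity field "uri", default text field "text", indexing rdfs:label
+    EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label.asNode()) ;
+
+    // Lucene, in memory.
+    Directory dir = new RAMDirectory();
+
+    // Join together into a dataset
+    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, entDef) ;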
+
+### Fuseki
+
+The Fuseki configuration simply points to the text dataset as the
+`fuseki:dataset` of the service.
+
+    <#service_text_tdb> rdf:type fuseki:Service ;
+        rdfs:label                      "TDB/text service" ;
+        fuseki:name                     "ds" ;
+        fuseki:serviceQuery             "query" ;
+        fuseki:serviceQuery             "sparql" ;
+        fuseki:serviceUpdate            "update" ;
+        fuseki:serviceUpload            "upload" ;
+        fuseki:serviceReadGraphStore    "get" ;
+        fuseki:serviceReadWriteGraphStore    "data" ;
+        fuseki:dataset                  :text_dataset ;
+        .
+
+## Query with SPARQL
+
+The property function is `http://jena.apache.org/text#query`, more
+conveniently written:
+
+    PREFIX text: <http://jena.apache.org/text#>
+
+    ... text:query ...
+
+This is different to LARQ v1.
+
+The following forms are all legal:
+
+    ?s text:query 'word'                   # query
+    ?s text:query (rdfs:label 'word')      # query specific property if multiple
+    ?s text:query ('word' 10)              # with limit on results
+
+The most general form is:
+
+    ?s text:query (property 'query string' limit)
+
+Only the query string is required, and if it is the only argument the
+surrounding `( )` can be omitted.
+
+The property URI is only necessary if multiple properties have been indexed.
+
+| Argument          | Definition                                      |
+|-------------------|-------------------------------------------------|
+| property          | The property URI (including prefixed name form) |
+| query string      | The native query string                         |
+| limit             | The limit on the number of results              |
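+
+As a hedged sketch of the general form, the query below supplies all three
+arguments. The query string uses the `AND` operator of the Lucene classic
+query syntax; the search terms and the limit of 5 are arbitrary values
+chosen for illustration, and `rdfs:label` is assumed to be an indexed property.
+
+    PREFIX text: <http://jena.apache.org/text#>
+    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+    SELECT ?s
+    { # property, native query string, limit
+      ?s text:query (rdfs:label 'word1 AND word2' 5)
+    }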
+
+@@Example
+
+## Good practice
+
+The query engine does not know the selectivity of the text index, so it is
+better to write the query explicitly in one of two styles.
+
+### Query pattern 1 : Text search first
+
+Access to the index is first in the query and used to find a number of
+items of interest; further information is obtained about these items from
+the RDF data.
+
+    SELECT ?s
+    { ?s text:query (rdfs:label 'word' 10) ;
+         rdfs:label ?label ;
+         rdf:type   ?type
+    }
+
+The limit is useful here when working with large indexes, to restrict the
+results to the higher-scoring matches.
+
+### Query pattern 2 : Filter
+
+By finding items of interest first in the RDF data, the text search can be
+used to restrict the items found still further.
+
+    SELECT ?s
+    { ?s rdf:type     :book ;
+         dc:creator   "John" .
+      ?s text:query   (dc:title 'word') ;
+    }