Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/04/11 18:41:14 UTC
[ANN] Jena and text search
There is a new, experimental module - jena-text - for anyone interested
in trying it out.
This is a possible replacement for LARQ (whether to call it "LARQ2" or
something else is for discussion). It is not compatible with current LARQ1.
== Features
* works in Fuseki, with assembler setup,
  without the need for additional Java code.
* tracks additions to the dataset
* works with Lucene4, and with Solr4 for sharing
  the text index with non-SPARQL apps.
* simpler and smaller index design
== Documentation
http://jena.staging.apache.org/documentation/query/text-query.html
== Example query
# text search on rdfs:label for occurrences of "word"
# then retrieve the actual value from the RDF data
PREFIX : <http://example/>
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}
== Download
It's available from the Apache snapshot Maven repository:
https://repository.apache.org/content/repositories/snapshots/org/apache/jena/
Depends on Jena 2.10.1 SNAPSHOT.
SVN is currently:
https://svn.apache.org/repos/asf/jena/Experimental/jena-text/
== Fuseki
There is a special build of Fuseki in the jena-text artifact area:
(you will need a copy of the pages/ directory from the Fuseki distribution
if you want the web pages as well)
There is an example of a Fuseki config at the end of this message.
== Notes
Currently, it does not expose the match score. The real requirement we
found behind that is to retain the ordering of text search results, and
the score is only a partial solution (two hits can have the same score).
Maybe we need a "row id".
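To illustrate the ordering point, here is a small sketch in Python. It is
only one reading of the note, not how jena-text works: the hits, the
"row" field and the two function names are hypothetical, made up for the
example. It shows that once result rows have lost their order, sorting by
score cannot tell tied hits apart, while a row id can.

```python
# Hypothetical hits, in the order the text index returned them.
# "row" is the position in that order (the "row id" idea from the note).
index_order = [
    {"row": 0, "uri": "http://example/a", "score": 0.9},
    {"row": 1, "uri": "http://example/b", "score": 0.5},
    {"row": 2, "uri": "http://example/c", "score": 0.5},
]

# Assume later query processing hands the rows back in some other order.
shuffled = [index_order[2], index_order[0], index_order[1]]


def restore_by_score(rows):
    """Sort by descending score; tied rows keep whatever order they arrived in."""
    return sorted(rows, key=lambda r: r["score"], reverse=True)


def restore_by_row_id(rows):
    """Sort by row id; this always recovers the index order."""
    return sorted(rows, key=lambda r: r["row"])


by_score = [r["uri"] for r in restore_by_score(shuffled)]
by_row = [r["uri"] for r in restore_by_row_id(shuffled)]
print(by_score)  # b and c are tied, so their original order is not recovered
print(by_row)    # same order as index_order
```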
Not tested heavily at scale.
Many thanks to Brian McBride (Epimorphics) who has contributed testing,
bug fixes and generally made it better.
Comments and feedback are especially welcome - it is easier to change
things before the first release, after which APIs become depended upon.
Andy
## Example of a TDB dataset and text index published using Fuseki
@prefix : <#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text: <http://jena.apache.org/text#> .
[] rdf:type fuseki:Server ;
   fuseki:services (
     <#service_text_tdb>
   ) .
# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB rdfs:subClassOf ja:Model .
# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset rdfs:subClassOf ja:RDFDataset .
#text:TextIndexSolr rdfs:subClassOf text:TextIndex .
text:TextIndexLucene rdfs:subClassOf text:TextIndex .
## ---------------------------------------------------------------
<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label "TDB/text service" ;
    fuseki:name "ds" ;
    fuseki:serviceQuery "query" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceUpload "upload" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:dataset <#text_dataset> ;
    .
<#text_dataset> rdf:type text:TextDataset ;
    text:dataset <#dataset> ;
    ##text:index <#indexSolr> ;
    text:index <#indexLucene> ;
    .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:defaultField "text" ; ## Must be defined in the text:map
    text:map (
        # rdfs:label
        [ text:field "text" ; text:predicate rdfs:label ]
    ) .
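
For trying the service out from a script, here is a minimal sketch in
Python that sends the example query over the standard SPARQL protocol.
It assumes Fuseki is running on localhost on its default port 3030 with
the configuration above, so the dataset is "ds" and the query endpoint is
"query". The function names build_query_url and run_query are
hypothetical, made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Fuseki on localhost, default port 3030.
FUSEKI_BASE = "http://localhost:3030"
DATASET = "ds"             # fuseki:name in the config
QUERY_ENDPOINT = "query"   # one of the fuseki:serviceQuery names

# The example query from the announcement.
TEXT_QUERY = """
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}
"""


def build_query_url(base=FUSEKI_BASE, dataset=DATASET,
                    endpoint=QUERY_ENDPOINT, query=TEXT_QUERY):
    """Build a SPARQL protocol GET URL with the query as a URL parameter."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base}/{dataset}/{endpoint}?{params}"


def run_query(url):
    """Send the query and return (subject, label) pairs from the JSON results."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return [(b["s"]["value"], b["label"]["value"])
            for b in results["results"]["bindings"]]


# Usage, with the server running:
#   for subject, label in run_query(build_query_url()):
#       print(subject, label)
```

The URL building is kept separate from the network call so it can be
checked without a running server.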