Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/04/11 18:41:14 UTC
[ANN] Jena and text search

There is a new, experimental module - jena-text - for anyone interested
in trying it out.

This is a possible replacement for LARQ (whether to call it "LARQ2" or
something else is for discussion).  It is not compatible with the
current LARQ (LARQ1).

== Features

* works in Fuseki, with assembler setup,
   without the need for additional Java code.

* tracks additions to the dataset

* works with Lucene4, and with Solr4 for sharing
   the text index with non-SPARQL apps.

* simpler and smaller index design

== Documentation

http://jena.staging.apache.org/documentation/query/text-query.html

== Example query

# text search on rdfs:label for occurrences of "word"
# then retrieve the actual value from the RDF data
PREFIX :     <http://example/>
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
{ ?s text:query (rdfs:label 'word') ;
      rdfs:label ?label
}

== Download

It's available from the Apache snapshot Maven repository:

https://repository.apache.org/content/repositories/snapshots/org/apache/jena/

Depends on Jena 2.10.1 SNAPSHOT.

SVN is currently:
https://svn.apache.org/repos/asf/jena/Experimental/jena-text/

== Fuseki

There is a special build of Fuseki in the jena-text artifact area:

(you will need a copy of the pages/ directory from the Fuseki
distribution if you want the webpages as well)

There is an example of a Fuseki config at the end of this message.

== Notes

Currently, it does not expose the match score.  The real requirement we
found behind asking for the score is to retain the ordering of text
search results, and a score is only a partial solution to that (two hits
can have the same score).  Maybe we need a "row id".
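
To illustrate the ordering point, here is a small sketch in Python. The
hits, URIs, scores and row ids are all made up for illustration;
jena-text does not currently expose either value. It only shows why a
score alone cannot always restore the original hit order, while a
"row id" could.

```python
# Hypothetical hits, in the order the text index returned them:
# (row_id, uri, score). None of these values come from jena-text.
hits = [
    (0, "http://example/a", 0.9),
    (1, "http://example/b", 0.5),
    (2, "http://example/c", 0.5),  # same score as b
]

# Suppose later query processing loses the original order.
scrambled = [hits[2], hits[0], hits[1]]

# Sorting on score alone cannot tell b and c apart: the tied rows stay
# in whatever order they happened to arrive in.
by_score = sorted(scrambled, key=lambda h: -h[2])
uris_by_score = [h[1] for h in by_score]

# Sorting on a row id restores the original order exactly.
by_row_id = sorted(scrambled, key=lambda h: h[0])
uris_by_row_id = [h[1] for h in by_row_id]
```

Here the score-sorted list puts c before b, which is not the order the
index produced; the row-id-sorted list matches it.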

Not tested heavily at scale.


Many thanks to Brian McBride (Epimorphics) who has contributed testing, 
bug fixes and generally made it better.

Comments and feedback are especially welcome - it is easier to change
things before the first release, after which APIs become depended upon.

     Andy


## Example of a TDB dataset and text index published using Fuseki

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

[] rdf:type fuseki:Server ;
    fuseki:services (
      <#service_text_tdb>
    ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
     rdfs:label                      "TDB/text service" ;
     fuseki:name                     "ds" ;
     fuseki:serviceQuery             "query" ;
     fuseki:serviceQuery             "sparql" ;
     fuseki:serviceUpdate            "update" ;
     fuseki:serviceUpload            "upload" ;
     fuseki:serviceReadGraphStore    "get" ;
     fuseki:serviceReadWriteGraphStore    "data" ;
     fuseki:dataset                  <#text_dataset> ;
     .

<#text_dataset> rdf:type     text:TextDataset ;
     text:dataset   <#dataset> ;
     ##text:index   <#indexSolr> ;
     text:index     <#indexLucene> ;
     .

<#dataset> rdf:type      tdb:DatasetTDB ;
     tdb:location "DB" ;
     tdb:unionDefaultGraph true ;
     .

<#indexSolr> a text:TextIndexSolr ;
     #text:server <http://localhost:8983/solr/COLLECTION> ;
     text:server <embedded:SolrARQ> ;
     text:entityMap <#entMap> ;
     .

<#indexLucene> a text:TextIndexLucene ;
     text:directory <file:Lucene> ;
     ##text:directory "mem" ;
     text:entityMap <#entMap> ;
     .

<#entMap> a text:EntityMap ;
     text:entityField      "uri" ;
     text:defaultField     "text" ; ## Must be defined in the text:map
     text:map (
          # rdfs:label
          [ text:field "text" ; text:predicate rdfs:label ]
          ) .
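
For trying the example query against a server started with a config like
the one above, here is a minimal sketch in Python using only the standard
library. It assumes Fuseki is running locally on its default port 3030;
the path /ds/query follows from fuseki:name "ds" and fuseki:serviceQuery
"query" in the config. The function names are hypothetical helpers, not
part of Jena or Fuseki.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request, urlopen

# The text search query from this message.
TEXT_QUERY = """\
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}
"""

# Assumption: local Fuseki on port 3030, dataset "ds", service "query".
DEFAULT_ENDPOINT = "http://localhost:3030/ds/query"


def build_query_request(query, endpoint=DEFAULT_ENDPOINT):
    """Build a SPARQL Protocol GET request with the query URL-encoded
    in the 'query' parameter, asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_text_query(query=TEXT_QUERY, endpoint=DEFAULT_ENDPOINT):
    """Send the query and return (subject, label) pairs.
    Needs a running server; not called automatically."""
    with urlopen(build_query_request(query, endpoint)) as resp:
        results = json.load(resp)
    return [
        (row["s"]["value"], row["label"]["value"])
        for row in results["results"]["bindings"]
    ]
```

With the server up and some rdfs:label data loaded, calling
run_text_query() should return the subjects whose label matches "word",
together with the label values.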