Posted to users@jena.apache.org by Brad Moran <bm...@pinnacle21.net> on 2013/08/05 22:49:48 UTC
Jena Text Search Help
I have an existing Jena TDB based on this example RDF:
<mms:DataElement rdf:ID="DE.Intervention.--MODIFY">
  <sdtms:dataElementRole rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.SynonymQualifier"/>
  <sdtms:supportedBySEND rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</sdtms:supportedBySEND>
  <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger">2</mms:ordinal>
  <mms:dataElementDescription rdf:datatype="http://www.w3.org/2001/XMLSchema#string">If the value for --TRT is modified for coding purposes, then the modified text is placed here.</mms:dataElementDescription>
  <mms:dataElementName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">--MODIFY</mms:dataElementName>
  <sdtms:dataElementType rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.Character"/>
  <sdtms:supportedBySDTMIG rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</sdtms:supportedBySDTMIG>
  <mms:dataElementLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Modified Treatment Name</mms:dataElementLabel>
  <mms:dataElementType rdf:datatype="http://www.w3.org/2001/XMLSchema#QName">xsd:string</mms:dataElementType>
  <mms:context>
    <mms:VariableGrouping rdf:ID="InterventionVariables">
      <mms:contextLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Interventions Observation Class Variables</mms:contextLabel>
      <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger">1</mms:ordinal>
      <mms:context rdf:resource="#Model.SDTM-1-2"/>
    </mms:VariableGrouping>
  </mms:context>
  <sdtms:qualifies>
    <mms:DataElement rdf:ID="DE.Intervention.--TRT">
      <mms:dataElementName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">--TRT</mms:dataElementName>
      <sdtms:dataElementRole rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.TopicVariable"/>
      <sdtms:dataElementType rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.Character"/>
      <mms:dataElementDescription rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The topic for the intervention observation, usually the verbatim name of the treatment, drug, medicine, or therapy given during the dosing interval for the observation.</mms:dataElementDescription>
      <mms:dataElementLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Name of Treatment</mms:dataElementLabel>
      <sdtms:supportedBySDTMIG rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</sdtms:supportedBySDTMIG>
      <mms:context rdf:resource="#InterventionVariables"/>
      <mms:dataElementType rdf:datatype="http://www.w3.org/2001/XMLSchema#QName">xsd:string</mms:dataElementType>
      <sdtms:supportedBySEND rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</sdtms:supportedBySEND>
      <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger">1</mms:ordinal>
    </mms:DataElement>
  </sdtms:qualifies>
</mms:DataElement>
This is one of two forms of RDF in the TDB; the second is:
<mms:PermissibleValue rdf:ID="C81224.C81203">
  <mms:inValueDomain>
    <mms:EnumeratedValueDomain rdf:ID="C81224">
      <cts:cdiscDefinition rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Derivation Type: Analysis value derivation method.</cts:cdiscDefinition>
      <cts:nciPreferredTerm rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CDISC ADaM Derivation Type Terminology</cts:nciPreferredTerm>
      <cts:nciCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string">C81224</cts:nciCode>
      <cts:cdiscSynonyms rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Derivation Type</cts:cdiscSynonyms>
      <cts:cdiscSubmissionValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string">DTYPE</cts:cdiscSubmissionValue>
      <cts:codelistName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Derivation Type</cts:codelistName>
      <cts:isExtensibleCodelist rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</cts:isExtensibleCodelist>
    </mms:EnumeratedValueDomain>
  </mms:inValueDomain>
  <cts:nciPreferredTerm rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Worst Case Imputation Technique</cts:nciPreferredTerm>
  <cts:nciCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string">C81203</cts:nciCode>
  <cts:cdiscDefinition rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Worst Case: A data imputation technique which populates missing values with the worst possible outcome.</cts:cdiscDefinition>
  <cts:cdiscSubmissionValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string">WC</cts:cdiscSubmissionValue>
</mms:PermissibleValue>
I have built a Jena TDB store from several of these RDF files, so it is
large, and I have several SPARQL queries that work as desired. I am now
trying to implement full-text search on this TDB. I have downloaded the
Jena 2.10.2 snapshot jars and figured out my dependencies. I would like to
implement this text search in Java code using the new Jena Text Search
feature. This is my best attempt at solving the problem so far:
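import java.io.File;

import org.apache.jena.query.text.EntityDefinition;
import org.apache.jena.query.text.TextDatasetFactory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.vocabulary.RDFS;

public class TextSearchTest {
    public static void main(String[] args) {
        try {
            String DBDirectory = "tdb";
            // Open the directory holding the Lucene index to be queried
            String indexDir = "luceneIndexes";
            File file = new File(indexDir);
            Directory dir = FSDirectory.open(file);
            // Open the existing TDB dataset and describe what to index
            Dataset ds1 = TDBFactory.createDataset(DBDirectory);
            String uri = "<http://rdf.cdisc.org/mms#dataElement>";
            String property = "<http://rdf.cdisc.org/mms#dataElementName>";
            EntityDefinition entDef = new EntityDefinition(uri, property,
                    RDFS.Literal); // RDFS.label
            // Wrap the TDB dataset with the Lucene text index
            Dataset dataset = TextDatasetFactory.createLucene(ds1, dir, entDef);
            // Try a text query inside a read transaction
            dataset.begin(ReadWrite.READ);
            QueryExecution qExec = QueryExecutionFactory.create(
                    "PREFIX text: <http://jena.apache.org/text#> "
                    + "PREFIX mms: <http://rdf.cdisc.org/mms#> "
                    + "SELECT * WHERE{?s text:query (mms:dataElementName 'AE')}",
                    dataset);
            ResultSet rs = qExec.execSelect();
            ResultSetFormatter.out(rs);
            dataset.end();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}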
This results in: WARN o.apache.jena.query.text.TextQueryPF - Predicate not
indexed: http://rdf.cdisc.org/mms#dataElementName
and an empty result set is printed by ResultSetFormatter. It does not
seem to create an index for the TDB. I believe the problem is in my
EntityDefinition (mainly because I am not sure where the parameters
entityField, primaryField, and primaryPredicate should come from). Also, in
the example code it seems a Lucene index is created and then the data is
loaded by an assembler file. Maybe I am just implementing this wrong. So, to
wrap this up:
1. Do I need to use an assembler file?
2. Can I create an index from an existing TDB, or do I need to create the
index as I create the TDB?
3. Could you give me a description of the parameters of the EntityDefinition
class and where they come from? (In the RDF, maybe?)
4. Any general advice on how I can solve this problem from my code?
I tried to be as specific as possible here in hopes that you may be able to
guide me in the right direction. If I left anything out, just let me know and
hopefully I can explain better. Thanks.
--Brad
Re: Jena Text Search Help
Posted by Andy Seaborne <an...@apache.org>.
On 06/08/13 17:18, Brad Moran wrote:
> Ok, since I already have the TDB built, it seems the best plan would be to
> create an assembler file and then use the jena.textindexer application.
> Sorry, these are the namespaces:
>
> xmlns:mms="http://rdf.cdisc.org/mms#"
> xmlns="http://rdf.cdisc.org/sdtm-1-2/std#"
> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
> xmlns:skos="http://www.w3.org/2004/02/skos/core#"
> xmlns:owl="http://www.w3.org/2002/07/owl#"
> xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
> xmlns:sdtms="http://rdf.cdisc.org/sdtm-1-2/schema#"
> xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
> xmlns:cts="http://rdf.cdisc.org/ct/schema#"
> xml:base="http://rdf.cdisc.org/sdtm-1-2/std">
>
> I have no experience with assembler files so I based mine off the example
> on documentation. Does this look right?
Yes.
Starting with the working example and tweaking bit by bit (binary
search!) until it is what you want is a good approach.
Andy
Re: Jena Text Search Help
Posted by Brad Moran <bm...@pinnacle21.net>.
Ok, since I already have the TDB built, it seems the best plan would be to
create an assembler file and then use the jena.textindexer application.
Sorry, these are the namespaces:
xmlns:mms="http://rdf.cdisc.org/mms#"
xmlns="http://rdf.cdisc.org/sdtm-1-2/std#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:sdtms="http://rdf.cdisc.org/sdtm-1-2/schema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:cts="http://rdf.cdisc.org/ct/schema#"
xml:base="http://rdf.cdisc.org/sdtm-1-2/std">
I have no experience with assembler files, so I based mine on the example
in the documentation. Does this look right?
@prefix : <http://localhost/jena_example/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text: <http://jena.apache.org/text#> .
@prefix mms: <http://rdf.cdisc.org/mms#> .

## Example of a TDB dataset and text index

## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB rdfs:subClassOf ja:Model .

## Initialize text query
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset rdfs:subClassOf ja:RDFDataset .
# Lucene index
text:TextIndexLucene rdfs:subClassOf text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type text:TextDataset ;
    text:dataset <#dataset> ;
    text:index <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "tdb" ;
    tdb:unionDefaultGraph true ; # Optional
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:luceneIndexes> ;
    text:entityMap <#entMap> ;
    .

# Mapping in the index
# URI stored in field "uri"
# mms:dataElementName and mms:dataElementDescription are mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:defaultField "text" ;
    text:map (
        [ text:field "text" ; text:predicate mms:dataElementName ]
        [ text:field "text" ; text:predicate mms:dataElementDescription ]
        # the rest of the fields?
    ) .
Re: Jena Text Search Help
Posted by Andy Seaborne <an...@apache.org>.
On 05/08/13 21:49, Brad Moran wrote:
> I have an existing Jena TDB based on this example RDF:
>
...
>
> I have compiled a Jena TDB based on several of these RDF files so it is a
> large TDB and have several SPARQL queries that work as desired. I am now
> trying to implement a full text search on this TDB. I have downloaded the
> Jena 2.10.2 Snapshot jars and figured out my dependencies. I would like to
> implement this text search through java code using the new Jena Text Search
> feature. This is my best attempt at solving the problem so far:
>
> public class TextSearchTest {
> public static void main(String[] args)
> {
> try{
> String DBDirectory = "tdb";
>
> // Construct the Lucene Index to be queried
>
> String indexDir = "luceneIndexes";
> File file = new File(indexDir);
> Directory dir = FSDirectory.open(file);
>
> // Create the in memory text index described
> Dataset ds1 = TDBFactory.createDataset(DBDirectory);
> String uri = "<http://rdf.cdisc.org/mms#dataElement>";
> String property = "<http://rdf.cdisc.org/mms#dataElementName>";
> EntityDefinition entDef = new EntityDefinition(uri, property,
> RDFS.Literal);//RDFS.label
This defines the text index to be working on a particular property.
You want to pass in a resource (Resource or Property object) for
http://rdf.cdisc.org/mms#dataElementName here.
> // Construct the Lucene Index to be queried
> Dataset dataset = TextDatasetFactory.createLucene(ds1, dir,
> entDef);
I hope you loaded the data into this dataset, not the underlying TDB one,
because otherwise the text indexer would not have seen the RDF triples
to index.
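A minimal sketch of loading data through the text dataset, so that triples are seen by the indexer as they are added. It assumes the Jena 2.10.x package names (`com.hp.hpl.jena...`) and the `TextDatasetFactory.createLucene` call already used in the code above. The class name, the field names "uri" and "text", and the file name "data.rdf" are placeholders for illustration only.

```java
import java.io.File;
import java.io.IOException;

import org.apache.jena.query.text.EntityDefinition;
import org.apache.jena.query.text.TextDatasetFactory;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.tdb.TDBFactory;

public class LoadThroughTextDataset {
    /** Wraps a base dataset with a Lucene index on mms:dataElementName. */
    static Dataset wrap(Dataset base, Directory indexDir) {
        // "uri" and "text" are Lucene field names chosen for this sketch.
        EntityDefinition entDef = new EntityDefinition("uri", "text",
                ResourceFactory.createProperty("http://rdf.cdisc.org/mms#dataElementName"));
        return TextDatasetFactory.createLucene(base, indexDir, entDef);
    }

    /** Reads a file into the default graph of the text dataset in a write transaction. */
    static void load(Dataset textDataset, String file) {
        textDataset.begin(ReadWrite.WRITE);
        try {
            Model m = textDataset.getDefaultModel();
            RDFDataMgr.read(m, file);
            textDataset.commit();
        } finally {
            textDataset.end();
        }
    }

    public static void main(String[] args) throws IOException {
        Dataset base = TDBFactory.createDataset("tdb");
        Directory indexDir = FSDirectory.open(new File("luceneIndexes"));
        // Load via the wrapped dataset, not via 'base', so the index is updated.
        load(wrap(base, indexDir), "data.rdf"); // hypothetical input file
    }
}
```

The point of the design is only that the write goes through the wrapped dataset; writing to the plain TDB dataset would store the triples but leave the Lucene index untouched.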
>
> // try query
> dataset.begin(ReadWrite.READ);
> QueryExecution qExec = QueryExecutionFactory.create(
> "PREFIX text: <http://jena.apache.org/text#> PREFIX
> mms: <http://rdf.cdisc.org/mms#> "
> + "SELECT * WHERE{?s text:query
> (mms:dataElementName 'AE')}", dataset);
>
> ResultSet rs = qExec.execSelect();
> ResultSetFormatter.out(rs);
>
> dataset.end();
> }
> catch(Exception e){
> System.out.println(e);
> }
> }
> }
>
>
> This results in: WARN o.apache.jena.query.text.TextQueryPF - Predicate not
> indexed: http://rdf.cdisc.org/mms#dataElementName
Because that field isn't being indexed.
You can have several fields indexed if you call .set on the EntityDefinition
with additional predicates.
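As a rough illustration of indexing more than one predicate, the sketch below calls `set` on an `EntityDefinition` for a second property. It assumes the Jena 2.10.x package names; the class name and the field name "description" are hypothetical choices, not taken from the thread.

```java
import org.apache.jena.query.text.EntityDefinition;

import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.ResourceFactory;

public class MultiFieldDefinition {
    static final String MMS = "http://rdf.cdisc.org/mms#";
    static final Property NAME =
            ResourceFactory.createProperty(MMS + "dataElementName");
    static final Property DESCRIPTION =
            ResourceFactory.createProperty(MMS + "dataElementDescription");

    static EntityDefinition build() {
        // "uri" stores the subject; "text" is the primary field, fed by NAME.
        EntityDefinition entDef = new EntityDefinition("uri", "text", NAME);
        // Index a second predicate under its own (hypothetical) field name.
        entDef.set("description", DESCRIPTION.asNode());
        return entDef;
    }

    public static void main(String[] args) {
        EntityDefinition entDef = build();
        // Print the Lucene field each predicate is mapped to.
        System.out.println(entDef.getField(NAME.asNode()));
        System.out.println(entDef.getField(DESCRIPTION.asNode()));
    }
}
```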
> and an empty result set is printed out by resultSetFormatter. It does not
> seem to create an index for the TDB.
> I believe my problem occurs with my
> EntityDefinition (mainly because I am not sure where the parameters
> entityField, primaryField, and primaryPredicate should come from). Also in
> the example code it seems a lucene index is created then the data is loaded
> by an assembler file. Maybe I am just implementing this wrong. So to try to
> wrap this up:
>
> 1. Do I need to use an assembler file?
No but it may be easier that way.
> 2. Can I create an index from an existing TDB or do I need to create the
> index as I create the TDB.
As the data is loaded.
There is a simple application 'jena.textindexer' which will create the
index from existing data.
http://jena.staging.apache.org/documentation/query/text-query.html#building-a-text-index
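Once an index exists for the data (the documentation linked above shows how the indexer is run against an assembler description), the same description can be opened from Java and queried. The following is a sketch under assumptions: "text-config.ttl" is a hypothetical file name for the assembler description, the class and helper names are invented, and the search term is concatenated without any escaping.

```java
import org.apache.jena.query.text.TextDatasetFactory;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.query.ResultSetFormatter;

public class QueryAssembledTextDataset {
    /** Builds the text query from the thread for a given search term. */
    static String queryFor(String term) {
        return "PREFIX text: <http://jena.apache.org/text#> "
             + "PREFIX mms: <http://rdf.cdisc.org/mms#> "
             + "SELECT * WHERE { ?s text:query (mms:dataElementName '" + term + "') }";
    }

    public static void main(String[] args) {
        // Assemble the text dataset (TDB store plus Lucene index) from the description.
        Dataset dataset = TextDatasetFactory.create("text-config.ttl");
        dataset.begin(ReadWrite.READ);
        try {
            QueryExecution qExec = QueryExecutionFactory.create(queryFor("AE"), dataset);
            try {
                ResultSetFormatter.out(qExec.execSelect());
            } finally {
                qExec.close();
            }
        } finally {
            dataset.end();
        }
    }
}
```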
> 3. Could you give me a description of the parameters of EntityDefintion
> class and where they come from? (in the rdf maybe?)
Create a Property object for http://rdf.cdisc.org/mms#dataElementName and
pass that in as the 3rd argument.
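To make the three constructor arguments concrete: the first two are names of fields in the Lucene index (one for the entity URI, one for the indexed text), and the third is the RDF predicate whose literals are indexed. The sketch below compares the call from the original code with a corrected one and looks up which field, if any, `mms:dataElementName` maps to. It assumes Jena 2.10.x package names; the class and method names are invented, and the expectation that the original definition has no field for the predicate is inferred from the "Predicate not indexed" warning.

```java
import org.apache.jena.query.text.EntityDefinition;

import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.vocabulary.RDFS;

public class PredicateLookup {
    static final Property NAME =
            ResourceFactory.createProperty("http://rdf.cdisc.org/mms#dataElementName");

    /** The definition as written in the question: IRIs as field names, RDFS.Literal as predicate. */
    static EntityDefinition original() {
        return new EntityDefinition("<http://rdf.cdisc.org/mms#dataElement>",
                "<http://rdf.cdisc.org/mms#dataElementName>", RDFS.Literal);
    }

    /** Plain field names, with the property to index as the third argument. */
    static EntityDefinition corrected() {
        return new EntityDefinition("uri", "text", NAME);
    }

    /** Returns the index field that mms:dataElementName maps to, or null if unmapped. */
    static String fieldFor(EntityDefinition entDef) {
        return entDef.getField(NAME.asNode());
    }

    public static void main(String[] args) {
        System.out.println("original:  " + fieldFor(original()));
        System.out.println("corrected: " + fieldFor(corrected()));
    }
}
```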
> 4. Any general advice on how I can solve this problem from my code.
>
> I tried to be as specific as possible here in hopes that you may be able to
> guide me in the right direction. If I left anything out just let me out and
> hopefully I can explain better. Thanks.
Minor in this case, but the data is incomplete RDF/XML (no namespaces),
so I didn't try using it.
Our mantra is "complete, minimal example". Both "complete" and
"minimal" make it much, much easier to give good answers.
>
> --Brad
>
Andy