Posted to users@jena.apache.org by Brad Moran <bm...@pinnacle21.net> on 2013/08/05 22:49:48 UTC

Jena Text Search Help

I have an existing Jena TDB based on this example RDF:

   <mms:DataElement rdf:ID="DE.Intervention.--MODIFY">
    <sdtms:dataElementRole rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.SynonymQualifier"/>
    <sdtms:supportedBySEND rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
    >true</sdtms:supportedBySEND>
    <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger"
    >2</mms:ordinal>
    <mms:dataElementDescription rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >If the value for --TRT is modified for coding purposes, then the modified text is placed here.</mms:dataElementDescription>
    <mms:dataElementName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >--MODIFY</mms:dataElementName>
    <sdtms:dataElementType rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.Character"/>
    <sdtms:supportedBySDTMIG rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
    >true</sdtms:supportedBySDTMIG>
    <mms:dataElementLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >Modified Treatment Name</mms:dataElementLabel>
    <mms:dataElementType rdf:datatype="http://www.w3.org/2001/XMLSchema#QName"
    >xsd:string</mms:dataElementType>
    <mms:context>
      <mms:VariableGrouping rdf:ID="InterventionVariables">
        <mms:contextLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Interventions Observation Class Variables</mms:contextLabel>
        <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger"
        >1</mms:ordinal>
        <mms:context rdf:resource="#Model.SDTM-1-2"/>
      </mms:VariableGrouping>
    </mms:context>
    <sdtms:qualifies>
      <mms:DataElement rdf:ID="DE.Intervention.--TRT">
        <mms:dataElementName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >--TRT</mms:dataElementName>
        <sdtms:dataElementRole rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.TopicVariable"/>
        <sdtms:dataElementType rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.Character"/>
        <mms:dataElementDescription rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >The topic for the intervention observation, usually the verbatim name of the treatment, drug, medicine, or therapy given during the dosing interval          for the observation.</mms:dataElementDescription>
        <mms:dataElementLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Name of Treatment</mms:dataElementLabel>
        <sdtms:supportedBySDTMIG rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
        >true</sdtms:supportedBySDTMIG>
        <mms:context rdf:resource="#InterventionVariables"/>
        <mms:dataElementType rdf:datatype="http://www.w3.org/2001/XMLSchema#QName"
        >xsd:string</mms:dataElementType>
        <sdtms:supportedBySEND rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
        >true</sdtms:supportedBySEND>
        <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger"
        >1</mms:ordinal>
      </mms:DataElement>
    </sdtms:qualifies>
   </mms:DataElement>


This is one of the two forms of RDF in the TDB; the second is:

   <mms:PermissibleValue rdf:ID="C81224.C81203">
    <mms:inValueDomain>
      <mms:EnumeratedValueDomain rdf:ID="C81224">
        <cts:cdiscDefinition rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Derivation Type: Analysis value derivation method.</cts:cdiscDefinition>
        <cts:nciPreferredTerm rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >CDISC ADaM Derivation Type Terminology</cts:nciPreferredTerm>
        <cts:nciCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >C81224</cts:nciCode>
        <cts:cdiscSynonyms rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Derivation Type</cts:cdiscSynonyms>
        <cts:cdiscSubmissionValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >DTYPE</cts:cdiscSubmissionValue>
        <cts:codelistName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Derivation Type</cts:codelistName>
        <cts:isExtensibleCodelist rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
        >true</cts:isExtensibleCodelist>
      </mms:EnumeratedValueDomain>
    </mms:inValueDomain>
    <cts:nciPreferredTerm rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >Worst Case Imputation Technique</cts:nciPreferredTerm>
    <cts:nciCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >C81203</cts:nciCode>
    <cts:cdiscDefinition rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >Worst Case: A data imputation technique which populates missing values with the worst possible outcome.</cts:cdiscDefinition>
    <cts:cdiscSubmissionValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >WC</cts:cdiscSubmissionValue>
  </mms:PermissibleValue>


I have built a large Jena TDB from several of these RDF files, and I have
several SPARQL queries that work as desired. I am now trying to implement
full-text search on this TDB. I have downloaded the Jena 2.10.2 snapshot jars
and sorted out my dependencies. I would like to implement the text search in
Java code using the new Jena Text Search feature. This is my best attempt at
solving the problem so far:

 public class TextSearchTest {
    public static void main(String[] args)
    {
        try{
            String DBDirectory = "tdb";

            // Construct the Lucene Index to be queried

            String indexDir = "luceneIndexes";
            File file = new File(indexDir);
            Directory dir = FSDirectory.open(file);

            // Create the in memory text index described
            Dataset ds1 = TDBFactory.createDataset(DBDirectory);
            String uri = "<http://rdf.cdisc.org/mms#dataElement>";
            String property = "<http://rdf.cdisc.org/mms#dataElementName>";
            EntityDefinition entDef = new EntityDefinition(uri, property, RDFS.Literal);//RDFS.label
            // Construct the Lucene Index to be queried
            Dataset dataset = TextDatasetFactory.createLucene(ds1, dir, entDef);

            // try query
            dataset.begin(ReadWrite.READ);
                QueryExecution qExec = QueryExecutionFactory.create(
                        "PREFIX text: <http://jena.apache.org/text#> PREFIX mms: <http://rdf.cdisc.org/mms#> "
                        + "SELECT * WHERE{?s text:query (mms:dataElementName 'AE')}", dataset);

                ResultSet rs = qExec.execSelect();
                ResultSetFormatter.out(rs);

            dataset.end();
        }
        catch(Exception e){
            System.out.println(e);
        }
    }
}


This results in: WARN  o.apache.jena.query.text.TextQueryPF - Predicate not
indexed: http://rdf.cdisc.org/mms#dataElementName
and ResultSetFormatter prints an empty result set. It does not seem to create
an index for the TDB. I believe the problem lies in my EntityDefinition
(mainly because I am not sure where the parameters entityField, primaryField,
and primaryPredicate should come from). Also, in the example code it seems a
Lucene index is created and then the data is loaded through an assembler
file. Maybe I am just implementing this wrong. To sum up:

1. Do I need to use an assembler file?
2. Can I create an index from an existing TDB, or do I need to create the
index as I create the TDB?
3. Could you give me a description of the parameters of the EntityDefinition
class and where they come from? (In the RDF, maybe?)
4. Any general advice on how I can solve this problem from my code?

I tried to be as specific as possible in the hope that you can guide me in
the right direction. If I left anything out, just let me know and I will try
to explain better. Thanks.

--Brad

Re: Jena Text Search Help

Posted by Andy Seaborne <an...@apache.org>.
On 06/08/13 17:18, Brad Moran wrote:
> Ok, since I already have the TDB built, it seems the best plan would be to
> create an assembler file and then use the jena.textindexer application.
> Sorry, these are the namespaces:
>
>      xmlns:mms="http://rdf.cdisc.org/mms#"
>      xmlns="http://rdf.cdisc.org/sdtm-1-2/std#"
>      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>      xmlns:skos="http://www.w3.org/2004/02/skos/core#"
>      xmlns:owl="http://www.w3.org/2002/07/owl#"
>      xmlns:dc="http://purl.org/dc/elements/1.1/"
>      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
>      xmlns:sdtms="http://rdf.cdisc.org/sdtm-1-2/schema#"
>      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
>      xmlns:cts="http://rdf.cdisc.org/ct/schema#"
>      xml:base="http://rdf.cdisc.org/sdtm-1-2/std">
>
> I have no experience with assembler files, so I based mine on the example
> in the documentation. Does this look right?

Yes.

Starting with the working example and tweaking bit by bit (binary 
search!) until it is what you want is a good approach.

	Andy



Re: Jena Text Search Help

Posted by Brad Moran <bm...@pinnacle21.net>.
Ok, since I already have the TDB built, it seems the best plan would be to
create an assembler file and then use the jena.textindexer application.
Sorry, these are the namespaces:

    xmlns:mms="http://rdf.cdisc.org/mms#"
    xmlns="http://rdf.cdisc.org/sdtm-1-2/std#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:sdtms="http://rdf.cdisc.org/sdtm-1-2/schema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:cts="http://rdf.cdisc.org/ct/schema#"
    xml:base="http://rdf.cdisc.org/sdtm-1-2/std">

I have no experience with assembler files, so I based mine on the example
in the documentation. Does this look right?

@prefix :        <http://localhost/jena_example/#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix mms:     <http://rdf.cdisc.org/mms#> .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "tdb" ;
    tdb:unionDefaultGraph true ; # Optional
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:luceneIndexes> ;
    text:entityMap <#entMap> ;
    .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate mms:dataElementName ]
         [ text:field "text" ; text:predicate mms:dataElementDescription ]
         # the rest of the fields?
         ) .



On Tue, Aug 6, 2013 at 7:15 AM, Andy Seaborne <an...@apache.org> wrote:

> On 05/08/13 21:49, Brad Moran wrote:
>
>> I have an existing Jena TDB based on this example RDF:
>>
>>  ...
>
>
>> I have compiled a Jena TDB based on several of these RDF files so it is a
>> large TDB and have several SPARQL queries that work as desired. I am now
>> trying to implement a full text search on this TDB. I have downloaded the
>> Jena 2.10.2 Snapshot jars and figured out my dependencies. I would like to
>> implement this text search through java code using the new Jena Text
>> Search
>> feature. This is my best attempt at solving the problem so far:
>>
>>   public class TextSearchTest {
>>      public static void main(String[] args)
>>      {
>>          try{
>>              String DBDirectory = "tdb";
>>
>>              // Construct the Lucene Index to be queried
>>
>>              String indexDir = "luceneIndexes";
>>              File file = new File(indexDir);
>>              Directory dir = FSDirectory.open(file);
>>
>>              // Create the in memory text index described
>>              Dataset ds1 = TDBFactory.createDataset(DBDirectory);
>>              String uri = "<http://rdf.cdisc.org/mms#dataElement>";
>>              String property = "<http://rdf.cdisc.org/mms#dataElementName>";
>>              EntityDefinition entDef = new EntityDefinition(uri, property,
>> RDFS.Literal);//RDFS.label
>>
>
> This defines the text index to be working on a particular property.
>
> You want to pass in a resource (Resource or Property object) for
> http://rdf.cdisc.org/mms#dataElementName here.
>
>
>
>
>>              // Construct the Lucene Index to be queried
>>              Dataset dataset = TextDatasetFactory.createLucene(ds1, dir,
>> entDef);
>>
>
> I hope you loaded the data into this dataset, not the underlying TDB one,
> because otherwise the text indexer would not have seen the RDF triples to
> index.
>
>
>
>>              // try query
>>              dataset.begin(ReadWrite.READ);
>>                  QueryExecution qExec = QueryExecutionFactory.create(
>>                          "PREFIX text: <http://jena.apache.org/text#>
>> PREFIX
>> mms: <http://rdf.cdisc.org/mms#> "
>>                          + "SELECT * WHERE{?s text:query
>> (mms:dataElementName 'AE')}", dataset);
>>
>>                  ResultSet rs = qExec.execSelect();
>>                  ResultSetFormatter.out(rs);
>>
>>              dataset.end();
>>          }
>>          catch(Exception e){
>>              System.out.println(e);
>>          }
>>      }
>> }
>>
>>
>> This results in: WARN  o.apache.jena.query.text.TextQueryPF - Predicate not
>> indexed: http://rdf.cdisc.org/mms#dataElementName
>>
>
> Because that field isn't being indexed.
>
> You can have several fields indexed if you .set the EntityDefinition with
> additional predicates.
>
>
>> and an empty result set is printed out by resultSetFormatter. It does not
>> seem to create an index for the TDB.
>>
>> I believe my problem occurs with my
>> EntityDefinition (mainly because I am not sure where the parameters
>> entityField, primaryField, and primaryPredicate should come from). Also in
>> the example code it seems a lucene index is created then the data is loaded
>> by an assembler file. Maybe I am just implementing this wrong. So to try to
>> wrap this up:
>>
>> 1. Do I need to use an assembler file?
>>
>
> No but it may be easier that way.
>
>
>> 2. Can I create an index from an existing TDB or do I need to create the
>> index as I create the TDB.
>>
>
> As the data is loaded.
>
> There is a simple application 'jena.textindexer' which will create the
> index from existing data.
>
> http://jena.staging.apache.org/documentation/query/text-query.html#building-a-text-index
>
>
>> 3. Could you give me a description of the parameters of EntityDefinition
>> class and where they come from? (in the rdf maybe?)
>>
>
> Create a Property object for http://rdf.cdisc.org/mms#dataElementName and
> pass that in as the 3rd argument.
>
>
>> 4. Any general advice on how I can solve this problem from my code.
>>
>> I tried to be as specific as possible here in hopes that you may be able to
>> guide me in the right direction. If I left anything out just let me know and
>> hopefully I can explain better. Thanks.
>>
>
> Minor in this case, but the data is incomplete RDF/XML (no namespaces), so
> I didn't try using it.
>
> Our mantra is "complete, minimal example".  Both "complete" and "minimal"
> make it much, much easier to give good answers.
>
>
>> --Brad
>>
>>
>         Andy
>

Re: Jena Text Search Help

Posted by Andy Seaborne <an...@apache.org>.
On 05/08/13 21:49, Brad Moran wrote:
> I have an existing Jena TDB based on this example RDF:
>
...
>
> I have compiled a Jena TDB based on several of these RDF files so it is a
> large TDB and have several SPARQL queries that work as desired. I am now
> trying to implement a full text search on this TDB. I have downloaded the
> Jena 2.10.2 Snapshot jars and figured out my dependencies. I would like to
> implement this text search through java code using the new Jena Text Search
> feature. This is my best attempt at solving the problem so far:
>
>   public class TextSearchTest {
>      public static void main(String[] args)
>      {
>          try{
>              String DBDirectory = "tdb";
>
>              // Construct the Lucene Index to be queried
>
>              String indexDir = "luceneIndexes";
>              File file = new File(indexDir);
>              Directory dir = FSDirectory.open(file);
>
>              // Create the in memory text index described
>              Dataset ds1 = TDBFactory.createDataset(DBDirectory);
>              String uri = "<http://rdf.cdisc.org/mms#dataElement>";
>              String property = "<http://rdf.cdisc.org/mms#dataElementName>";
>              EntityDefinition entDef = new EntityDefinition(uri, property,
> RDFS.Literal);//RDFS.label

This defines the text index as working on one particular property.

You want to pass in a resource (a Resource or Property object) for
http://rdf.cdisc.org/mms#dataElementName here.



>              // Construct the Lucene Index to be queried
>              Dataset dataset = TextDatasetFactory.createLucene(ds1, dir,
> entDef);

I hope you loaded the data into this dataset, not the underlying TDB one,
because otherwise the text indexer would not have seen the RDF triples
to index.
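
Why the load path matters can be shown with a toy model. The classes below are
hypothetical and use only java.util; they are not Jena's TextDatasetFactory or
TDB, and the prefixed names are used as plain strings. The wrapper updates its
index only for triples added through it, so data written straight to the
underlying store stays invisible to text search.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: a plain triple store and an indexing wrapper around it.
final class ToyIndexedStore {
    // Underlying storage, standing in for the TDB dataset.
    static final class BaseStore {
        final List<String[]> triples = new ArrayList<>();
        void add(String s, String p, String o) {
            triples.add(new String[] {s, p, o});
        }
    }

    private final BaseStore base;
    private final String indexedPredicate;
    private final Set<String> indexedSubjects = new HashSet<>();

    ToyIndexedStore(BaseStore base, String indexedPredicate) {
        this.base = base;
        this.indexedPredicate = indexedPredicate;
    }

    // Adding through the wrapper stores the triple AND updates the index.
    void add(String s, String p, String o) {
        base.add(s, p, o);
        if (p.equals(indexedPredicate)) {
            indexedSubjects.add(s);
        }
    }

    boolean isIndexed(String subject) {
        return indexedSubjects.contains(subject);
    }

    public static void main(String[] args) {
        BaseStore tdb = new BaseStore();
        ToyIndexedStore wrapped = new ToyIndexedStore(tdb, "mms:dataElementName");

        // Written directly to the base store: the wrapper never sees it.
        tdb.add("DE.Intervention.--TRT", "mms:dataElementName", "--TRT");
        // Written through the wrapper: stored and indexed.
        wrapped.add("DE.Intervention.--MODIFY", "mms:dataElementName", "--MODIFY");

        System.out.println(wrapped.isIndexed("DE.Intervention.--TRT"));    // false
        System.out.println(wrapped.isIndexed("DE.Intervention.--MODIFY")); // true
        System.out.println(tdb.triples.size());                            // 2
    }
}
```

Both triples end up in the store, but only the one added through the wrapper
is searchable, which matches the situation of a TDB that was fully loaded
before the text dataset was created.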

>
>              // try query
>              dataset.begin(ReadWrite.READ);
>                  QueryExecution qExec = QueryExecutionFactory.create(
>                          "PREFIX text: <http://jena.apache.org/text#> PREFIX
> mms: <http://rdf.cdisc.org/mms#> "
>                          + "SELECT * WHERE{?s text:query
> (mms:dataElementName 'AE')}", dataset);
>
>                  ResultSet rs = qExec.execSelect();
>                  ResultSetFormatter.out(rs);
>
>              dataset.end();
>          }
>          catch(Exception e){
>              System.out.println(e);
>          }
>      }
> }
>
>
> This results in: WARN  o.apache.jena.query.text.TextQueryPF - Predicate not
> indexed: http://rdf.cdisc.org/mms#dataElementName

Because that field isn't being indexed.

You can have several fields indexed if you .set the EntityDefinition
with additional predicates.
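
The warning can be pictured with a small stand-alone model. This is a sketch
only: ToyEntityDefinition is a hypothetical stand-in built on plain java.util
collections, not Jena's EntityDefinition. It mimics the idea of mapping
predicates to index fields, using the field names "uri" and "text" from the
documentation example. The posted code registered RDFS.Literal as the only
predicate, so a lookup for mms:dataElementName finds nothing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an entity definition: NOT Jena's class.
// It records which predicate URI feeds which index field.
final class ToyEntityDefinition {
    private final String entityField; // index field that holds the subject URI
    private final Map<String, String> predicateToField = new LinkedHashMap<>();

    ToyEntityDefinition(String entityField, String primaryField, String primaryPredicate) {
        this.entityField = entityField;
        predicateToField.put(primaryPredicate, primaryField);
    }

    // Register an additional predicate for a field; returns this for chaining.
    ToyEntityDefinition set(String field, String predicate) {
        predicateToField.put(predicate, field);
        return this;
    }

    // Returns the field for a predicate, or null if the predicate is not indexed.
    String fieldFor(String predicate) {
        return predicateToField.get(predicate);
    }

    String entityField() {
        return entityField;
    }

    public static void main(String[] args) {
        String name = "http://rdf.cdisc.org/mms#dataElementName";
        String desc = "http://rdf.cdisc.org/mms#dataElementDescription";
        String rdfsLiteral = "http://www.w3.org/2000/01/rdf-schema#Literal";

        // Mirrors the third argument of the posted code: only rdfs:Literal is registered.
        ToyEntityDefinition posted = new ToyEntityDefinition("uri", "text", rdfsLiteral);
        if (posted.fieldFor(name) == null) {
            System.out.println("Predicate not indexed: " + name);
        }

        // Mirrors the suggested fix: register the predicates that are actually queried.
        ToyEntityDefinition fixed =
                new ToyEntityDefinition("uri", "text", name).set("text", desc);
        System.out.println(name + " -> " + fixed.fieldFor(name));
    }
}
```

In Jena itself the analogous change would presumably be to pass field names
such as "uri" and "text" as the first two arguments and a Property for
mms:dataElementName (for example one created with
ResourceFactory.createProperty) as the third; check the jena-text
documentation for the exact signatures in your version.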

> and an empty result set is printed out by resultSetFormatter. It does not
> seem to create an index for the TDB.

> I believe my problem occurs with my
> EntityDefinition (mainly because I am not sure where the parameters
> entityField, primaryField, and primaryPredicate should come from). Also in
> the example code it seems a lucene index is created then the data is loaded
> by an assembler file. Maybe I am just implementing this wrong. So to try to
> wrap this up:
>
> 1. Do I need to use an assembler file?

No but it may be easier that way.

> 2. Can I create an index from an existing TDB or do I need to create the
> index as I create the TDB.

As the data is loaded.

There is a simple application 'jena.textindexer' which will create the 
index from existing data.

http://jena.staging.apache.org/documentation/query/text-query.html#building-a-text-index
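
Conceptually, building an index from existing data is a scan over the stored
triples. The sketch below is a hypothetical, java.util-only illustration of
that idea, not the jena.textindexer implementation: it walks existing triples,
keeps those whose predicate is mapped, and records each lower-cased word
against the subject. The tokenisation (splitting on non-alphanumeric
characters) is an assumption made for brevity; Lucene analyzers behave
differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy re-indexer: NOT jena.textindexer, only an illustration of the scan.
final class ToyReindexer {
    // Build word -> subjects from already stored triples {subject, predicate, literal}.
    static Map<String, Set<String>> buildIndex(List<String[]> triples,
                                               Set<String> indexedPredicates) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String[] t : triples) {
            if (!indexedPredicates.contains(t[1])) {
                continue; // predicate is not mapped, so the literal is skipped
            }
            for (String word : t[2].toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(t[0]);
            }
        }
        return index;
    }

    // Look up the subjects whose indexed literals contain the given word.
    static Set<String> search(Map<String, Set<String>> index, String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        // Sample triples taken from the RDF in the original post.
        List<String[]> existing = Arrays.asList(
                new String[] {"DE.Intervention.--TRT", "mms:dataElementLabel", "Name of Treatment"},
                new String[] {"DE.Intervention.--MODIFY", "mms:dataElementLabel", "Modified Treatment Name"},
                new String[] {"DE.Intervention.--TRT", "mms:ordinal", "1"});

        Map<String, Set<String>> index =
                buildIndex(existing, Collections.singleton("mms:dataElementLabel"));

        System.out.println(search(index, "treatment")); // [DE.Intervention.--MODIFY, DE.Intervention.--TRT]
        System.out.println(search(index, "modified"));  // [DE.Intervention.--MODIFY]
        System.out.println(search(index, "1"));         // []
    }
}
```

The point of the sketch is that the existing store is only read, never
reloaded, which is why an index can be built after the TDB already exists.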

> 3. Could you give me a description of the parameters of EntityDefinition
> class and where they come from? (in the rdf maybe?)

Create a Property object for http://rdf.cdisc.org/mms#dataElementName and
pass that in as the 3rd argument.

> 4. Any general advice on how I can solve this problem from my code.
>
> I tried to be as specific as possible here in hopes that you may be able to
> guide me in the right direction. If I left anything out just let me know and
> hopefully I can explain better. Thanks.

Minor in this case, but the data is incomplete RDF/XML (no namespaces),
so I didn't try using it.

Our mantra is "complete, minimal example".  Both "complete" and 
"minimal" make it much, much easier to give good answers.

>
> --Brad
>

	Andy