Posted to users@jena.apache.org by hu...@aol.com on 2013/10/22 10:09:05 UTC

Jena Text Lucene Assembler file questions

Hi,

I have been trying to set up a text search, but unfortunately with limited success so far. I attached the assembler file and the exception I am getting. I have an already existing TDB store and I just want to add text search on the predicate "nci:Preferred_Name". I copied the example file and changed the following pieces:

- The default URI Prefix
- tdb:location
- tdb:unionDefaultGraph
- text:predicate

The error I am getting seems to be related to bad syntax. However, I cannot spot it. Some help would be very much appreciated!

Other questions:
- text:entityField - What does that need to point to? There is no field "uri" in my ontology. Is this supposed to distinguish one concept from another? Or is "uri" an implicit field for any subject's URI?
- In TDB assembler examples in general, could you replace the tdb:location "DB" with something more meaningful, e.g. a relative path to the TDB store? "DB" could be anything, but as I understand it, it is the path to the directory of the already created TDB store.


Thanks!
Wolfgang

Re: Re: Jena Text Lucene Assembler file questions

Posted by Chris Dollin <ch...@epimorphics.com>.
On Thursday, October 24, 2013 05:31:06 AM hueyl16@aol.com wrote:
> I have another question about performance: Using jena-text with a Lucene
> index is expected to be faster than a query with a regex filter, correct?

That will depend on details, but typically, yes.

> I ran two queries, returning the (almost) same data, one using jena-text,
> the other regex filter. I measured the execution times from 
> QueryFactory.create until after qe.execSelect(). And from there to
> after CSVOutput.out(rs) (queries are attached). The results I get are:
> 
> jena-text:
> FINISH 1 - 359.32ms
> FINISH 2 - 130.28ms
> OVERALL - 489.61ms
> 
> regex filter:
> FINISH 1 - 46.27ms
> FINISH 2 - 2540.39ms
> OVERALL - 2586.66ms
> 
> So it seems to confirm the assumption that jena-text is faster. I was just wondering where the difference in FINISH 1 and FINISH 2 time is coming from? Is it executing the query or just preparing it in FINISH 1 and executing it once the ResultSet is being iterated over in FINISH 2?

Yes, that's more-or-less it. The query result is streamed as it is
computed, not computed in advance.

> The FINISH 2 time kind of suggests that since both are printing out the same list,
> but regex takes much longer to "print" which seems unlikely if it was just printing 
> the same list to console.

Streaming, plus the FILTER is being applied to lots more bindings 
than the jena-text index generates.
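
To make the streaming point concrete, here is a minimal sketch in plain Java. It does not use Jena at all: the class, the method execSelectSketch and the "label-N" strings are hypothetical stand-ins, and the counter only shows when the filter work happens. The idea is that setting up the result costs almost nothing, and the filter runs over every candidate only while the results are being iterated.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class LazyResultsSketch {
    // Counts how many candidate bindings the filter has examined so far.
    static final AtomicInteger examined = new AtomicInteger();

    // Hypothetical stand-in for execSelect(): builds a lazy result iterator.
    // No candidate is examined until the caller starts iterating.
    static Iterator<String> execSelectSketch(int candidates, String needle) {
        return IntStream.range(0, candidates)
                .mapToObj(i -> "label-" + i)
                .filter(s -> {
                    examined.incrementAndGet();   // filter work happens here
                    return s.contains(needle);
                })
                .iterator();
    }

    public static void main(String[] args) {
        Iterator<String> rs = execSelectSketch(100_000, "999");
        System.out.println("examined after setup: " + examined.get());

        // Consuming the results is what drives the filter over all candidates.
        List<String> rows = new ArrayList<>();
        rs.forEachRemaining(rows::add);
        System.out.println("examined after iteration: " + examined.get());
        System.out.println("rows returned: " + rows.size());
    }
}
```

In this sketch the filter plays the role of the regex FILTER: it has to look at every candidate, whereas an index lookup would hand back only the matching ones, which is one plausible reading of the FINISH 2 difference.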

> And btw, sometimes the very first query using jena-text takes much longer 
> than subsequent queries. I am assuming that it is doing some sort of caching 
> in the first one?!

There's caching and there's also class loading.

Chris

-- 
"The wizard seemed quite willing when I talked to him."  /Howl's Moving Castle/

Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)


Re: Jena Text Lucene Assembler file questions

Posted by Andy Seaborne <an...@apache.org>.
On 23/10/13 17:25, Dave Reynolds wrote:
> On 23/10/13 17:04, hueyl16@aol.com wrote:
>> I just compared the file I sent with the local one that I am using and
>> they are identical. I also just ran rdfcat and it reads the file just
>> fine. Tried jena.textindexer again and that still fails.
>
> You are using relative URIs and those are what the parser seems to be
> complaining about. For some reason, judging from your error messages,
> the base URI is being taken as the windows directory path instead of
> being a legal (file:) URI.
>
> I don't know why that's happening but a suggested work round would be to
> avoid relative URIs - replace your uses of <#dataset>, <#indexLucene>
> and <#entMap> with :dataset, :indexLucene and :entMap.

I think I've found the cause and fixed it.  It's the handling of drive
letters (which also match the general URI pattern of
"scheme:scheme-specific-part").

JENA-574
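
The ambiguity can be illustrated with a small sketch in plain Java. This is not the change made for JENA-574; the class and both helper methods are hypothetical, and the drive-letter check is only one possible heuristic. It shows that a Windows path such as the one used on the command line in this thread starts with something that satisfies the RFC 3986 scheme syntax.

```java
import java.util.regex.Pattern;

class DriveLetterVsScheme {
    // RFC 3986 scheme syntax: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":".
    private static final Pattern SCHEME =
            Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    // Hypothetical heuristic: a single letter, a colon, then a slash or backslash.
    private static final Pattern DRIVE =
            Pattern.compile("^[A-Za-z]:[\\\\/].*");

    // True when the string begins with something shaped like a URI scheme.
    static boolean hasSchemePrefix(String s) {
        return SCHEME.matcher(s).find();
    }

    // True when the string looks like a Windows drive-letter path.
    static boolean looksLikeWindowsPath(String s) {
        return DRIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String path = "C:\\Development\\Ontology\\jena_assembler.ttl";
        String fileUri = "file:///C:/Development/Ontology/jena_assembler.ttl";

        // The drive letter "C:" is indistinguishable from a one-letter scheme.
        System.out.println(hasSchemePrefix(path));          // true
        System.out.println(looksLikeWindowsPath(path));     // true
        System.out.println(looksLikeWindowsPath(fileUri));  // false
    }
}
```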

	andy



Re: Jena Text Lucene Assembler file questions

Posted by hu...@aol.com.
Hi Dave,

the suggested workaround works. Thanks! Also thanks to Chris for taking the time to respond.

I have another question about performance: Using jena-text with a Lucene index is expected to be faster than a query with a regex filter, correct? I ran two queries, returning the (almost) same data, one using jena-text, the other regex filter. I measured the execution times from QueryFactory.create until after qe.execSelect(). And from there to after CSVOutput.out(rs) (queries are attached). The results I get are:

jena-text:
FINISH 1 - 359.32ms
FINISH 2 - 130.28ms
OVERALL - 489.61ms

regex filter:
FINISH 1 - 46.27ms
FINISH 2 - 2540.39ms
OVERALL - 2586.66ms

So it seems to confirm the assumption that jena-text is faster. I was just wondering where the difference in FINISH 1 and FINISH 2 time is coming from? Is it executing the query or just preparing it in FINISH 1 and executing it once the ResultSet is being iterated over in FINISH 2? The FINISH 2 time kind of suggests that since both are printing out the same list, but regex takes much longer to "print" which seems unlikely if it was just printing the same list to console.

And btw, sometimes the very first query using jena-text takes much longer than subsequent queries. I am assuming that it is doing some sort of caching in the first one?!

-Wolfgang




 

Re: Jena Text Lucene Assembler file questions

Posted by Dave Reynolds <da...@gmail.com>.
On 23/10/13 17:04, hueyl16@aol.com wrote:
> I just compared the file I sent with the local one that I am using and they are identical. I also just ran rdfcat and it reads the file just fine. Tried jena.textindexer again and that still fails.

You are using relative URIs and those are what the parser seems to be 
complaining about. For some reason, judging from your error messages, 
the base URI is being taken as the Windows directory path instead of 
being a legal (file:) URI.

I don't know why that's happening but a suggested workaround would be to 
avoid relative URIs - replace your uses of <#dataset>, <#indexLucene> 
and <#entMap> with :dataset, :indexLucene and :entMap.
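
As a rough illustration of why the base matters, the sketch below uses java.net.URI from the Java standard library, not Jena's own IRI resolver, so the exact behaviour inside Jena may differ. The class and helper names are hypothetical. A relative reference such as #dataset resolves cleanly against a legal file: base, while the raw Windows path does not even parse as a URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

class RelativeUriSketch {
    // Resolves a relative reference such as "#dataset" against a base URI.
    static URI resolve(String base, String relative) throws URISyntaxException {
        return new URI(base).resolve(relative);
    }

    // True when the string parses as a URI at all.
    static boolean isLegalUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        // A legal file: base keeps its path and gains the fragment.
        URI r = resolve("file:///C:/Development/Ontology/jena_assembler.ttl", "#dataset");
        System.out.println(r.getPath());
        System.out.println(r.getFragment());

        // Backslashes are not legal URI characters, so this is rejected.
        System.out.println(isLegalUri("C:\\Development\\Ontology\\jena_assembler.ttl"));
    }
}
```

Prefixed names such as :dataset sidestep the issue because they expand against the declared prefix rather than against the base URI of the file.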

Dave




Re: Jena Text Lucene Assembler file questions

Posted by hu...@aol.com.
I just compared the file I sent with the local one that I am using and they are identical. I also just ran rdfcat and it reads the file just fine. Tried jena.textindexer again and that still fails.

 

 -Wolfgang



 

 


Re: Re: Jena Text Lucene Assembler file questions

Posted by Chris Dollin <ch...@epimorphics.com>.
On Wednesday, October 23, 2013 07:53:26 AM hueyl16@aol.com wrote:
> The file is called "jena_assembler.ttl" on my machine. I had to rename it to .txt so the mailing list attachment filter wouldn't remove it. I am getting the error that I attached earlier when executing this from the command line:
> 
> java -cp %FUSEKI_HOME%\fuseki-server.jar jena.textindexer --desc=C:\Development\Ontology\jena_assembler.ttl

I can read jena_assembler.ttl without problems (using jena.rdfcat).
Are you sure that's the right file?

Chris



Re: Jena Text Lucene Assembler file questions

Posted by hu...@aol.com.
The file is called "jena_assembler.ttl" on my machine. I had to rename it to .txt so the mailing list attachment filter wouldn't remove it. I am getting the error that I attached earlier when executing this from the command line:

java -cp %FUSEKI_HOME%\fuseki-server.jar jena.textindexer --desc=C:\Development\Ontology\jena_assembler.ttl

-Wolfgang

 

 



 

 


Re: Re: Jena Text Lucene Assembler file questions

Posted by Chris Dollin <ch...@epimorphics.com>.
On Tuesday, October 22, 2013 05:57:21 PM hueyl16@aol.com wrote:
> Seems it does not like the file extension. Renaming it now. Sorry for the amount of mails.

When Jena reads a file it uses the file suffix to guess what language
the file is in. If it doesn't know the suffix it defaults to RDF/XML. So
if you have Turtle text in, say, myturtle.txt and try to read it in without
telling Jena that it's Turtle, the RDF/XML parser will object. (I got
"content is not allowed in prolog"; that's the RDF/XML parser assuming
the non-XML text it's seeing is Just Stuff that would be skipped over,
but that is Not Allowed.)
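
The suffix rule can be sketched in a few lines of plain Java. This is not Jena's implementation: the class, the guessLanguage helper and the three-entry table are hypothetical and only mirror the behaviour described above, where an unknown suffix falls back to RDF/XML.

```java
import java.util.Locale;
import java.util.Map;

class SuffixGuessSketch {
    // Hypothetical suffix table for illustration only.
    private static final Map<String, String> BY_SUFFIX = Map.of(
            "ttl", "TURTLE",
            "nt", "N-TRIPLES",
            "rdf", "RDF/XML");

    // Returns a language name for the file, falling back to RDF/XML
    // when the suffix is missing or not in the table.
    static String guessLanguage(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String suffix = dot < 0
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_SUFFIX.getOrDefault(suffix, "RDF/XML");
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage("jena_assembler.ttl")); // TURTLE
        System.out.println(guessLanguage("jena_assembler.txt")); // RDF/XML
    }
}
```

Under this rule a Turtle file renamed to .txt is handed to the RDF/XML parser, which is consistent with the "content is not allowed in prolog" message.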

Chris



Re: Jena Text Lucene Assembler file questions

Posted by hu...@aol.com.
Seems it does not like the file extension. Renaming it now. Sorry for the amount of mails.
 

 



 

 


Re: Jena Text Lucene Assembler file questions

Posted by hu...@aol.com.
I sent it with two attachments, maybe one got removed ...

So here is the assembler file.

 

 



 

 


Re: Jena Text Lucene Assembler file questions

Posted by Chris Dollin <ch...@epimorphics.com>.
On Tuesday, October 22, 2013 04:09:05 AM hueyl16@aol.com wrote:

> The error I am getting seems to be related to bad syntax.
> However, I cannot spot it. 

Neither can we, without seeing the source. 

It's at the beginning of line 3, though. Perhaps you have a "<<" rather 
than a "<"?

Chris

