Posted to java-user@lucene.apache.org by Joe Paulsen <jo...@verizon.net> on 2004/03/31 01:24:06 UTC

Near performance question

Because of the nature of our documents, we sometimes
experience extremely long response times when executing
NEAR operations against a document (sometimes running for
minutes, even though the operation is restricted
to a single document).

Our analysis of the code suggests (we think) that it does the following:

1. Looks up each of the terms in the word.dbx file.

2. Intersects the occurrence lists. (So far so good!)

3. Takes each gid found in the occurrence list and:
   a. finds its parent, and that node's parent, all the way up to the
      root of the document (in dom.dbx);
   b. traverses the tree depth-first until it finds the node text of
      interest;
   c. does the expected scan to find out if the term distance
      requirement is satisfied.
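
To make the cheap parts of that analysis concrete, here is a minimal Java sketch of the list intersection and the term distance scan as we understand them. It is not the actual code: the class NearSketch, its method names, and the use of plain int arrays for gids and word positions are all hypothetical, and it assumes both inputs are sorted ascending.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the real index or query code. */
class NearSketch {

    /** Intersects two ascending gid lists with a linear merge. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> common = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                common.add(a[i]);   // gid present in both lists
                i++;
                j++;
            }
        }
        return common;
    }

    /**
     * Returns true if some word position in posA lies within maxDistance
     * words of some position in posB. Both arrays must be ascending.
     */
    static boolean near(int[] posA, int[] posB, int maxDistance) {
        int i = 0, j = 0;
        while (i < posA.length && j < posB.length) {
            if (Math.abs(posA[i] - posB[j]) <= maxDistance) {
                return true;
            }
            // Advance whichever side is behind; that is the only move
            // that can bring the two positions closer together.
            if (posA[i] < posB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(intersect(new int[] {1, 4, 7, 9}, new int[] {2, 4, 9, 12}));
        System.out.println(near(new int[] {3, 20}, new int[] {10, 24}, 4));
    }
}
```

Both steps are linear in the length of their inputs, which fits our impression that the time goes into locating each node in the tree rather than into the intersection or the scan.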

We ran some timings on one of our documents (Rusticus).
It started off taking less than 1 second per occurrence and grew to 25 seconds.

Changing the dom.dbx buffers gave a significant
improvement, but it was still relatively slow (343 occurrences).

QUESTION:
It seems to us that the occurrences are ordered by gid
(and we don't do any updating). Is there
a simple way to reuse the tree-level positioning
information from the prior occurrence when
processing the current occurrence, so that
we don't have to start again from the
document root?
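
To illustrate what we mean, here is a rough Java sketch that counts node visits for the two strategies on a toy in-memory tree. Everything in it is hypothetical (the ResumableScan class, the Node type, the method names), and it rests on a big assumption: that gids increase in depth-first document order, which may not be how the real store numbers its nodes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for the stored DOM; not the real API. */
class ResumableScan {

    /** Tree node; 'id' plays the role of the gid (ASSUMED to follow document order). */
    static final class Node {
        final int id;
        final List<Node> children = new ArrayList<>();
        Node(int id) { this.id = id; }
        Node add(Node child) { children.add(child); return this; }
    }

    /** Collects the nodes depth-first (pre-order), i.e. in document order. */
    static List<Node> documentOrder(Node root) {
        List<Node> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node n, List<Node> out) {
        out.add(n);
        for (Node c : n.children) {
            collect(c, out);
        }
    }

    /** Baseline: restart the walk from the root for every occurrence. */
    static long visitsRestarting(List<Node> order, int[] sortedIds) {
        long visits = 0;
        for (int id : sortedIds) {
            for (Node n : order) {
                visits++;
                if (n.id == id) break;
            }
        }
        return visits;
    }

    /** Alternative: keep the cursor where the previous occurrence left it. */
    static long visitsResuming(List<Node> order, int[] sortedIds) {
        long visits = 0;
        int cursor = 0;
        for (int id : sortedIds) {
            while (cursor < order.size()) {
                visits++;
                if (order.get(cursor).id == id) break; // stay here for the next occurrence
                cursor++;
            }
        }
        return visits;
    }

    /** Builds a small example tree: 0 -> (1 -> 2, 3), (4 -> 5, 6). */
    static Node sampleTree() {
        return new Node(0)
            .add(new Node(1).add(new Node(2)).add(new Node(3)))
            .add(new Node(4).add(new Node(5)).add(new Node(6)));
    }

    public static void main(String[] args) {
        List<Node> order = documentOrder(sampleTree());
        int[] occurrences = {2, 3, 6};
        System.out.println("restarting: " + visitsRestarting(order, occurrences));
        System.out.println("resuming:   " + visitsResuming(order, occurrences));
    }
}
```

With sorted occurrences the resuming version never moves backwards, so its total work is bounded by the size of the document plus the number of occurrences, whereas restarting from the root repeats the whole prefix for every occurrence. Whether the real code can keep such a cursor between occurrences is exactly what we are asking.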

Thanks,

Joe Paulsen


