Posted to java-user@lucene.apache.org by Kevin Burton <bu...@rojo.com> on 2005/06/07 03:41:05 UTC

Fastest way to fetch N documents with unique keys within large numbers of indexes..

Hey.

I'm trying to figure out the FASTEST way to solve this problem.

We have a system where I'll be given 10 or 20 unique keys, which are 
stored as non-tokenized fields within Lucene.  Each key represents a 
unique document.

Internally I'm creating a new Term and then calling 
IndexReader.termDocs() on this term.  Then if termDocs.next() matches, 
I'll return this document.
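
Roughly, the per-key lookup looks something like this (just a sketch; the 
field name "key", the index path, and the variable names here are 
placeholders, not our real ones):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

IndexReader reader = IndexReader.open("/path/to/one/index");  // one of our indexes
String id = "some-unique-key";                                // one of the 10-20 keys

TermDocs td = reader.termDocs(new Term("key", id));  // "key" = non-tokenized key field
if (td.next()) {
    Document doc = reader.document(td.doc());        // fetch the stored document
    // ... return doc ...
}
td.close();
reader.close();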

The problem is that this doesn't work very fast either.  This is not an 
academic debate as I've put the system in a profiler and Lucene is the 
top bottleneck (by far).

I don't think there's anything faster than this, right?  Could I maybe 
cache a TermEnum, keep it as a pointer to the FIRST of these IDs, and 
then reuse it?  That might let me seek to the start of my terms faster.

Does Lucene internally do a binary search for my term?

I could of course merge all of this content into one index, but that's a 
separate problem.  We have a lot of indexes, often more than 40, and 
constantly merging them into a multi-gigabyte index just takes FOREVER.

It seems that internally IO is the problem.  I'm about as fast on IO as I 
can get, since I'm on a RAID0 array of fast SCSI disks...  I also tried 
tweaking InputStream.BUFFER_SIZE with no visible change in performance.

Kevin

-- 


Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. 
See irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

   Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 07 June 2005 09:22, Paul Elschot wrote:
...
> 
> With the indexes on multiple discs, some parallelism can be introduced.
> A thread per disk could be used.
> In case there are multiple requests pending, they can be serialized just 
> before the sorting of the terms, and just before the sorting of the document
> numbers.

That should be 'merged' instead of 'serialized'.
 
> Regards,
> Paul Elschot
> 




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 08 June 2005 01:18, Kevin Burton wrote:
> Paul Elschot wrote:
> 
> >For a large number of indexes, it may be necessary to do this over
> >multiple indexes by first getting the doc numbers for all indexes,
> >then sorting these per index, then retrieving them
> >from all indexes, and repeating the whole thing using terms determined
> >from the retrieved docs.
> >  
> >
> Well this was a BIG win.  Just benchmarking it out shows a 10x -> 50x 
> performance increase.
> 
> Times in milliseconds:
> 
> Before:
> 
> duration: 1127
> duration: 449
> duration: 394
> duration: 564
> 
> After:
> 
> duration: 182
> duration: 39
> duration: 12
> duration: 11

There is no need for a relational db when you have Lucene :)
Thanks for reporting the old and the new times.

Regards,
Paul Elschot




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Kevin Burton <bu...@rojo.com>.
Paul Elschot wrote:

>For a large number of indexes, it may be necessary to do this over
>multiple indexes by first getting the doc numbers for all indexes,
>then sorting these per index, then retrieving them
>from all indexes, and repeating the whole thing using terms determined
>from the retrieved docs.
>  
>
Well this was a BIG win.  Just benchmarking it out shows a 10x -> 50x 
performance increase.

Times in milliseconds:

Before:

duration: 1127
duration: 449
duration: 394
duration: 564

After:

duration: 182
duration: 39
duration: 12
duration: 11

The values for runs 2-4, I'm sure, are due to the filesystem buffer cache, 
but I can't imagine why they'd be faster in the second round.  It might be 
that Linux is deciding not to buffer the document blocks.

Kevin





Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 07 June 2005 07:17, Kevin Burton wrote:
> Matt Quail wrote:
> 
> >> We have a system where I'll be given 10 or 20 unique keys.
> >
> >
> > I assume you mean you have one unique-key field, and you are given  
> > 10-20 values to find for this one field?
> >
> >>
> >> Internally I'm creating a new Term and then calling  
> >> IndexReader.termDocs() on this term.  Then if termdocs.next()  
> >> matches then I'll return this document.
> >
> >
> > Are you calling reader.termDocs() inside a (tight) loop? It might be  
> > better to create one TermEnum, and use "seek". Something like this:
> 
> Yes.. this is another approach I was thinking of taking.  I was thinking 
> of building up a list of indexes which had a high probability of holding 
> the given document and then searching for each of them.
> 
> What I'm worried about though is that it would be a bit slower...  I'm 
> just going to have to test out different implementations to see....
> 
> <snip>
> 
> >
> >
> > I'm pretty sure that will work. And if you can avoid the multi- 
> > threading issues, you might try and use the same TermDocs object for  
> > as long as possible (that is, move it up out of as many tight loops  
> > as you can).
> 
> Well... that doesn't look like the biggest overhead.  The bottleneck 
> seems to be in seek() and the fact that it's using an InputStream to read 
> bytes off disk.  I actually tried to speed that up by cranking the 
> InputStream.BUFFER_SIZE var higher, but that didn't work either, though I'm 
> not sure if it's a caching issue.  I sent an email to the list about this 
> earlier but no one responded.
> 
> So it seems like my bottleneck is in seek(), so it would make sense to 
> figure out how to limit it.
> 
> Is this O(log(N))  btw or is it O(N) ?

Seeking terms should be O(log(N)), and the fastest way to do that
is by sorting the terms first, but that does not really help much.

The biggest increase in performance that I had for a similar problem
was to first collect all the document numbers, then sort them, and
then fetch all the corresponding docs. See also the file formats.
This avoids the disk head going between the terms and the stored docs,
and it also minimizes the head movement for retrieving the docs.

For a single term, the document numbers are already ordered; for 
multiple terms, the sort is needed.
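
For a single index, a rough sketch of this (placeholder names: reader is 
an open IndexReader, keys is the sorted array of key values, and "key" is 
the unique key field):

import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

int[] docNums = new int[keys.length];
int found = 0;

TermDocs td = reader.termDocs();
for (int i = 0; i < keys.length; i++) {
    td.seek(new Term("key", keys[i]));          // keys already sorted
    if (td.next()) {
        docNums[found++] = td.doc();            // collect doc numbers only
    }
}
td.close();

Arrays.sort(docNums, 0, found);                 // sort doc numbers before fetching

for (int i = 0; i < found; i++) {
    Document doc = reader.document(docNums[i]); // fetch the stored docs in order
    // ... use doc, note its terms for a possible next round ...
}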

For a large number of indexes, it may be necessary to do this over
multiple indexes by first getting the doc numbers for all indexes,
then sorting these per index, then retrieving them
from all indexes, and repeating the whole thing using terms determined
from the retrieved docs.

With the indexes on multiple discs, some parallelism can be introduced.
A thread per disk could be used.
In case there are multiple requests pending, they can be serialized just 
before the sorting of the terms, and just before the sorting of the document
numbers.
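
A minimal sketch of that idea (hypothetical names: readers holds one open 
IndexReader per index/disk, and fetchSortedDocs() stands for the 
collect/sort/fetch routine sketched above):

import org.apache.lucene.index.IndexReader;

void fetchFromAllIndexes(final IndexReader[] readers, final String[] keys)
        throws InterruptedException {
    Thread[] workers = new Thread[readers.length];
    for (int i = 0; i < readers.length; i++) {
        final IndexReader reader = readers[i];
        workers[i] = new Thread() {
            public void run() {
                fetchSortedDocs(reader, keys);  // per-index collect/sort/fetch
            }
        };
        workers[i].start();                     // one thread per index/disk
    }
    for (int i = 0; i < workers.length; i++) {
        workers[i].join();                      // wait for every disk to finish
    }
}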

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Kevin Burton <bu...@rojo.com>.
Matt Quail wrote:

>> We have a system where I'll be given 10 or 20 unique keys.
>
>
> I assume you mean you have one unique-key field, and you are given  
> 10-20 values to find for this one field?
>
>>
>> Internally I'm creating a new Term and then calling  
>> IndexReader.termDocs() on this term.  Then if termdocs.next()  
>> matches then I'll return this document.
>
>
> Are you calling reader.termDocs() inside a (tight) loop? It might be  
> better to create one TermEnum, and use "seek". Something like this:

Yes.. this is another approach I was thinking of taking.  I was thinking 
of building up a list of indexes which had a high probability of holding 
the given document and then searching for each of them.

What I'm worried about though is that it would be a bit slower...  I'm 
just going to have to test out different implementations to see....

<snip>

>
>
> I'm pretty sure that will work. And if you can avoid the multi- 
> threading issues, you might try and use the same TermDocs object for  
> as long as possible (that is, move it up out of as many tight loops  
> as you can).

Well... that doesn't look like the biggest overhead.  The bottleneck 
seems to be in seek() and the fact that it's using an InputStream to read 
bytes off disk.  I actually tried to speed that up by cranking the 
InputStream.BUFFER_SIZE var higher, but that didn't work either, though I'm 
not sure if it's a caching issue.  I sent an email to the list about this 
earlier but no one responded.

So it seems like my bottleneck is in seek(), so it would make sense to 
figure out how to limit it.

Is this O(log(N))  btw or is it O(N) ?

Kevin





Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Matt Quail <ma...@cenqua.com>.
> We have a system where I'll be given 10 or 20 unique keys.

I assume you mean you have one unique-key field, and you are given  
10-20 values to find for this one field?
>
> Internally I'm creating a new Term and then calling  
> IndexReader.termDocs() on this term.  Then if termdocs.next()  
> matches then I'll return this document.

Are you calling reader.termDocs() inside a (tight) loop? It might be  
better to create one TermEnum, and use "seek". Something like this:

import java.util.Arrays;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

IndexReader reader = ...;   // your open reader
TermDocs td = reader.termDocs();

String[] keys = ....;       // your unique keys
Arrays.sort(keys);          // sorting the keys probably helps seek() below

for (int i = 0; i < keys.length; i++) {
    td.seek(new Term("unique", keys[i]));
    if (td.next()) {
        // found a match; td.doc() is the document number
    }
}
td.close();

I'm pretty sure that will work. And if you can avoid the multi- 
threading issues, you might try and use the same TermDocs object for  
as long as possible (that is, move it up out of as many tight loops  
as you can).

=Matt



Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Kevin Burton <bu...@rojo.com>.
Chris Hostetter wrote:

>: was computing the score.  This was a big performance gain.  About 2x and
>: since it's the slowest part of our app it was a nice one. :)
>:
>: We were using a TermQuery though.
>
>I believe that one search on one BooleanQuery containing 20
>TermQueries should be faster than 20 searches on 20 TermQueries.
>  
>
Actually.. it wasn't... :-/

It was about 4x slower.

Ug...

Kevin





Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Chris Hostetter <ho...@fucit.org>.
: was computing the score.  This was a big performance gain.  About 2x and
: since it's the slowest part of our app it was a nice one. :)
:
: We were using a TermQuery though.

I believe that one search on one BooleanQuery containing 20
TermQueries should be faster than 20 searches on 20 TermQueries.

: >2) have you tried sorting your terms first, then opening a TermDocs on the
: >   first one, and seeking to each of the remaining terms?  it seems like
: >   that would be faster than opening a new TermDocs for each Term.
: >
: >
: How do I do this?  I just assumed that termDocs was already sorted...
:
: I don't see any mention of this in the API...

I'm not talking about any special API to sort a TermDocs -- it is sorted.

I'm saying you should start by sorting your input (the 10-20 unique IDs
you mentioned) and then seek to the first ID, then using the same
TermDocs instance, seek to the second ID, etc....


-Hoss




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Kevin Burton <bu...@rojo.com>.
Chris Hostetter wrote:

>I haven't profiled either of these suggestions but:
>
>1) have you tried constructing a BooleanQuery of all 10-20 terms? Is the
>   total time to execute the search, and access each Hit slower than your
>   termDocs approach?
>  
>
Actually using any type of query was very slow.  The problem was when it 
was computing the score.  Switching to the termDocs approach was a big 
performance gain, about 2x, and since it's the slowest part of our app it 
was a nice one. :)

We were using a TermQuery though.

I wonder if there's a way to tell Lucene not to score.  Maybe I could 
then use a BooleanQuery with internal TermQueries and scan each index 
only once.
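
Something like this is what I had in mind for building the query (just a 
sketch using the 1.4-style BooleanQuery.add(query, required, prohibited) 
call; "unique" is a placeholder for our key field, and keys and searcher 
are assumed to already exist):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.TermQuery;

BooleanQuery bq = new BooleanQuery();
for (int i = 0; i < keys.length; i++) {
    // required=false, prohibited=false: a plain OR over the key terms
    bq.add(new TermQuery(new Term("unique", keys[i])), false, false);
}
Hits hits = searcher.search(bq);  // still computes a score per hit, which is the slow part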

>2) have you tried sorting your terms first, then opening a TermDocs on the
>   first one, and seeking to each of the remaining terms?  it seems like
>   that would be faster than opening a new TermDocs for each Term.
>  
>
How do I do this?  I just assumed that termDocs was already sorted...

I don't see any mention of this in the API...

Kevin





Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Posted by Chris Hostetter <ho...@fucit.org>.
I haven't profiled either of these suggestions but:

1) have you tried constructing a BooleanQuery of all 10-20 terms? Is the
   total time to execute the search, and access each Hit slower than your
   termDocs approach?

2) have you tried sorting your terms first, then opening a TermDocs on the
   first one, and seeking to each of the remaining terms?  it seems like
   that would be faster than opening a new TermDocs for each Term.

: The problem is that this doesn't work very fast either.  This is not an
: academic debate as I've put the system in a profiler and Lucene is the
: top bottleneck (by far).

The tricky thing about profiling code: something is always the bottleneck.




-Hoss

