Posted to general@lucene.apache.org by João Rodrigues <an...@gmail.com> on 2008/02/24 16:12:30 UTC
Lucene - Search Optimization Problem
Hello all!
I've finally got round to setting up Lucene 2.3.0 on my two production boxes
(Ubuntu 7.10 and Windows XP), after quite some trouble with the JCC compilation
steps. Now I have my application up and running and... it's damn slow :(
I'm running PyLucene, by the way, and I've already asked on that list,
which directed me here.
I have a 6.6GB index with more than 5,000,000 biomedical abstracts indexed.
Each document has two fields: an integer, which I want to retrieve upon
search (the ID of the document, sort of), and an 80-word, stored, tokenized
string, which is what gets searched. So, I enter the query (say, foo bar),
it first builds a sort of "boolean query" with a format such as:
'foo' AND 'bar'. Then it parses it and spits out the results.
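The query-building step described above could be sketched in plain Python like this (the function name is mine, not from the posted pastebin code):

```python
def build_query(user_input):
    """Turn whitespace-separated terms into an AND query string,
    e.g. "foo bar" -> "'foo' AND 'bar'". Hypothetical helper, not
    the actual code from the pastebin links below."""
    terms = user_input.split()
    return " AND ".join("'%s'" % t for t in terms)
```

The resulting string would then be handed to Lucene's QueryParser as usual.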
Problem is, unlike most of the posts I've read, I don't want the first 10 or
100 results. I want the first 10,000, or even all of them. I've read that a
HitCollector is suited for this task, but my first Google search turned up an
emphatic "HitCollector is too slow on PyLucene", so I've more or less ruled
out that option. As it is right now, it takes minutes to get the results I
need. I'll post the code on pastebin and link it for anyone in a good mood
to read a n00b's code and help (see below). I've tracked down the problem
to the doc.get("PMID") call in the Searcher function.
My question is: how can I make my search faster? My index wasn't optimized
at first because it was huge and it was built with GCC. By now it is probably
optimized (I left an optimizer running last night), so that is taken care
of. I've also considered threading, since I perform three different
searches per "round". Thing is, I'm pretty green when it comes to
programming (I'm a biologist) and I've never really understood how
threading works. If someone can point me to the right tutorial or
documentation, I'd be glad to hack it up myself. Another option I've
been given was to use an implementation of Lucene written in C# or
C++. However, Lucene.net <http://lucene.net/> isn't up to date, and neither
is CLucene.
So, if you think you can give out a tip on how to make my script run faster,
I'd thank you more than a lot. It's a shame that my project fails because of
this technical handicap :(
LINKS:
http://pastebin.com/m6c384ede -> Main Code
http://pastebin.com/m3484ebfc -> Searcher Functions
Best regards to you all,
João Rodrigues
Re: Lucene - Search Optimization Problem
Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 24 February 2008 17:57:42, Daniel Naber wrote:
> On Sunday, 24 February 2008, João Rodrigues wrote:
> > Problem is, unlike most of the posts I've read, I don't want the
> > first 10 or 100 results. I want the first 10.000, or even all of
> > them.
>
> See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, you
> will probably need to make use of the FieldCache.
Further to that, I got the impression that you're doing document
retrieval, IndexReader.document(...), inside the HitCollector
(see also the javadocs of HitCollector).
If that is the case, that is a definite no-no;
use a TopDocs instead.
In case you really need a HitCollector: first collect _all_ the
doc numbers.
Normally, for best performance, retrieve the stored fields
in increasing order of doc numbers, even when using TopDocs.
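The two-pass pattern Paul describes can be sketched in plain Python; here `hits` stands in for whatever the collector saw, and `fetch_stored` stands in for IndexReader.document(...) — both names are illustrative, not the real PyLucene API:

```python
def collect_then_fetch(hits, fetch_stored):
    """Sketch of 'collect doc numbers first, retrieve stored fields later'.
    hits: iterable of (doc_number, score); fetch_stored: doc_number -> value."""
    # Pass 1: inside the collector, record only doc numbers --
    # no stored-field retrieval happens per hit.
    doc_nums = [doc for doc, _score in hits]
    # Sort so pass 2 reads the field store in increasing doc order,
    # which keeps disk access mostly sequential.
    doc_nums.sort()
    # Pass 2: retrieve stored fields once, in order.
    return [fetch_stored(d) for d in doc_nums]
```

For 10,000+ hits, deferring and ordering the stored-field reads like this is the difference between random and mostly sequential disk access.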
Regards,
Paul Elschot
Re: Lucene - Search Optimization Problem
Posted by Daniel Naber <lu...@danielnaber.de>.
On Sunday, 24 February 2008, João Rodrigues wrote:
> Problem is, unlike most of the posts I've read, I don't want the first
> 10 or 100 results. I want the first 10.000, or even all of them.
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, you will
probably need to make use of the FieldCache.
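The idea behind the FieldCache is to pay one up-front pass that loads every document's field value into an in-memory array indexed by doc number, so each per-hit lookup becomes an array read instead of a disk access. A toy pure-Python sketch of that trade-off (the class here is illustrative; the real thing is Lucene's FieldCache):

```python
class SimpleFieldCache:
    """Toy stand-in for Lucene's FieldCache: one up-front pass over
    the index, then all lookups are answered from memory."""
    def __init__(self, stored_values):
        # stored_values: list where the index is the doc number
        self._values = list(stored_values)

    def get(self, doc_num):
        return self._values[doc_num]  # O(1), no disk seek
```

For an integer ID field like PMID, holding one int per document in RAM is cheap even at 5,000,000 documents.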
Regards
Daniel
--
http://www.danielnaber.de
Re: Lucene - Search Optimization Problem
Posted by João Rodrigues <an...@gmail.com>.
This is to all who gave me tips on my search optimization problem: thank
you :)
With just a simple optimization, not storing a field I didn't need, and a
change in the loop (not using topdocs), I managed to reduce the search time
to a tenth (!!!). Here are some very preliminary results, made on two
indexes:
A) Old configuration (except this one was optimized), 10,000 documents
indexed:
Initializing VM machine
Setting variables
Searching for: gene OR dna OR protein OR patient OR disease OR metabolism OR
pathway OR chromossome OR human OR mouse OR biochemical OR lipid OR glucose
OR diabetes OR science OR health
4397
0:00:09.094805
----- Right afterwards (with supposedly the results in cache) ------------
Initializing VM machine
Setting variables
Searching for: gene OR dna OR protein OR patient OR disease OR metabolism OR
pathway OR chromossome OR human OR mouse OR biochemical OR lipid OR glucose
OR diabetes OR science OR health
4397
0:00:00.563862
B) New Configuration, with the same documents indexed:
Initializing VM machine
Setting variables
Searching for: gene OR dna OR protein OR patient OR disease OR metabolism OR
pathway OR chromossome OR human OR mouse OR biochemical OR lipid OR glucose
OR diabetes OR science OR health
4397
0:00:00.352372
---- Right after the first one (with supposedly cache results) -----
Initializing VM machine
Setting variables
Searching for: gene OR dna OR protein OR patient OR disease OR metabolism OR
pathway OR chromossome OR human OR mouse OR biochemical OR lipid OR glucose
OR diabetes OR science OR health
4397
0:00:00.351742
-------------------------------------
Now I have a question: shouldn't the second search, with the cached results,
be significantly faster? If not, why is it on configuration A and not on B?
Again, thanks a lot for all the comments. I'll still try the other things
you've suggested, but at least for now this'll shut up my colleagues who
advised me to drop Lucene ;)
Cheers everyone!
Re: Lucene - Search Optimization Problem
Posted by Wolfgang Täger <wt...@epo.org>.
Hi João,
if you need 10,000 or more hits, this might require 10,000 or more disk
accesses. Given the access time of disks, there is probably no way to get
significantly faster using Lucene on the same hardware.
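A back-of-the-envelope check of that point, assuming a typical ~10 ms average random-access time for a 2008-era hard disk (my figure, not Wolfgang's):

```python
# One random seek per stored-field read, at an assumed ~10 ms each:
seek_ms = 10
hits = 10_000
print(hits * seek_ms / 1000.0, "seconds")  # on the order of 100 seconds
```

That lines up with the "minutes per search" João reports, and explains why either cutting the per-hit disk reads or moving the data into faster storage is what helps.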
Either you can organise your data so that it is more local on the hard disk
(which you probably can't), or you need memory with a lower access time than
hard disks: more RAM for caching, an SSD, or other flash drives.
You might try a cheap 8GB USB stick with low access time.
Another possibility is to use a suitable OS with at least 8GB of RAM.
If you do so, please share your results.
Best regards,
Wolfgang Täger