Posted to java-user@lucene.apache.org by "Kevin A. Burton" <bu...@newsmonster.org> on 2004/07/07 07:44:40 UTC

Most efficient way to index 14M documents (out of memory/file handles)

I'm trying to burn an index of 14M documents.

I have two problems.

1.  I have to run optimize() every 50k documents or I run out of file 
handles.  This takes TIME and, of course, is linear in the size of the 
index, so it just gets slower the further I get.  It starts to crawl 
at about 3M documents.

2.  I eventually will run out of memory in this configuration.

I KNOW this has been covered before, but for the life of me I can't find 
it in the archives, the FAQ or the wiki.

I'm using an IndexWriter with a mergeFactor of 5k and then optimizing 
every 50k documents.

Does it make sense to just create a new IndexWriter for every 50k docs 
and then do one big optimize() at the end?

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




Re: Most efficient way to index 14M documents (out of memory/file handles)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Julien,
>
> Thanks for the excellent explanation.
>
> I think this thread points to a documentation problem. We should 
> improve the javadoc for these parameters to make it easier for folks to 
> use them correctly.
>
> In particular, the javadoc for mergeFactor should mention that very 
> large values (>100) are not recommended, since they can run into file 
> handle limitations with FSDirectory. The maximum number of open files 
> while merging is around mergeFactor * (5 + number of indexed fields). 
> Perhaps mergeFactor should be tagged an "Expert" parameter to 
> discourage folks playing with it, as it is such a common source of 
> problems.
>
> The javadoc should instead encourage using minMergeDocs to increase 
> indexing speed by using more memory. This parameter is unfortunately 
> poorly named. It should really be called something like maxBufferedDocs.

I'd like to see something like this done...

BTW, I'm willing to add it to the wiki in the interim.

This conversation has happened a few times now...

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




Re: Most efficient way to index 14M documents (out of memory/file handles)

Posted by Doug Cutting <cu...@apache.org>.
Julien,

Thanks for the excellent explanation.

I think this thread points to a documentation problem.  We should 
improve the javadoc for these parameters to make it easier for folks to 
use them correctly.

In particular, the javadoc for mergeFactor should mention that very 
large values (>100) are not recommended, since they can run into file 
handle limitations with FSDirectory.  The maximum number of open files 
while merging is around mergeFactor * (5 + number of indexed fields). 
Perhaps mergeFactor should be tagged an "Expert" parameter to discourage 
folks playing with it, as it is such a common source of problems.
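
With mergeFactor set to 5000 and, say, ten indexed fields, that bound works 
out to 5000 * (5 + 10) = 75,000 open files while merging, far beyond a 
typical per-process limit of 1024.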

The javadoc should instead encourage using minMergeDocs to increase 
indexing speed by using more memory.  This parameter is unfortunately 
poorly named.  It should really be called something like maxBufferedDocs.

Doug

Julien Nioche wrote:
> It is not surprising that you run out of file handles with such a large
> mergeFactor.
> 
> Before trying more complex strategies involving RAMDirectories and/or
> splitting your indexing across several machines, I reckon you should try
> simple things like using a low mergeFactor (e.g. 10) combined with a higher
> minMergeDocs (e.g. 1000), and optimize only at the end of the process.
> 
> By setting a higher value for minMergeDocs, you'll index and merge in a
> RAMDirectory. When the limit is reached (e.g. 1000 docs), a segment is
> written to the FS. mergeFactor controls the number of segments to be merged,
> so when you have 10 segments on the FS (which is already 10 x 1000 docs), the
> IndexWriter will merge them all into a single segment. This is equivalent to
> an optimize, I think. The process continues like that until it's finished.
> 
> Combining these parameters should be enough to achieve good performance.
> The good point of using minMergeDocs is that you make heavy use of the
> RAMDirectory inside your IndexWriter (== fast) without having to watch RAM
> too carefully (which would be the case with an explicit RAMDirectory). At the
> same time, keeping your mergeFactor low limits the risk of running into the
> too-many-file-handles problem.
> 
> 
> ----- Original Message ----- 
> From: "Kevin A. Burton" <bu...@newsmonster.org>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Wednesday, July 07, 2004 7:44 AM
> Subject: Most efficient way to index 14M documents (out of memory/file
> handles)
> 
> 
> 
>>I'm trying to burn an index of 14M documents.
>>
>>I have two problems.
>>
>>1.  I have to run optimize() every 50k documents or I run out of file
>>handles.  This takes TIME and, of course, is linear in the size of the
>>index, so it just gets slower the further I get.  It starts to crawl
>>at about 3M documents.
>>
>>2.  I eventually will run out of memory in this configuration.
>>
>>I KNOW this has been covered before, but for the life of me I can't find
>>it in the archives, the FAQ or the wiki.
>>
>>I'm using an IndexWriter with a mergeFactor of 5k and then optimizing
>>every 50k documents.
>>
>>Does it make sense to just create a new IndexWriter for every 50k docs
>>and then do one big optimize() at the end?
>>
>>Kevin
>>



Re: Most efficient way to index 14M documents (out of memory/file handles)

Posted by Julien Nioche <Ju...@lingway.com>.
It is not surprising that you run out of file handles with such a large
mergeFactor.

Before trying more complex strategies involving RAMDirectories and/or
splitting your indexing across several machines, I reckon you should try
simple things like using a low mergeFactor (e.g. 10) combined with a higher
minMergeDocs (e.g. 1000), and optimize only at the end of the process.

By setting a higher value for minMergeDocs, you'll index and merge in a
RAMDirectory. When the limit is reached (e.g. 1000 docs), a segment is written
to the FS. mergeFactor controls the number of segments to be merged, so when
you have 10 segments on the FS (which is already 10 x 1000 docs), the
IndexWriter will merge them all into a single segment. This is equivalent to
an optimize, I think. The process continues like that until it's finished.

Combining these parameters should be enough to achieve good performance.
The good point of using minMergeDocs is that you make heavy use of the
RAMDirectory inside your IndexWriter (== fast) without having to watch RAM
too carefully (which would be the case with an explicit RAMDirectory). At the
same time, keeping your mergeFactor low limits the risk of running into the
too-many-file-handles problem.
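
A minimal sketch of that setup, assuming the Lucene 1.4-era API (where
mergeFactor and minMergeDocs are public fields on IndexWriter; the path,
analyzer and field are placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class BulkIndexer {
    public static void main(String[] args) throws Exception {
      // true = create a new index at this path
      IndexWriter writer =
          new IndexWriter("/tmp/bigindex", new StandardAnalyzer(), true);
      writer.mergeFactor = 10;     // merge few segments at once: few file handles
      writer.minMergeDocs = 1000;  // buffer this many docs in RAM per segment
      for (int i = 0; i < 14000000; i++) {
        Document doc = new Document();
        doc.add(Field.Text("contents", "document body " + i)); // placeholder
        writer.addDocument(doc);
      }
      writer.optimize();           // optimize once, at the very end
      writer.close();
    }
  }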


----- Original Message ----- 
From: "Kevin A. Burton" <bu...@newsmonster.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, July 07, 2004 7:44 AM
Subject: Most efficient way to index 14M documents (out of memory/file
handles)


> I'm trying to burn an index of 14M documents.
>
> I have two problems.
>
> 1.  I have to run optimize() every 50k documents or I run out of file
> handles.  This takes TIME and, of course, is linear in the size of the
> index, so it just gets slower the further I get.  It starts to crawl
> at about 3M documents.
>
> 2.  I eventually will run out of memory in this configuration.
>
> I KNOW this has been covered before, but for the life of me I can't find
> it in the archives, the FAQ or the wiki.
>
> I'm using an IndexWriter with a mergeFactor of 5k and then optimizing
> every 50k documents.
>
> Does it make sense to just create a new IndexWriter for every 50k docs
> and then do one big optimize() at the end?
>
> Kevin
>




Re: Most efficient way to index 14M documents (out of memory/file handles)

Posted by Harald Kirsch <ki...@ebi.ac.uk>.
On Tue, Jul 06, 2004 at 10:44:40PM -0700, Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
> 
> I have two problems.
> 
> 1.  I have to run optimize() every 50k documents or I run out of file 
> handles.  This takes TIME and, of course, is linear in the size of the 
> index, so it just gets slower the further I get.  It starts to crawl 
> at about 3M documents.

Recently I indexed roughly this many documents. I first split the whole
thing into 100 jobs (we happen to have that many machines in the cluster :-),
each indexing its share into its own index. I used mergeFactor=100 and only
optimized just before closing the index.

Then I merged them all into one index simply by

  writer.mergeFactor = 150; 
  writer.addIndexes(dirs);

I was surprised myself that it went through easily, in under two hours
for each of the 101 indexes. The documents have, however, only
three fields.
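
In case it helps, a sketch of that merge step against the Lucene 1.4 API
(the index paths are placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeIndexes {
    public static void main(String[] args) throws Exception {
      // open the 100 per-machine sub-indexes (false = do not create)
      Directory[] dirs = new Directory[100];
      for (int i = 0; i < dirs.length; i++) {
        dirs[i] = FSDirectory.getDirectory("/data/index-" + i, false);
      }
      // merge them all into one target index
      IndexWriter writer =
          new IndexWriter("/data/merged", new StandardAnalyzer(), true);
      writer.mergeFactor = 150;
      writer.addIndexes(dirs);  // merges and optimizes the combined index
      writer.close();
    }
  }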

  Maybe this helps,
  Harald.

-- 
------------------------------------------------------------------------
Harald Kirsch | kirsch@ebi.ac.uk | +44 (0) 1223/49-2593



RE: Anyone use MultiSearcher class

Posted by Mark Florence <mf...@airsmail.com>.
'Fraid not! Just a humble user :)

-- Mark

-----Original Message-----
From: Don Vaillancourt [mailto:donv@webimpact.com]
Sent: Monday, July 26, 2004 12:14 pm
To: Lucene Users List
Subject: RE: Anyone use MultiSearcher class


Eh Mark,

Are you involved with Lucene development?


At 11:39 AM 26/07/2004, you wrote:
>Don, at the low level, the issue isn't necessarily caching results
>from page-to-page (as viewed by some UI). Such a cache would need to
>be co-ordinated with index writes.
>
>Rather, I plan to focus on the way Hits first reads 100 hits, then 200,
>then 400 and so on -- but all Hits knows about is the MultiSearcher.
>This means that in order to find the 101st hit, Hits effectively asks
>ALL the searchers in the MultiSearcher to search again -- even though it
>could be known that SOME of those searchers are incapable of returning
>results.
>
>-- Mark Florence
>
>-----Original Message-----
>From: Don Vaillancourt [mailto:donv@webimpact.com]
>Sent: Monday, July 26, 2004 11:06 am
>To: Lucene Users List; Lucene Users List
>Subject: RE: Anyone use MultiSearcher class
>
>
>Thanks for the info.
>
>Maybe the best solution is to perform multiple individual
>searches, create a container class, store all the hits sorted by
>relevance within that class, and then cache/serialize this result for the
>current search for page-by-page manipulation.
>
>
>At 09:46 AM 15/07/2004, Mark Florence wrote:
> >Don, I think I finally understand your problem -- and mine -- with
> >MultiSearcher. I had tested an implementation of my system using
> >ParallelMultiSearcher to split a huge index over many computers.
> >I was very impressed by the results on my test data, but alarmed
> >after a trial with live data :)
> >
> >Consider MultiSearcher.search(Query Q). Suppose that Q aggregated
> >over ALL the Searchables in the MultiSearcher would return 1000
> >documents. But, the Hits object created by search() will only cache
> >the first 100 documents. When Hits.doc(101) is called, Hits will
> >cache 200 documents -- then 400, 800, 1600 and so on. How does Hits
> >get these extra documents? By calling the MultiSearcher again.
> >
> >Now consider a MultiSearcher as described above with 2 Searchables.
> >With respect to Q, Searchable S has 1000 documents, Searchable T
> >has zero. So to fetch the 101st document, not only is S searched,
> >but T is too, even though the result of Q applied to T is still zero
> >and will always be zero. The same thing will happen when fetching
> >the 201st, 401st and 801st document.
> >
> >This accounts for my slow performance, and I think yours too. That
> >your observed degradation is a power of 2 is a clue.
> >
> >My performance is especially vulnerable because "slave" Searchables
> >in the MultiSearcher are Remote -- accessed via RMI.
> >
> >I guess I have to code smarter around MultiSearcher. One problem
> >you highlight is that Hits is final -- so it is not possible even to
> >modify the "100/200/400" cache size logic.
> >
> >Any ideas from anyone would be much appreciated.
> >
> >Mark Florence
> >CTO, AIRS
> >800-897-7714 x 1703
> >mflorence@airsmail.com
> >
> >
> >
> >
> >-----Original Message-----
> >From: Don Vaillancourt [mailto:donv@webimpact.com]
> >Sent: Monday, July 12, 2004 12:36 pm
> >To: Lucene Users List
> >Subject: Anyone use MultiSearcher class
> >
> >
> >Hello,
> >
> >Has anyone used the MultiSearcher class?
> >
> >I have noticed that searching two indexes using this MultiSearcher class
> >takes 8 times longer than searching only one index.  I could understand if
> >it took 3 to 4 times longer to search due to sorting the two search results
> >and stuff, but why 8 times longer?
> >
> >Is there some optimization that can be done to hasten the search?  Or
> >should I just write my own MultiSearcher?  The problem, though, is that there
> >is no way for me to create my own Hits object (no methods are available and
> >the class is final).
> >
> >Anyone have any clue?
> >
> >Thanks
> >
> >

Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com






RE: Anyone use MultiSearcher class

Posted by Don Vaillancourt <do...@webimpact.com>.
Eh Mark,

Are you involved with Lucene development?


At 11:39 AM 26/07/2004, you wrote:
>Don, at the low level, the issue isn't necessarily caching results
>from page-to-page (as viewed by some UI). Such a cache would need to
>be co-ordinated with index writes.
>
>Rather, I plan to focus on the way Hits first reads 100 hits, then 200,
>then 400 and so on -- but all Hits knows about is the MultiSearcher.
>This means that in order to find the 101st hit, Hits effectively asks
>ALL the searchers in the MultiSearcher to search again -- even though it
>could be known that SOME of those searchers are incapable of returning
>results.
>
>-- Mark Florence
>
>-----Original Message-----
>From: Don Vaillancourt [mailto:donv@webimpact.com]
>Sent: Monday, July 26, 2004 11:06 am
>To: Lucene Users List; Lucene Users List
>Subject: RE: Anyone use MultiSearcher class
>
>
>Thanks for the info.
>
>Maybe the best solution is to perform multiple individual
>searches, create a container class, store all the hits sorted by
>relevance within that class, and then cache/serialize this result for the
>current search for page-by-page manipulation.
>
>
>At 09:46 AM 15/07/2004, Mark Florence wrote:
> >Don, I think I finally understand your problem -- and mine -- with
> >MultiSearcher. I had tested an implementation of my system using
> >ParallelMultiSearcher to split a huge index over many computers.
> >I was very impressed by the results on my test data, but alarmed
> >after a trial with live data :)
> >
> >Consider MultiSearcher.search(Query Q). Suppose that Q aggregated
> >over ALL the Searchables in the MultiSearcher would return 1000
> >documents. But, the Hits object created by search() will only cache
> >the first 100 documents. When Hits.doc(101) is called, Hits will
> >cache 200 documents -- then 400, 800, 1600 and so on. How does Hits
> >get these extra documents? By calling the MultiSearcher again.
> >
> >Now consider a MultiSearcher as described above with 2 Searchables.
> >With respect to Q, Searchable S has 1000 documents, Searchable T
> >has zero. So to fetch the 101st document, not only is S searched,
> >but T is too, even though the result of Q applied to T is still zero
> >and will always be zero. The same thing will happen when fetching
> >the 201st, 401st and 801st document.
> >
> >This accounts for my slow performance, and I think yours too. That
> >your observed degradation is a power of 2 is a clue.
> >
> >My performance is especially vulnerable because "slave" Searchables
> >in the MultiSearcher are Remote -- accessed via RMI.
> >
> >I guess I have to code smarter around MultiSearcher. One problem
> >you highlight is that Hits is final -- so it is not possible even to
> >modify the "100/200/400" cache size logic.
> >
> >Any ideas from anyone would be much appreciated.
> >
> >Mark Florence
> >CTO, AIRS
> >800-897-7714 x 1703
> >mflorence@airsmail.com
> >
> >
> >
> >
> >-----Original Message-----
> >From: Don Vaillancourt [mailto:donv@webimpact.com]
> >Sent: Monday, July 12, 2004 12:36 pm
> >To: Lucene Users List
> >Subject: Anyone use MultiSearcher class
> >
> >
> >Hello,
> >
> >Has anyone used the MultiSearcher class?
> >
> >I have noticed that searching two indexes using this MultiSearcher class
> >takes 8 times longer than searching only one index.  I could understand if
> >it took 3 to 4 times longer to search due to sorting the two search results
> >and stuff, but why 8 times longer?
> >
> >Is there some optimization that can be done to hasten the search?  Or
> >should I just write my own MultiSearcher?  The problem, though, is that there
> >is no way for me to create my own Hits object (no methods are available and
> >the class is final).
> >
> >Anyone have any clue?
> >
> >Thanks
> >
> >

Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com




RE: Anyone use MultiSearcher class

Posted by Mark Florence <mf...@airsmail.com>.
Don, at the low level, the issue isn't necessarily caching results
from page-to-page (as viewed by some UI). Such a cache would need to
be co-ordinated with index writes.

Rather, I plan to focus on the way Hits first reads 100 hits, then 200,
then 400 and so on -- but all Hits knows about is the MultiSearcher.
This means that in order to find the 101st hit, Hits effectively asks
ALL the searchers in the MultiSearcher to search again -- even though it
could be known that SOME of those searchers are incapable of returning
results.

-- Mark Florence

-----Original Message-----
From: Don Vaillancourt [mailto:donv@webimpact.com]
Sent: Monday, July 26, 2004 11:06 am
To: Lucene Users List; Lucene Users List
Subject: RE: Anyone use MultiSearcher class


Thanks for the info.

Maybe the best solution is to perform multiple individual
searches, create a container class, store all the hits sorted by
relevance within that class, and then cache/serialize this result for the
current search for page-by-page manipulation.


At 09:46 AM 15/07/2004, Mark Florence wrote:
>Don, I think I finally understand your problem -- and mine -- with
>MultiSearcher. I had tested an implementation of my system using
>ParallelMultiSearcher to split a huge index over many computers.
>I was very impressed by the results on my test data, but alarmed
>after a trial with live data :)
>
>Consider MultiSearcher.search(Query Q). Suppose that Q aggregated
>over ALL the Searchables in the MultiSearcher would return 1000
>documents. But, the Hits object created by search() will only cache
>the first 100 documents. When Hits.doc(101) is called, Hits will
>cache 200 documents -- then 400, 800, 1600 and so on. How does Hits
>get these extra documents? By calling the MultiSearcher again.
>
>Now consider a MultiSearcher as described above with 2 Searchables.
>With respect to Q, Searchable S has 1000 documents, Searchable T
>has zero. So to fetch the 101st document, not only is S searched,
>but T is too, even though the result of Q applied to T is still zero
>and will always be zero. The same thing will happen when fetching
>the 201st, 401st and 801st document.
>
>This accounts for my slow performance, and I think yours too. That
>your observed degradation is a power of 2 is a clue.
>
>My performance is especially vulnerable because "slave" Searchables
>in the MultiSearcher are Remote -- accessed via RMI.
>
>I guess I have to code smarter around MultiSearcher. One problem
>you highlight is that Hits is final -- so it is not possible even to
>modify the "100/200/400" cache size logic.
>
>Any ideas from anyone would be much appreciated.
>
>Mark Florence
>CTO, AIRS
>800-897-7714 x 1703
>mflorence@airsmail.com
>
>
>
>
>-----Original Message-----
>From: Don Vaillancourt [mailto:donv@webimpact.com]
>Sent: Monday, July 12, 2004 12:36 pm
>To: Lucene Users List
>Subject: Anyone use MultiSearcher class
>
>
>Hello,
>
>Has anyone used the MultiSearcher class?
>
>I have noticed that searching two indexes using this MultiSearcher class
>takes 8 times longer than searching only one index.  I could understand if
>it took 3 to 4 times longer to search due to sorting the two search results
>and stuff, but why 8 times longer?
>
>Is there some optimization that can be done to hasten the search?  Or
>should I just write my own MultiSearcher?  The problem, though, is that there
>is no way for me to create my own Hits object (no methods are available and
>the class is final).
>
>Anyone have any clue?
>
>Thanks
>
>

Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com






RE: Anyone use MultiSearcher class

Posted by Don Vaillancourt <do...@webimpact.com>.
Thanks for the info.

Maybe the best solution is to perform multiple individual 
searches, create a container class, store all the hits sorted by 
relevance within that class, and then cache/serialize this result for the 
current search for page-by-page manipulation.


At 09:46 AM 15/07/2004, Mark Florence wrote:
>Don, I think I finally understand your problem -- and mine -- with
>MultiSearcher. I had tested an implementation of my system using
>ParallelMultiSearcher to split a huge index over many computers.
>I was very impressed by the results on my test data, but alarmed
>after a trial with live data :)
>
>Consider MultiSearcher.search(Query Q). Suppose that Q aggregated
>over ALL the Searchables in the MultiSearcher would return 1000
>documents. But, the Hits object created by search() will only cache
>the first 100 documents. When Hits.doc(101) is called, Hits will
>cache 200 documents -- then 400, 800, 1600 and so on. How does Hits
>get these extra documents? By calling the MultiSearcher again.
>
>Now consider a MultiSearcher as described above with 2 Searchables.
>With respect to Q, Searchable S has 1000 documents, Searchable T
>has zero. So to fetch the 101st document, not only is S searched,
>but T is too, even though the result of Q applied to T is still zero
>and will always be zero. The same thing will happen when fetching
>the 201st, 401st and 801st document.
>
>This accounts for my slow performance, and I think yours too. That
>your observed degradation is a power of 2 is a clue.
>
>My performance is especially vulnerable because "slave" Searchables
>in the MultiSearcher are Remote -- accessed via RMI.
>
>I guess I have to code smarter around MultiSearcher. One problem
>you highlight is that Hits is final -- so it is not possible even to
>modify the "100/200/400" cache size logic.
>
>Any ideas from anyone would be much appreciated.
>
>Mark Florence
>CTO, AIRS
>800-897-7714 x 1703
>mflorence@airsmail.com
>
>
>
>
>-----Original Message-----
>From: Don Vaillancourt [mailto:donv@webimpact.com]
>Sent: Monday, July 12, 2004 12:36 pm
>To: Lucene Users List
>Subject: Anyone use MultiSearcher class
>
>
>Hello,
>
>Has anyone used the MultiSearcher class?
>
>I have noticed that searching two indexes using this MultiSearcher class
>takes 8 times longer than searching only one index.  I could understand if
>it took 3 to 4 times longer to search due to sorting the two search results
>and stuff, but why 8 times longer?
>
>Is there some optimization that can be done to hasten the search?  Or
>should I just write my own MultiSearcher?  The problem, though, is that there
>is no way for me to create my own Hits object (no methods are available and
>the class is final).
>
>Anyone have any clue?
>
>Thanks
>
>

Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com




Re: Anyone use MultiSearcher class

Posted by Zilverline info <in...@zilverline.org>.
Hi Don,

Yes, I'm using the MultiSearcher (in Zilverline), and have seen no 
serious performance issues with it. The app performs well with multiple 
indexes; it responds so quickly (with 100k+ documents) that I haven't 
even taken the time to measure the difference from a single-index search.

Michael Franken

Don Vaillancourt wrote:

> Hello,
>
> Has anyone used the MultiSearcher class?
>
> I have noticed that searching two indexes using this MultiSearcher 
> class takes 8 times longer than searching only one index.  I could 
> understand if it took 3 to 4 times longer to search due to sorting the 
> two search results and stuff, but why 8 times longer?
>
> Is there some optimization that can be done to hasten the search?  Or 
> should I just write my own MultiSearcher?  The problem, though, is that 
> there is no way for me to create my own Hits object (no methods are 
> available and the class is final).
>
> Anyone have any clue?
>
> Thanks
>
>


RE: Anyone use MultiSearcher class

Posted by Mark Florence <mf...@airsmail.com>.
Don, I think I finally understand your problem -- and mine -- with
MultiSearcher. I had tested an implementation of my system using
ParallelMultiSearcher to split a huge index over many computers.
I was very impressed by the results on my test data, but alarmed
after a trial with live data :)

Consider MultiSearcher.search(Query Q). Suppose that Q aggregated
over ALL the Searchables in the MultiSearcher would return 1000
documents. But, the Hits object created by search() will only cache
the first 100 documents. When Hits.doc(101) is called, Hits will
cache 200 documents -- then 400, 800, 1600 and so on. How does Hits
get these extra documents? By calling the MultiSearcher again.

Now consider a MultiSearcher as described above with 2 Searchables.
With respect to Q, Searchable S has 1000 documents, Searchable T
has zero. So to fetch the 101st document, not only is S searched,
but T is too, even though the result of Q applied to T is still zero
and will always be zero. The same thing will happen when fetching
the 201st, 401st and 801st document.

This accounts for my slow performance, and I think yours too. That
your observed degradation is a power of 2 is a clue.

My performance is especially vulnerable because "slave" Searchables
in the MultiSearcher are Remote -- accessed via RMI.

I guess I have to code smarter around MultiSearcher. One problem
you highlight is that Hits is final -- so it is not possible even to
modify the "100/200/400" cache size logic.
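
One workaround I can see is to skip Hits and use the lower-level search
call that returns a fixed number of results in a single pass. A sketch,
assuming the Lucene 1.4 "expert" API (index paths, field and query are
placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MultiSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.Searchable;
  import org.apache.lucene.search.TopDocs;

  public class TopDocsSearch {
    public static void main(String[] args) throws Exception {
      Searchable[] searchables = {
        new IndexSearcher("/data/index-a"),  // placeholder paths
        new IndexSearcher("/data/index-b")
      };
      MultiSearcher searcher = new MultiSearcher(searchables);
      Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
      // fetch the top 1000 hits in one pass -- no Hits object, so no
      // 100/200/400 re-query of every searchable as the caller pages forward
      TopDocs top = searcher.search(query, null, 1000);
      for (int i = 0; i < top.scoreDocs.length; i++) {
        ScoreDoc sd = top.scoreDocs[i];
        Document doc = searcher.doc(sd.doc);  // stored fields, fetched on demand
        System.out.println(sd.score + " " + doc.get("contents"));
      }
    }
  }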

Any ideas from anyone would be much appreciated.

Mark Florence
CTO, AIRS
800-897-7714 x 1703
mflorence@airsmail.com




-----Original Message-----
From: Don Vaillancourt [mailto:donv@webimpact.com]
Sent: Monday, July 12, 2004 12:36 pm
To: Lucene Users List
Subject: Anyone use MultiSearcher class


Hello,

Has anyone used the MultiSearcher class?

I have noticed that searching two indexes using this MultiSearcher class
takes 8 times longer than searching only one index.  I could understand if
it took 3 to 4 times longer to search due to sorting the two search results
and stuff, but why 8 times longer?

Is there some optimization that can be done to hasten the search?  Or
should I just write my own MultiSearcher?  The problem, though, is that there
is no way for me to create my own Hits object (no methods are available and
the class is final).

Anyone have any clue?

Thanks


Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com






Anyone use MultiSearcher class

Posted by Don Vaillancourt <do...@webimpact.com>.
Hello,

Has anyone used the MultiSearcher class?

I have noticed that searching two indexes using this MultiSearcher class 
takes 8 times longer than searching only one index.  I could understand if 
it took 3 to 4 times longer to search due to sorting the two search results 
and stuff, but why 8 times longer?

Is there some optimization that can be done to hasten the search?  Or 
should I just write my own MultiSearcher?  The problem, though, is that there 
is no way for me to create my own Hits object (no methods are available and 
the class is final).
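
For reference, the calls in question look roughly like this (a sketch; the
index paths, field and query are placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MultiSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Searchable;

  public class TwoIndexSearch {
    public static void main(String[] args) throws Exception {
      Searchable[] searchables = {
        new IndexSearcher("/data/index-a"),  // placeholder paths
        new IndexSearcher("/data/index-b")
      };
      MultiSearcher searcher = new MultiSearcher(searchables);
      Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
      Hits hits = searcher.search(query);
      // print the first page of results
      for (int i = 0; i < hits.length() && i < 10; i++) {
        System.out.println(hits.score(i) + " " + hits.doc(i).get("contents"));
      }
    }
  }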

Anyone have any clue?

Thanks


Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com




MultiSearcher is very slow

Posted by Don Vaillancourt <do...@webimpact.com>.
Hi all,

I've managed to add multi-index searching capability to my code.  But one 
thing I have noticed is that Lucene is extremely slow when searching.

For example, I have been testing with 2 indexes for the past month or so, and 
searching them returns results in under 250ms, sometimes even surprising 
me at under 50ms.  All of this on my desktop computer.

If I search both indexes with the MultiSearcher class, it takes over 2000ms 
to search.  I have compared the search results and they do match.  But why 
is it taking so much longer?

I'm assuming that each index is searched separately and then the Hits 
results from each search are merged and sorted.  But still, I don't think 
that this should take any longer than 750ms.

Is it just the way Lucene's MultiSearcher class was engineered that makes it 
slow, or is there something that I am doing wrong?
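
Roughly what I'm timing, as a sketch (index paths, field and query are
placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MultiSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Searchable;

  public class TimeSearch {
    public static void main(String[] args) throws Exception {
      IndexSearcher a = new IndexSearcher("/data/index-a");  // placeholder paths
      IndexSearcher b = new IndexSearcher("/data/index-b");
      Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());

      long t0 = System.currentTimeMillis();
      Hits single = a.search(query);          // one index alone
      long t1 = System.currentTimeMillis();
      Hits multi = new MultiSearcher(new Searchable[] { a, b }).search(query);
      long t2 = System.currentTimeMillis();   // both through MultiSearcher

      System.out.println("single: " + single.length() + " hits, " + (t1 - t0) + "ms");
      System.out.println("multi:  " + multi.length() + " hits, " + (t2 - t1) + "ms");
    }
  }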

Thanks


Don Vaillancourt
Director of Software Development

WEB IMPACT INC.
416-815-2000 ext. 245
email: donv@web-impact.com
web: http://www.web-impact.com




Re: Most efficient way to index 14M documents (out of memory/file handles)

Posted by Nader Henein <ns...@bayt.net>.
Here's the thread you want:

http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=1722573

Nader Henein

Kevin A. Burton wrote:

> I'm trying to burn an index of 14M documents.
>
> I have two problems.
>
> 1.  I have to run optimize() every 50k documents or I run out of file 
> handles.  This takes TIME and, of course, is linear in the size of the 
> index, so it just gets slower the further I get.  It starts to 
> crawl at about 3M documents.
>
> 2.  I eventually will run out of memory in this configuration.
>
> I KNOW this has been covered before, but for the life of me I can't 
> find it in the archives, the FAQ or the wiki.
> 
> I'm using an IndexWriter with a mergeFactor of 5k and then optimizing 
> every 50k documents.
>
> Does it make sense to just create a new IndexWriter for every 50k docs 
> and then do one big optimize() at the end?
>
> Kevin
>



Re: Most efficient way to index 14M documents (out of memory/file handles)

Posted by Doug Cutting <cu...@apache.org>.
A mergeFactor of 5000 is a bad idea.  If you want to index faster, try 
increasing minMergeDocs instead.  If you have lots of memory, this can 
probably be 5000 or higher.

Also, why do you optimize before you're done?  That only slows things. 
Perhaps you have to do it because you've set mergeFactor to such an 
extreme value?  I do not recommend a merge factor higher than 100.

Doug


Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
> 
> I have two problems.
> 
> 1.  I have to run optimize() every 50k documents or I run out of file 
> handles.  This takes TIME and, of course, is linear in the size of the 
> index, so it just gets slower the further I get.  It starts to crawl 
> at about 3M documents.
> 
> 2.  I eventually will run out of memory in this configuration.
> 
> I KNOW this has been covered before, but for the life of me I can't find 
> it in the archives, the FAQ or the wiki.
> 
> I'm using an IndexWriter with a mergeFactor of 5k and then optimizing 
> every 50k documents.
> 
> Does it make sense to just create a new IndexWriter for every 50k docs 
> and then do one big optimize() at the end?
> 
> Kevin
> 
