Posted to java-user@lucene.apache.org by Gary Moore <ga...@littlebunch.com> on 2008/08/09 00:28:16 UTC

2.3.2 Indexing Performance

Parsing and indexing 4.5 million MARC/XML bibliographic records took
~14 hrs. using 2.2.  The same job using 2.3 takes ~5 hrs. on the same
platform -- a quad processor Sun V440 w/8GB memory.  I'm using the
PerFieldAnalyzerWrapper (StandardAnalyzer and SnowballAnalyzer).

I'm impressed!  Is this typical?

Gary Moore
gary@littlebunch.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: 2.3.2 Indexing Performance

Posted by Michael McCandless <lu...@mikemccandless.com>.

Awesome!  Thanks for following up.

Mike

Gary Moore wrote:

> Finally got back to this.  The great bulk of the time is spent  
> parsing/tokenizing.  So, using 10 threads parsing/analyzing the 4.5M  
> docs and feeding them to an IndexWriter took 106 minutes including a  
> final optimization.   The index is 5.6 GB.   I'm tempted to try  
> multiple indexing threads but my guess is it won't buy that much  
> since the async writer more than kept up with the thread queue.
>
> Now, I'm even more impressed with 2.3!
> -Gary
> Michael McCandless wrote:
>>
>> Thanks for the data point!
>>
>> This is expected -- a lot of work went into increasing IndexWriter's
>> throughput in 2.3.
>>
>> Actually, I'd expect even more speedup if Lucene is indeed the
>> bottleneck in your app.  To check, you could time just
>> creating/parsing & tokenizing the docs (from whatever is holding
>> them).  You might also eke out more performance by following the
>> suggestions here:
>>
>>    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>>
>> Since you've got 4 CPUs and lots of RAM you should definitely use  
>> multiple indexing threads with a large RAM buffer.
>>
>> Mike
>>
>> Gary Moore wrote:
>>
>>> Parsing and indexing 4.5 million MARC/XML bibliographic records  
>>> was requiring ~14 hrs. using 2.2.  The same job using 2.3 takes ~  
>>> 5 hrs. on the same platform --  a quad processor Sun V440 w/8GB  
>>> memory.   I'm using the PerFieldAnalyzerWrapper (StandardAnalyzer  
>>> and SnowballAnalyzer).
>>>
>>> I'm impressed!  Is this typical?
>>>
>>

Re: 2.3.2 Indexing Performance

Posted by Gary Moore <ga...@littlebunch.com>.

Finally got back to this.  The great bulk of the time is spent 
parsing/tokenizing.  So, using 10 threads parsing/analyzing the 4.5M 
docs and feeding them to an IndexWriter took 106 minutes including a 
final optimization.   The index is 5.6 GB.   I'm tempted to try multiple 
indexing threads but my guess is it won't buy that much since the async 
writer more than kept up with the thread queue.
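
For readers who want to see the shape of this setup, here is a minimal
sketch, not the code actually used in the job above.  It assumes a fixed
pool of parser threads that all hand their results to one shared writer.
DocSink and parse() are hypothetical stand-ins: in a real job the sink
would wrap a single IndexWriter (whose addDocument is safe to call from
several threads) and parse() would be the MARC/XML parsing and analysis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelFeed {

    /** Hypothetical stand-in for the one shared writer. */
    interface DocSink {
        void add(String parsedDoc) throws Exception;
    }

    /** Hypothetical stand-in for the expensive parse/tokenize step. */
    static String parse(String rawRecord) {
        return rawRecord.trim().toLowerCase();
    }

    /** Parses records on `threads` workers; every worker feeds the same sink. */
    static int feed(List<String> rawRecords, int threads, DocSink sink)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger added = new AtomicInteger();
        for (String raw : rawRecords) {
            pool.execute(() -> {
                try {
                    sink.add(parse(raw));      // parse in the worker, then hand off
                    added.incrementAndGet();
                } catch (Exception e) {
                    e.printStackTrace();       // a real job would log and count failures
                }
            });
        }
        pool.shutdown();                       // accept no new tasks, finish queued ones
        pool.awaitTermination(1, TimeUnit.HOURS);
        return added.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("  Record " + i + " ");
        }
        // A synchronized list plays the role of the thread-safe writer here.
        List<String> index = Collections.synchronizedList(new ArrayList<>());
        int n = feed(records, 10, index::add);
        System.out.println(n + " records added, index size " + index.size());
    }
}
```

Note that this sketch queues every record up front, which is fine for a
demo but would want a bounded queue for millions of records.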

Now, I'm even more impressed with 2.3!
-Gary
Michael McCandless wrote:
>
> Thanks for the data point!
>
> This is expected -- a lot of work went into increasing IndexWriter's
> throughput in 2.3.
>
> Actually, I'd expect even more speedup if Lucene is indeed the
> bottleneck in your app.  To check, you could time just
> creating/parsing & tokenizing the docs (from whatever is holding them).
> You might also eke out more performance by following the
> suggestions here:
>
>     http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> Since you've got 4 CPUs and lots of RAM you should definitely use 
> multiple indexing threads with a large RAM buffer.
>
> Mike
>
> Gary Moore wrote:
>
>> Parsing and indexing 4.5 million MARC/XML bibliographic records was 
>> requiring ~14 hrs. using 2.2.  The same job using 2.3 takes ~ 5 hrs. 
>> on the same platform --  a quad processor Sun V440 w/8GB memory.   
>> I'm using the PerFieldAnalyzerWrapper (StandardAnalyzer and 
>> SnowballAnalyzer).
>>
>> I'm impressed!  Is this typical?
>>
>



Re: 2.3.2 Indexing Performance

Posted by Michael McCandless <lu...@mikemccandless.com>.

Thanks for the data point!

This is expected -- a lot of work went into increasing IndexWriter's
throughput in 2.3.

Actually, I'd expect even more speedup if Lucene is indeed the
bottleneck in your app.  To check, you could time just creating/parsing
& tokenizing the docs (from whatever is holding them).  You might also
eke out more performance by following the suggestions here:

     http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

Since you've got 4 CPUs and lots of RAM you should definitely use  
multiple indexing threads with a large RAM buffer.
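
One way to run the timing test suggested above is to push the same
records through the pipeline twice: once with a sink that only counts
tokens, and once with the real indexing step.  The sketch below is an
illustration under assumptions, not a Lucene API: parseAndTokenize() and
the "index" sink are hypothetical stand-ins for the application's own
parsing and for the code that adds documents to the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class StageTimer {

    /** Hypothetical stand-in for parsing and tokenizing one record. */
    static String[] parseAndTokenize(String raw) {
        return raw.trim().toLowerCase().split("\\s+");
    }

    /** Runs every record through parse + sink; returns elapsed nanoseconds. */
    static long time(List<String> records, Consumer<String[]> sink) {
        long start = System.nanoTime();
        for (String raw : records) {
            sink.accept(parseAndTokenize(raw));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            records.add(" Title " + i + " by Some Author ");
        }

        // Pass 1: parse and tokenize only; counting tokens keeps the work observable.
        long[] tokenCount = {0};
        long parseOnly = time(records, tokens -> tokenCount[0] += tokens.length);

        // Pass 2: parse plus a stand-in for indexing (here: store the tokens).
        List<String[]> index = new ArrayList<>();
        long full = time(records, index::add);

        System.out.println("tokens seen:       " + tokenCount[0]);
        System.out.println("parse only (ms):   " + parseOnly / 1_000_000);
        System.out.println("parse + index (ms): " + full / 1_000_000);
    }
}
```

If the two timings are close, most of the time is going into parsing
and tokenizing rather than into the index writer.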

Mike

Gary Moore wrote:

> Parsing and indexing 4.5 million MARC/XML bibliographic records was  
> requiring ~14 hrs. using 2.2.  The same job using 2.3 takes ~ 5 hrs.  
> on the same platform --  a quad processor Sun V440 w/8GB memory.    
> I'm using the PerFieldAnalyzerWrapper (StandardAnalyzer and  
> SnowballAnalyzer).
>
> I'm impressed!  Is this typical?
>
> Gary Moore
> gary@littlebunch.com