Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/05/16 08:15:08 UTC

[Performance]: IndexWriter again...

Ok, I'm just following up on my email from 29th April titled  
'[Performanc]'  (don't you love it when you send before you've typed  
your subject line completely).  The thread is here:

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200504.mbox/%3C427198C5.5040408@aconex.com%3E

In summary, I still firmly believe that
IndexWriter.maybeMergeSegments() is chewing a lot more CPU than would
be ideal.  So I ran a simple test.  I ran the same test I've done
before, using mergeFactor(1000), maxBufferedDocs(10000),
useCompoundFile(false), indexing 5 fields (user first/lastname/email
address).

As a baseline using the latest SVN source code, I'm getting an
indexing rate of between 490-515 items/second over a number of runs.

By applying the attached simple patch to IndexWriter, I'm getting
between 945-970 items/second over a number of test runs.  That's a
significant speed up.  All the patch is doing is deferring the call to
maybeMergeSegments so it only does it every 2000 iterations (2000 is
totally arbitrary on my part).
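
To illustrate the deferral idea (this is not the attached patch, which
did not make it through to the list), here is a minimal sketch using a
toy writer.  Every name in it, including DeferredMergeCheck and
checkInterval, is hypothetical and not part of Lucene's API.  It only
assumes that each merge check scans the trailing one-document segments,
and it counts how many segments get visited with and without deferral.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names are Lucene's API. */
class DeferredMergeCheck {
    private final int maxBufferedDocs;
    private final int checkInterval;                           // hypothetical setting
    private final List<Integer> segments = new ArrayList<>();  // doc count per segment
    private int addsSinceCheck = 0;
    long segmentsVisited = 0;                                  // total cost of all scans

    DeferredMergeCheck(int maxBufferedDocs, int checkInterval) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.checkInterval = checkInterval;
    }

    void addDocument() {
        segments.add(1);                        // each new doc starts as a 1-doc segment
        if (++addsSinceCheck >= checkInterval) { // only check every checkInterval adds
            addsSinceCheck = 0;
            maybeMergeSegments();
        }
    }

    /** Merges whatever is still buffered so that no documents are left behind. */
    void close() {
        int buffered = countBuffered();
        if (buffered > 0) mergeTail(buffered);
    }

    private void maybeMergeSegments() {
        int buffered = countBuffered();
        if (buffered >= maxBufferedDocs) mergeTail(buffered);
    }

    /** Linear scan over the trailing 1-doc segments; this is the cost being deferred. */
    private int countBuffered() {
        int n = 0;
        for (int i = segments.size() - 1; i >= 0 && segments.get(i) == 1; i--) {
            n++;
            segmentsVisited++;
        }
        return n;
    }

    /** Replaces the last n segments with a single segment holding n docs. */
    private void mergeTail(int n) {
        segments.subList(segments.size() - n, segments.size()).clear();
        segments.add(n);
    }

    int totalDocs() {
        int total = 0;
        for (int count : segments) total += count;
        return total;
    }

    int segmentCount() { return segments.size(); }

    static DeferredMergeCheck run(int docs, int maxBufferedDocs, int checkInterval) {
        DeferredMergeCheck w = new DeferredMergeCheck(maxBufferedDocs, checkInterval);
        for (int i = 0; i < docs; i++) w.addDocument();
        w.close();
        return w;
    }

    public static void main(String[] args) {
        DeferredMergeCheck everyAdd = run(20_000, 10_000, 1);
        DeferredMergeCheck deferred = run(20_000, 10_000, 2_000);
        System.out.println("docs: " + everyAdd.totalDocs() + " vs " + deferred.totalDocs());
        System.out.println("segments visited: " + everyAdd.segmentsVisited
                + " vs " + deferred.segmentsVisited);
    }
}
```

One thing the toy model makes visible: if the check interval does not
divide maxBufferedDocs evenly, more than maxBufferedDocs documents can
pile up before a check notices, so deferral trades CPU for a looser
bound on buffered documents.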

I've verified with Luke that the index generated contains the same #
documents, and same # terms, but I have not had a chance to properly
set up my local environment to run the test cases.

Obviously the attached patch is a dirty hack of the highest order. In
my case I'm re-indexing from scratch every time, so there may be a
reason why we shouldn't be doing this sort of deferring of method
calls.  Perhaps the source code is optimized around incremental/batch
updates to _existing_ indexes, with the penalty that creating a new
index performs slower than one would like.

Perhaps IndexWriter could benefit from another setting that lets one  
configure how often to call maybeMergeSegments()?  That could of  
course confuse more people than it helps.

I would really appreciate anyone's thoughts on this.  I'll be very
happy to be proven wrong, because it will just help me understand more
of Lucene.  I would hope that speeding up indexing would benefit
everyone, particularly the large scale sites out there.

cheers,

Paul Smith




Re: [Performance]: IndexWriter again...

Posted by Yonik Seeley <ys...@gmail.com>.
I like the idea, Paul.

As far as how it should be implemented, perhaps a count of docs in
memory should be kept.  It doesn't seem necessary to traverse all of
the segments on every add (it's a linear operation, and will only
result in a merge every "minMergeDocs" or "maxBufferedDocs").
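
A minimal sketch of the counter idea, assuming a toy writer rather than
the real IndexWriter: BufferedDocCounter and its fields are hypothetical
names, not Lucene's API.  A running count of in-memory docs makes the
per-add check constant time, with no segment traversal at all.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names are Lucene's API. */
class BufferedDocCounter {
    private final int maxBufferedDocs;
    private final List<Integer> flushedSegments = new ArrayList<>(); // docs per flushed segment
    private int bufferedDocs = 0;  // running count, replaces the per-add segment scan
    int flushes = 0;

    BufferedDocCounter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument() {
        bufferedDocs++;
        if (bufferedDocs >= maxBufferedDocs) flush();  // O(1) check on every add
    }

    /** Flushes any remaining buffered docs. */
    void close() {
        if (bufferedDocs > 0) flush();
    }

    private void flush() {
        flushedSegments.add(bufferedDocs);  // the buffered docs become one segment
        bufferedDocs = 0;
        flushes++;
    }

    int totalDocs() {
        int total = 0;
        for (int count : flushedSegments) total += count;
        return total;
    }

    static BufferedDocCounter run(int docs, int maxBufferedDocs) {
        BufferedDocCounter w = new BufferedDocCounter(maxBufferedDocs);
        for (int i = 0; i < docs; i++) w.addDocument();
        w.close();
        return w;
    }

    public static void main(String[] args) {
        BufferedDocCounter w = run(25_000, 10_000);
        System.out.println("docs: " + w.totalDocs() + ", flushes: " + w.flushes);
    }
}
```

Unlike a fixed check interval, the counter never lets the buffer grow
past maxBufferedDocs, so no extra setting is needed in this model.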

-Yonik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [Performance]: IndexWriter again...

Posted by Paul Smith <ps...@aconex.com>.
On 16/05/2005, at 5:00 PM, Paul Elschot wrote:

> On Monday 16 May 2005 08:24, Paul Smith wrote:
>
>> something very odd is going on with my attachments...  sorry for the
>> spam.
>>
>>
> It's usually easier to open a bug in bugzilla and post the code and
> the concerns there. The only disadvantage of bugzilla is that
> you can only add an attachment after the bug is opened for the first
> time:
> http://issues.apache.org/bugzilla/enter_bug.cgi
>

Thanks Paul.  I'm not sure why subsequent attempts are still stripping
the attachment.  I'll go ahead and file something in bugzilla, and
cross my fingers I don't look any sillier than I do now.

cheers,

Paul Smith



Re: [Performance]: IndexWriter again...

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 16 May 2005 08:24, Paul Smith wrote:
> something very odd is going on with my attachments...  sorry for the  
> spam.
> 
It's usually easier to open a bug in bugzilla and post the code and
the concerns there. The only disadvantage of bugzilla is that
you can only add an attachment after the bug is opened for the first time:
http://issues.apache.org/bugzilla/enter_bug.cgi

Regards,
Paul Elschot





Re: [Performance]: IndexWriter again...

Posted by Paul Smith <ps...@aconex.com>.
something very odd is going on with my attachments...  sorry for the  
spam.


Re: [Performance]: IndexWriter again...

Posted by Paul Smith <ps...@aconex.com>.
I'm not even going to say anything this time.... :-$

On 16/05/2005, at 4:17 PM, Paul Smith wrote:

> Silly me, here's the patch with the extra code NOT commented out...
>
> Oh my, how embarrassing... :)
>
>
>
> Paul
>
> On 16/05/2005, at 4:15 PM, Paul Smith wrote:
>
>
>> <IndexWriter.patch>
>>
>
>



Re: [Performance]: IndexWriter again...

Posted by Paul Smith <ps...@aconex.com>.
Silly me, here's the patch with the extra code NOT commented out...

Oh my, how embarrassing... :)