Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/05/16 08:15:08 UTC
[Performance]: IndexWriter again...
Ok, I'm just following up on my email from 29th April titled
'[Performanc]' (don't you love it when you send before you've typed
your subject line completely). The thread is here:
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200504.mbox/%3C427198C5.5040408@aconex.com%3E
In summary, I still firmly believe that the
IndexWriter.maybeMergeSegments() is chewing a lot more CPU than would
be ideal. So I ran a simple test. I ran the same test I've done
before, using mergeFactor(1000), maxBufferedDocs(10000),
useCompoundFile(false), indexing 5 fields (user first/lastname/email
address).
As a baseline using the latest SVN source code, I'm getting an
indexing rate of between 490 and 515 items/second over a number of runs.
By applying the attached simple patch to IndexWriter, I'm getting
between 945 and 970 items/second over a number of test runs. That's a
significant speed-up. All the patch is doing is deferring the call to
maybeMergeSegments so it only does it every 2000 iterations (2000 is
totally arbitrary on my part).
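
A minimal sketch of the deferral idea, since the patch itself is not reproduced in this thread. This is a toy model, not Lucene's IndexWriter: ToyWriter, checkInterval and segmentsVisited are hypothetical names, and the "merge check" is assumed to cost one visit per buffered single-document segment. It only illustrates why running the check every N adds instead of on every add cuts the scanning work.

```java
// Toy model only: ToyWriter, checkInterval and segmentsVisited are
// hypothetical names for illustration, not Lucene's IndexWriter API.
final class DeferredMergeSketch {

    /** Stand-in writer whose merge check walks every buffered segment. */
    static final class ToyWriter {
        private final int checkInterval;   // assumed knob: run the check every N adds
        private final int maxBufferedDocs; // flush once this many docs are buffered
        private int bufferedSegments = 0;  // one single-document segment per add
        private int addsSinceCheck = 0;
        long segmentsVisited = 0;          // total cost of all simulated scans
        int flushes = 0;
        int flushedDocs = 0;

        ToyWriter(int checkInterval, int maxBufferedDocs) {
            this.checkInterval = checkInterval;
            this.maxBufferedDocs = maxBufferedDocs;
        }

        void addDocument() {
            bufferedSegments++;
            // Deferral: skip the expensive check on most adds.
            if (++addsSinceCheck >= checkInterval) {
                addsSinceCheck = 0;
                maybeMergeSegments();
            }
        }

        /** Simulates a linear walk over the buffered segments, then flushes if full. */
        void maybeMergeSegments() {
            segmentsVisited += bufferedSegments;
            if (bufferedSegments >= maxBufferedDocs) {
                flush();
            }
        }

        /** Flushes whatever is still buffered so no documents are lost. */
        void close() {
            if (bufferedSegments > 0) {
                flush();
            }
        }

        private void flush() {
            flushedDocs += bufferedSegments;
            bufferedSegments = 0;
            flushes++;
        }
    }

    /** Indexes docCount toy documents and returns the writer for inspection. */
    static ToyWriter run(int checkInterval, int maxBufferedDocs, int docCount) {
        ToyWriter writer = new ToyWriter(checkInterval, maxBufferedDocs);
        for (int i = 0; i < docCount; i++) {
            writer.addDocument();
        }
        writer.close();
        return writer;
    }

    public static void main(String[] args) {
        ToyWriter eager = run(1, 10_000, 20_000);
        ToyWriter deferred = run(2_000, 10_000, 20_000);
        System.out.println("eager:    visited=" + eager.segmentsVisited + " docs=" + eager.flushedDocs);
        System.out.println("deferred: visited=" + deferred.segmentsVisited + " docs=" + deferred.flushedDocs);
    }
}
```

Both runs flush the same number of documents; only the amount of scanning differs. One caveat the toy model makes visible: if the interval does not divide maxBufferedDocs evenly, the buffer can overshoot the threshold before the next check runs, so a real change would need to account for the extra memory.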
I've verified with Luke that the index generated contains the same #
documents, and same # terms, but I have not had a chance to properly
set up my local environment to run the test cases.
Obviously the attached patch is a dirty hack of the highest order. In
my case I'm re-indexing from scratch every time, so there may be a
reason why we shouldn't be doing this sort of deferring of method
calls. Perhaps the source code is optimized around incremental/batch
updates to _existing_ indexes, with the penalty that creating a new
index from scratch performs slower than one would like.
Perhaps IndexWriter could benefit from another setting that lets one
configure how often to call maybeMergeSegments()? That could of
course confuse more people than it helps.
I would really appreciate anyone's thoughts on this. I'll be very
happy to be proven wrong because it will just help me understand more
of Lucene. I would hope that speeding up indexing would benefit
everyone? Particularly the large scale sites out there.
cheers,
Paul Smith

Re: [Performance]: IndexWriter again...
Posted by Yonik Seeley <ys...@gmail.com>.
I like the idea, Paul.
As far as how it should be implemented, perhaps a count of docs in
memory should be kept. It doesn't seem necessary to traverse all of
the segments on every add (it's a linear operation, and will only
result in a merge every "minMergeDocs" or "maxBufferedDocs").
-Yonik
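
A rough sketch of the counter suggested above, under stated assumptions: CountingWriter and its fields are hypothetical names, not Lucene's implementation, and the merge/flush step is reduced to resetting the counter. The point is only that a constant-time comparison on each add can replace a walk over all segments, with the expensive step running once per maxBufferedDocs documents.

```java
// Hypothetical illustration of "keep a count of docs in memory";
// none of these names are taken from Lucene's IndexWriter.
final class BufferedDocCounterSketch {

    static final class CountingWriter {
        private final int maxBufferedDocs;
        private int bufferedDocs = 0; // docs currently held in memory
        int mergeChecks = 0;          // how often the expensive step ran
        int flushedDocs = 0;

        CountingWriter(int maxBufferedDocs) {
            this.maxBufferedDocs = maxBufferedDocs;
        }

        void addDocument() {
            bufferedDocs++;
            // Constant-time test instead of traversing every segment per add.
            if (bufferedDocs >= maxBufferedDocs) {
                maybeMergeSegments();
            }
        }

        /** Stand-in for the real merge: here it just flushes the buffer. */
        void maybeMergeSegments() {
            mergeChecks++;
            flushedDocs += bufferedDocs;
            bufferedDocs = 0;
        }

        /** Flushes any remainder below the threshold. */
        void close() {
            if (bufferedDocs > 0) {
                maybeMergeSegments();
            }
        }
    }

    /** Adds docCount toy documents and returns the writer for inspection. */
    static CountingWriter run(int maxBufferedDocs, int docCount) {
        CountingWriter writer = new CountingWriter(maxBufferedDocs);
        for (int i = 0; i < docCount; i++) {
            writer.addDocument();
        }
        writer.close();
        return writer;
    }

    public static void main(String[] args) {
        CountingWriter writer = run(10_000, 25_000);
        System.out.println("merge checks=" + writer.mergeChecks + " docs=" + writer.flushedDocs);
    }
}
```

Unlike a fixed "every N adds" interval, the counter never lets the buffer exceed maxBufferedDocs, which is one reason it may be the cleaner design.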
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance]: IndexWriter again...
Posted by Paul Smith <ps...@aconex.com>.
On 16/05/2005, at 5:00 PM, Paul Elschot wrote:
> On Monday 16 May 2005 08:24, Paul Smith wrote:
>
>> something very odd is going on with my attachments... sorry for the
>> spam.
>>
>>
> It's usually easier to open a bug in bugzilla and post the code and
> the concerns there. The only disadvantage of bugzilla is that
> you can only add an attachment after the bug is opened for the first
> time:
> http://issues.apache.org/bugzilla/enter_bug.cgi
>
Thanks Paul. I'm not sure why subsequent attempts are still stripping
the attachment. I'll go ahead and file something in bugzilla, and
cross my fingers I don't look any sillier than I do now.
cheers,
Paul Smith
Re: [Performance]: IndexWriter again...
Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 16 May 2005 08:24, Paul Smith wrote:
> something very odd is going on with my attachments... sorry for the
> spam.
>
It's usually easier to open a bug in bugzilla and post the code and
the concerns there. The only disadvantage of bugzilla is that
you can only add an attachment after the bug is opened for the first time:
http://issues.apache.org/bugzilla/enter_bug.cgi
Regards,
Paul Elschot
Re: [Performance]: IndexWriter again...
Posted by Paul Smith <ps...@aconex.com>.
something very odd is going on with my attachments... sorry for the
spam.
Re: [Performance]: IndexWriter again...
Posted by Paul Smith <ps...@aconex.com>.
I'm not even going to say anything this time.... :-$
On 16/05/2005, at 4:17 PM, Paul Smith wrote:
> Silly me, here's the patch with the extra code NOT commented out...
>
> Oh my, how embarrassing... :)
>
>
>
> Paul
>
> On 16/05/2005, at 4:15 PM, Paul Smith wrote:
>
>
>> <IndexWriter.patch>
>>
Re: [Performance]: IndexWriter again...
Posted by Paul Smith <ps...@aconex.com>.
Silly me, here's the patch with the extra code NOT commented out...
Oh my, how embarrassing... :)