Posted to dev@lucene.apache.org by Bogdan Ghidireac <bo...@ecstend.com> on 2009/11/20 13:21:56 UTC

IndexWriter.updateDocument performance improvement

Hi,

One of the use cases of my application involves updating the index
with 10 to 10k docs every few minutes. Because we maintain a PK for
each doc, we have to use IndexWriter.updateDocument to stay consistent.

The average time for an update when we commit every 10k docs is around
17ms (the IndexWriter buffer is 100MB). I profiled the application for
several hours and noticed that most of the time is spent in
IndexWriter.applyDeletes()->TermDocs.seek(). I changed
BufferedDeletes.terms from a HashMap to a TreeMap to keep the terms
ordered and reduce the number of random seeks on the disk.
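
To illustrate the idea, here is a minimal sketch, not the actual Lucene
change. It assumes plain String keys standing in for Lucene Term
objects and a placeholder Integer value; the class and method names
(BufferedDeletesSketch, visitOrder) are made up for this example. It
only shows that a TreeMap hands back its keys in sorted order, which is
the order the term dictionary is laid out in, while a HashMap hands
them back in hash order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: String keys stand in for Lucene Term objects and the
// Integer value is a placeholder for whatever is stored per term.
final class BufferedDeletesSketch {

    // Returns the order in which the buffered delete terms would be visited
    // when iterating the map, i.e. the order of the resulting term lookups.
    static List<String> visitOrder(Map<String, Integer> bufferedTerms) {
        return new ArrayList<>(bufferedTerms.keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> hashed = new HashMap<>();
        Map<String, Integer> sorted = new TreeMap<>();
        for (String pk : new String[] {"pk:0042", "pk:0007", "pk:0913", "pk:0100"}) {
            hashed.put(pk, 0);
            sorted.put(pk, 0);
        }
        // HashMap iteration order depends on hash codes, so consecutive
        // lookups can land anywhere in the sorted term dictionary.
        System.out.println("HashMap order: " + visitOrder(hashed));
        // TreeMap iterates in ascending key order, so each lookup moves
        // forward from the previous one.
        System.out.println("TreeMap order: " + visitOrder(sorted));
    }
}
```

The only difference between the two maps is the iteration order; the
set of terms to delete is the same either way.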

I ran my tests again with the patched Lucene 2.9.1 and the time
dropped from 17ms to 2ms. The index is 18GB and holds 70 million docs.

I cannot send a patch because my company has some strict and
time-consuming policies about open source, but the change is small and
can be applied easily.

Regards,
Bogdan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: IndexWriter.updateDocument performance improvement

Posted by Michael McCandless <lu...@mikemccandless.com>.
Opened LUCENE-2086.

Mike


Re: IndexWriter.updateDocument performance improvement

Posted by Michael McCandless <lu...@mikemccandless.com>.
+1

I'll open an issue.

Mike


Re: IndexWriter.updateDocument performance improvement

Posted by Yonik Seeley <yo...@lucidimagination.com>.
Thanks Bogdan, I've been meaning to bring this up.
Solr used a TreeMap in the past (when it handled its own deletes) for
the exact same reason.  In my profiling, I've also seen applyDeletes()
taking the bulk of the time with small/simple document indexing.

So we should definitely go in sorted order (either via a TreeMap or by
sorting the HashMap's keys).
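
The second option could look roughly like the sketch below. This is
only an illustration under assumptions: String keys stand in for Lucene
Term objects, the Integer value is a placeholder, and the names
SortAtApplySketch and sortedTerms are hypothetical, not part of Lucene.
The map stays a HashMap while deletes are buffered, and the keys are
copied and sorted once, just before the deletes are applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: keep the HashMap for buffering, sort the keys at apply time.
final class SortAtApplySketch {

    // Copies the buffered terms into a list and sorts it, so the caller can
    // walk the terms in ascending order without changing the map type.
    static List<String> sortedTerms(Map<String, Integer> bufferedTerms) {
        List<String> terms = new ArrayList<>(bufferedTerms.keySet());
        Collections.sort(terms);
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> buffered = new HashMap<>();
        for (String pk : new String[] {"pk:0042", "pk:0007", "pk:0913", "pk:0100"}) {
            buffered.put(pk, 0);
        }
        // Each term would be looked up in this order when deletes are applied.
        for (String term : sortedTerms(buffered)) {
            System.out.println("apply delete for " + term);
        }
    }
}
```

The trade-off between the two: a TreeMap pays O(log n) on every
buffered delete, while sorting at apply time keeps O(1) inserts and
pays O(n log n) once per flush.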

-Yonik
http://www.lucidimagination.com
