Posted to java-user@lucene.apache.org by Ivan Brusic <iv...@brusic.com> on 2012/04/05 20:22:07 UTC

Slow merging after upgrading to 3.5

I recently migrated a legacy Lucene application from 2.3 to 3.5. The
code was filled with numerous custom
filters/analyzers/similarities/collectors. It took about a week to
convert all the token streams to the new API and remove deprecated
classes.
Most importantly, there is a collector that enables faceting, which I
suspect might be taken from Solr (never looked into the Solr source
code).

The index is built as a batch process with no searchers using it. The
index contains 30+ million documents for a total size of around 45 GB. The
bulk of the indexing time is spent in the database calls. The build time
using Lucene 2.3 was around 10 hours.

The code has a collector similar to TimeLimitingCollector (sadly,
there is a ton of custom built code) which collects documents until it
reaches a limit. The way the current index is created, it is essential
that the most important documents (based on business rules) exist at
the beginning of the index (insertion order) to ensure that they appear
even if the collector times out. The first issue we noticed is that
this distribution (which I admit is a hack) is no longer "correct"
using the default TieredMergePolicy. We switched back to the existing
setup of LogByteSizeMergePolicy with a merge factor of 2. I am assuming
the low merge factor is responsible for creating indices that respect
the insertion order of documents. Documents are now in the correct
order, but an optimize (aka forceMerge(1)) takes around 5 hours, where
previously there was no slowdown. If we remove the forceMerge, the
commit takes just as long.
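
To make the dependency on document order concrete, here is a minimal
sketch in plain Java (no Lucene classes). It assumes every document
matches the query and replaces the time limit with a fixed budget of
visited documents. The class, the method importantCollected, its
parameters, and the sample layouts are all hypothetical and are not
taken from the application.

```java
class EarlyTerminationSketch {

    /**
     * Counts how many "important" documents a budget-limited scan sees.
     * insertionOrdinalByDocId[d] is the insertion position of the document
     * stored at docID d; the first importantCount inserted documents are
     * the important ones. The scan visits docIDs 0, 1, 2, ... and stops
     * after `budget` documents (a stand-in for the time limit).
     */
    static int importantCollected(int[] insertionOrdinalByDocId,
                                  int importantCount, int budget) {
        int found = 0;
        for (int docId = 0;
             docId < insertionOrdinalByDocId.length && docId < budget;
             docId++) {
            if (insertionOrdinalByDocId[docId] < importantCount) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        int[] insertionOrder = {0, 1, 2, 3, 4, 5, 6, 7}; // docID == insertion position
        int[] reordered      = {4, 5, 6, 7, 0, 1, 2, 3}; // later documents moved to the front

        System.out.println(importantCollected(insertionOrder, 3, 4)); // 3
        System.out.println(importantCollected(reordered, 3, 4));      // 0
    }
}
```

With the same budget, the scan sees all three important documents when
docIDs follow insertion order and none of them once the segments have
been reordered.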

It is difficult to iterate through different settings since waiting
14-15 hours between tests to see the results is too long. What is the
best way to create an optimized index that places documents based on
insertion order at the beginning? The answer should be to write better
queries, but none of the authors of this legacy jumbled code base are
around and we want to avoid rocking the boat on the query side since
the existing search results are satisfactory.

Cheers,

Ivan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Slow merging after upgrading to 3.5

Posted by Michael McCandless <lu...@mikemccandless.com>.
Super, thanks for bringing closure!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Apr 18, 2012 at 5:33 PM, Ivan Brusic <iv...@brusic.com> wrote:
> Just wanted to circle back and report on our progress.
>
> We finally applied the settings to our production environment and the
> improvements have been dramatic. Our indexing time has returned to 2.3
> levels.
>
> Thanks again,
>
> Ivan


Re: Slow merging after upgrading to 3.5

Posted by Ivan Brusic <iv...@brusic.com>.
Just wanted to circle back and report on our progress.

We finally applied the settings to our production environment and the
improvements have been dramatic. Our indexing time has returned to 2.3
levels.

Thanks again,

Ivan

On Fri, Apr 6, 2012 at 11:36 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:


Re: Slow merging after upgrading to 3.5

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Apr 5, 2012 at 3:31 PM, Ivan Brusic <iv...@brusic.com> wrote:

> On Thu, Apr 5, 2012 at 11:36 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> I'm assuming this is a "build once and never change" index...?  Else,
>> it sounds like you should never run forceMerge...
>
> Correct. The forceMerge was merely to preserve the previous 2.3
> behavior of using optimize.

OK.  Avoid it, unless you can't...

>> To preserve insertion order you just need to use one of the
>> Log*MergePolicy (which you are already doing).  Merge factor doesn't
>> affect this...
>
> I was never sure why the merge factor was set to 2. My experience in
> the past was to set a high merge factor when doing a batch index.

Well, it's not entirely clear... you'd have to test in your env to be sure.

My instinct is to use a large (maybe infinite) MF while indexing, and
then big MF while forceMerge'ing.

>> For the fastest way to get to a single-segment index.... use
>> NoMergePolicy while indexing the documents, and set the largest RAM
>> buffer you can afford.  This will create tons of segments in the index
>> dir, which is fine as long as you will not open a reader on it...
>> then:
>>
>> Open a new IW, with Log*MergePolicy, set a highish (maybe 30)
>> mergeFactor, and call forceMerge(1).  You may need to cut over to
>> SerialMergeScheduler...
>
> NoMergePolicy? Never seen that class used before.

It's like Log*MP with infinite mergeFactor...

> RAM buffer size is
> not an issue. Is the limitation still 2048MB?

Yes.

> Is the fastest way also the best way? :) There will never be a reader
> open on the index. Your second solution is similar to the existing
> code with the exception of the mergeFactor. Will setting the merge
> factor to a more reasonable number help with the merge speed?

I think you'd have to test in your env.

A non-infinite MF is good in that it gets some merges out of the way
before the end, i.e., you can soak up some otherwise unused IO
resources/concurrency while you are indexing... making it less
work/time to forceMerge in the end.
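
A toy cost model can put rough numbers on the total merge work. The
sketch below is plain Java with no Lucene dependency, and everything
in it is an assumption made for illustration: each flush produces a
segment of size 1, merges fire as soon as mergeFactor neighboring
segments share a level, and the final forceMerge(1) is a single pass.
It counts only how much data merges copy. It says nothing about the
overlap with idle IO described above, which is why testing in the real
environment still matters.

```java
import java.util.ArrayList;
import java.util.List;

class MergeCostSketch {

    /**
     * Toy model of a log-style merge cascade. Every flush adds one segment
     * of size 1. Whenever the newest `mergeFactor` segments share a level,
     * they are merged into one segment on the next level. A final
     * forceMerge(1) is modeled as a single pass over whatever is left.
     * Returns the total number of segment units copied by merges.
     */
    static long unitsCopied(int flushedSegments, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<long[]> segments = new ArrayList<long[]>(); // each entry: {level, size}
        long copied = 0;
        for (int i = 0; i < flushedSegments; i++) {
            segments.add(new long[] {0, 1});
            while (newestShareLevel(segments, mergeFactor)) {
                long level = segments.get(segments.size() - 1)[0];
                long size = 0;
                for (int k = 0; k < mergeFactor; k++) {
                    size += segments.remove(segments.size() - 1)[1];
                }
                copied += size;
                segments.add(new long[] {level + 1, size});
            }
        }
        if (segments.size() > 1) { // final forceMerge(1), one pass
            for (long[] segment : segments) {
                copied += segment[1];
            }
        }
        return copied;
    }

    /** True if at least mergeFactor segments exist and the newest ones share a level. */
    private static boolean newestShareLevel(List<long[]> segments, int mergeFactor) {
        if (segments.size() < mergeFactor) {
            return false;
        }
        long level = segments.get(segments.size() - 1)[0];
        for (int k = 2; k <= mergeFactor; k++) {
            if (segments.get(segments.size() - k)[0] != level) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(unitsCopied(1024, 2));                 // 10240
        System.out.println(unitsCopied(1024, 32));                // 2048
        System.out.println(unitsCopied(1024, Integer.MAX_VALUE)); // 1024
    }
}
```

In this model, 1024 flushed segments cost 10240 copied units with a
merge factor of 2, 2048 with a merge factor of 32, and 1024 when no
merging happens until the end.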

> What enforces the preservation of the insertion order? The
> MergePolicy?

MergePolicy does.

Though, in 4.0, it's also important you use only 1 thread for
indexing.   Prior to 4.0, docIDs were assigned in arrival order,
across threads, but with 4.0, each thread gets a private segment, so
the docIDs are jumbled.
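
The effect of per-thread segments can be imitated without real threads.
This is a sketch under stated assumptions, not Lucene code: documents
are handed to the indexing threads round-robin, and the private
segments are flushed in thread order. Actual scheduling and flush order
will differ; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DocIdAssignmentSketch {

    /**
     * Returns arrival numbers in docID order when each of `threads`
     * indexing threads buffers its documents in a private segment.
     * Assumptions of this toy model: documents are handed to threads
     * round-robin, and the private segments are flushed in thread order.
     */
    static List<Integer> perThreadSegments(int docs, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        List<List<Integer>> privateSegments = new ArrayList<List<Integer>>();
        for (int t = 0; t < threads; t++) {
            privateSegments.add(new ArrayList<Integer>());
        }
        for (int arrival = 0; arrival < docs; arrival++) {
            privateSegments.get(arrival % threads).add(arrival);
        }
        // docIDs follow segment order, so concatenate the private segments.
        List<Integer> docIdOrder = new ArrayList<Integer>();
        for (List<Integer> segment : privateSegments) {
            docIdOrder.addAll(segment);
        }
        return docIdOrder;
    }

    public static void main(String[] args) {
        System.out.println(perThreadSegments(6, 1)); // [0, 1, 2, 3, 4, 5]
        System.out.println(perThreadSegments(6, 2)); // [0, 2, 4, 1, 3, 5]
    }
}
```

With one thread the docIDs match arrival order; with two, each private
segment is internally ordered but the index as a whole is not.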

> How does the MergeScheduler affect things?

It shouldn't affect docID order.

> Used Lucene
> on a few projects over the years and I never had to tweak the index
> creation.

The defaults normally work well... but docID assignment is an impl
detail and is free to change across releases...

> I guess I need to reread the tuning chapter in LIA, it's
> been a few years.

;)

Mike McCandless

http://blog.mikemccandless.com


Re: Slow merging after upgrading to 3.5

Posted by Ivan Brusic <iv...@brusic.com>.
Hi Mike,

Response inline:

On Thu, Apr 5, 2012 at 11:36 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> I'm assuming this is a "build once and never change" index...?  Else,
> it sounds like you should never run forceMerge...

Correct. The forceMerge was merely to preserve the previous 2.3
behavior of using optimize.

> To preserve insertion order you just need to use one of the
> Log*MergePolicy (which you are already doing).  Merge factor doesn't
> affect this...

I was never sure why the merge factor was set to 2. My experience in
the past was to set a high merge factor when doing a batch index.

> For the fastest way to get to a single-segment index.... use
> NoMergePolicy while indexing the documents, and set the largest RAM
> buffer you can afford.  This will create tons of segments in the index
> dir, which is fine as long as you will not open a reader on it...
> then:
>
> Open a new IW, with Log*MergePolicy, set a highish (maybe 30)
> mergeFactor, and call forceMerge(1).  You may need to cut over to
> SerialMergeScheduler...

NoMergePolicy? Never seen that class used before. RAM buffer size is
not an issue. Is the limitation still 2048MB?

Is the fastest way also the best way? :) There will never be a reader
open on the index. Your second solution is similar to the existing
code with the exception of the mergeFactor. Will setting the merge
factor to a more reasonable number help with the merge speed?

What enforces the preservation of the insertion order? The
MergePolicy? How does the MergeScheduler affect things?  Used Lucene
on a few projects over the years and I never had to tweak the index
creation. I guess I need to reread the tuning chapter in LIA, it's
been a few years.

Cheers,

Ivan


Re: Slow merging after upgrading to 3.5

Posted by Michael McCandless <lu...@mikemccandless.com>.
I'm assuming this is a "build once and never change" index...?  Else,
it sounds like you should never run forceMerge...

To preserve insertion order you just need to use one of the
Log*MergePolicy (which you are already doing).  Merge factor doesn't
affect this...
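
Why adjacency matters and merge factor does not can be seen in a small
simulation. The sketch below is plain Java, not Lucene code, and its
names are hypothetical. It models segments as lists of insertion
numbers. Merging only neighboring segments keeps the overall order for
any merge factor, while a merge of non-neighbors (which
TieredMergePolicy is allowed to do) does not. Placing the merged
segment where its first input was is an assumption of this toy model.

```java
import java.util.ArrayList;
import java.util.List;

class MergeOrderSketch {

    /** One single-document segment per inserted document: [0], [1], ... */
    static List<List<Integer>> singleDocSegments(int docs) {
        List<List<Integer>> segments = new ArrayList<List<Integer>>();
        for (int doc = 0; doc < docs; doc++) {
            List<Integer> segment = new ArrayList<Integer>();
            segment.add(doc);
            segments.add(segment);
        }
        return segments;
    }

    /** Repeatedly merges a run of up to mergeFactor neighboring segments. */
    static List<Integer> afterAdjacentMerges(int docs, int mergeFactor) {
        if (docs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need docs >= 1 and mergeFactor >= 2");
        }
        List<List<Integer>> segments = singleDocSegments(docs);
        while (segments.size() > 1) {
            int count = Math.min(mergeFactor, segments.size());
            List<Integer> merged = new ArrayList<Integer>();
            for (int i = 0; i < count; i++) {
                merged.addAll(segments.remove(0)); // leftmost neighbors, in order
            }
            segments.add(0, merged); // result takes the place of its inputs
        }
        return segments.get(0);
    }

    /** Merges the first and last segment, which are not neighbors. */
    static List<Integer> afterNonAdjacentMerge(int docs) {
        if (docs < 3) {
            throw new IllegalArgumentException("need docs >= 3");
        }
        List<List<Integer>> segments = singleDocSegments(docs);
        List<Integer> merged = new ArrayList<Integer>(segments.get(0));
        merged.addAll(segments.remove(segments.size() - 1));
        segments.set(0, merged); // assumption: result sits where the first input was
        List<Integer> docIdOrder = new ArrayList<Integer>();
        for (List<Integer> segment : segments) {
            docIdOrder.addAll(segment);
        }
        return docIdOrder;
    }

    public static void main(String[] args) {
        System.out.println(afterAdjacentMerges(6, 2));  // [0, 1, 2, 3, 4, 5]
        System.out.println(afterAdjacentMerges(6, 30)); // [0, 1, 2, 3, 4, 5]
        System.out.println(afterNonAdjacentMerge(6));   // [0, 5, 1, 2, 3, 4]
    }
}
```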

For the fastest way to get to a single-segment index.... use
NoMergePolicy while indexing the documents, and set the largest RAM
buffer you can afford.  This will create tons of segments in the index
dir, which is fine as long as you will not open a reader on it...
then:

Open a new IW, with Log*MergePolicy, set a highish (maybe 30)
mergeFactor, and call forceMerge(1).  You may need to cut over to
SerialMergeScheduler...

Mike McCandless

http://blog.mikemccandless.com
