Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2011/05/02 15:03:53 UTC

MergePolicy Thresholds

Hi

Today, LogMP allows you to set different thresholds for segment sizes,
thereby allowing you to control the largest segment that will be
considered for merge + the largest segment your index will hold (=~
threshold * mergeFactor).

So, if you want to end up w/ say 20GB segments, you can set
maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.

However, this often does not achieve your desired goal -- if the index
contains 5 and 7 GB segments, they will never be merged b/c they are
bigger than the threshold. I am willing to spend the CPU and IO resources
to end up w/ 20 GB segments, whether I'm merging 10 segments together or
only 2. After I reach a 20GB segment, it can rest peacefully, at least
until I increase the threshold.
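
To make the arithmetic above concrete, here is a minimal sketch in plain
Java. It does not call Lucene; the class and method names are hypothetical,
and the eligibility rule is a simplification that assumes a segment is a
merge candidate only when its size is at or below maxMergeMB (whether the
real boundary is inclusive is an assumption here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: this is not LogMergePolicy's real code.
public class LogMpThresholdSketch {

    // Approximate largest segment the index will hold: threshold * mergeFactor.
    public static long approxLargestSegmentMB(long maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    // Assumed simplification: segments above the threshold are never merge candidates.
    public static List<Long> eligibleForMerge(List<Long> segmentSizesMB, long maxMergeMB) {
        List<Long> eligible = new ArrayList<>();
        for (long size : segmentSizesMB) {
            if (size <= maxMergeMB) {
                eligible.add(size);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        long maxMergeMB = 2L * 1024; // 2GB threshold
        int mergeFactor = 10;

        // Prints 20480 (MB), i.e. roughly 20GB.
        System.out.println(approxLargestSegmentMB(maxMergeMB, mergeFactor));

        // The 5GB and 7GB segments exceed the threshold, so only 512 is printed.
        List<Long> sizes = Arrays.asList(5L * 1024, 7L * 1024, 512L);
        System.out.println(eligibleForMerge(sizes, maxMergeMB));
    }
}
```

Under these assumptions the 5GB and 7GB segments stay untouched even though
merging them would still leave the result below the 20GB target.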

So I wonder, first, if this threshold (i.e., the largest segment size you
would like to end up with) is more natural to set than the current
thresholds, from the application level? I.e., wouldn't it be a simpler
threshold to set, instead of doing weird calculations that depend on
maxMergeMB(ForOptimize) and mergeFactor?

Second, should this be an addition to LogMP, or a different type of MP,
one that adheres to only those two factors (perhaps the segSize threshold
should be allowed to be set differently for optimize and regular merges)?
It can pick segments for merge such that it maximizes the result segment
size (i.e., it doesn't necessarily merge in sequential order), without
merging more than mergeFactor segments at once.
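
One way such a selection could look is sketched below in plain Java. This
is not code from LogMP or BalancedMP: the class, the method and the
parameter name maxResultSegmentSizeMB (borrowed from this mail) are
hypothetical, and the selection is a simple greedy heuristic, not a true
maximization of the result size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the proposed policy; none of these names exist in Lucene.
public class MaxResultSizeSketch {

    // Greedy heuristic: consider the largest segments first, skip any segment
    // that would push the merged size past the cap, and stop at mergeFactor segments.
    public static List<Long> pickMerge(List<Long> segmentSizesMB,
                                       long maxResultSegmentSizeMB,
                                       int mergeFactor) {
        List<Long> sorted = new ArrayList<>(segmentSizesMB);
        sorted.sort(Collections.reverseOrder());

        List<Long> picked = new ArrayList<>();
        long total = 0;
        for (long size : sorted) {
            if (picked.size() == mergeFactor) {
                break;
            }
            // Segments already at the cap are left alone ("rest peacefully").
            if (size >= maxResultSegmentSizeMB) {
                continue;
            }
            if (total + size <= maxResultSegmentSizeMB) {
                picked.add(size);
                total += size;
            }
        }
        // A merge needs at least two segments; otherwise there is nothing to do.
        return picked.size() >= 2 ? picked : Collections.<Long>emptyList();
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(5L * 1024, 7L * 1024, 20L * 1024, 512L);
        // Prints [7168, 5120, 512]: the 20GB segment is skipped, the rest fit under the cap.
        System.out.println(pickMerge(sizes, 20L * 1024, 10));
    }
}
```

With a 20GB cap, the 5GB and 7GB segments from the earlier example are
merged (non-sequentially if needed), while a segment that already reached
the cap is never picked again.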

I guess, if we think that maxResultSegmentSizeMB is more intuitive than
the current thresholds, application-wise, then this change should go
into LogMP. Otherwise, it feels like a different MP is needed, because
LogMP is already complicated and another threshold would confuse things.

What do you think of this? Am I trying to optimize too much? :)

Shai

Re: MergePolicy Thresholds

Posted by Earwin Burrfoot <ea...@gmail.com>.
>> The problem is - each person needs his own set of knobs (or thinks he
>> needs them) for MergePolicy, and I can't call any of these sets
>> superior to others :/
>
> I agree. I wonder though if the knobs we give on LogMP are intuitive enough.
>
>> It neatly avoids uber-merges
>
> I didn't see that I can define what "uber-merge" is, right? Can I tell it to
> stop merging segments of some size? E.g., if my index grew to 100 segments,
> 40GB each, I don't think that merging 10 40GB segments (to create a 400GB
> segment) is going to speed up my search, for instance. A 40GB segment
> (probably much less) is already big enough to not be touched anymore.
No, you can't. But you can tell it to have exactly (not 'at most') N
top-tier segments and try to keep their sizes close with merges,
whatever that size may be.
And this is exactly what I want. Defining a max cap on segment size
is not what I want.
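
A small sketch of why a fixed segment count needs no retuning, in plain
Java. The class and method are hypothetical and do not reflect
BalancedSegmentMergePolicy's internals; the sketch also assumes, for
simplicity, that the whole index lives in the top tier.

```java
// Hypothetical illustration: with a fixed number of top-tier segments,
// the per-segment target size simply follows the total index size.
public class TopTierTargetSketch {

    public static long targetTopTierSizeMB(long totalIndexSizeMB, int numLargeSegments) {
        return totalIndexSizeMB / numLargeSegments;
    }

    public static void main(String[] args) {
        int numLargeSegments = 10;
        // 400GB index: prints 40960 (MB per top-tier segment).
        System.out.println(targetTopTierSizeMB(400L * 1024, numLargeSegments));
        // 4000GB index, same setting: prints 409600.
        System.out.println(targetTopTierSizeMB(4000L * 1024, numLargeSegments));
    }
}
```

The knob stays at 10 in both cases; a size cap, by contrast, would have to
be revisited as the index grows.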

So the same set of knobs can be intuitive and meaningful for one
person, and useless for another. And you can't pick the "best" one.

> Will BalancedMP stop merging such segments (if all segments are of that
> order of magnitude)?
>
> Shai



-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
>
> The problem is - each person needs his own set of knobs (or thinks he
> needs them) for MergePolicy, and I can't call any of these sets
> superior to others :/
>

I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

> It neatly avoids uber-merges

I didn't see that I can define what "uber-merge" is, right? Can I tell it to
stop merging segments of some size? E.g., if my index grew to 100 segments,
40GB each, I don't think that merging 10 40GB segments (to create a 400GB
segment) is going to speed up my search, for instance. A 40GB segment
(probably much less) is already big enough to not be touched anymore.

Will BalancedMP stop merging such segments (if all segments are of that
order of magnitude)?

Shai

On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot <ea...@gmail.com> wrote:

> Dunno, I'm quite happy with numLargeSegments (you critically
> misspelled it). It neatly avoids uber-merges, keeps the number of
> segments at bay, and does not require recalculating thresholds when
> my expected index size changes.
>
> The problem is - each person needs his own set of knobs (or thinks he
> needs them) for MergePolicy, and I can't call any of these sets
> superior to others :/

Re: MergePolicy Thresholds

Posted by Earwin Burrfoot <ea...@gmail.com>.
Dunno, I'm quite happy with numLargeSegments (you critically
misspelled it). It neatly avoids uber-merges, keeps the number of
segments at bay, and does not require recalculating thresholds when
my expected index size changes.

The problem is - each person needs his own set of knobs (or thinks he
needs them) for MergePolicy, and I can't call any of these sets
superior to others :/

2011/5/2 Shai Erera <se...@gmail.com>:
> I did look at it, but I didn't find that it answers this particular need
> (ending with a segment no bigger than X). Perhaps by tweaking several
> parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
> something, but it's not very clear what the right combination is.
>
> Which is related to one of the points -- is it not more intuitive for an app
> to set this threshold (if it needs any thresholds), than tweaking all of
> those parameters? If so, then we only need two thresholds (size +
> mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
> (perhaps w/ some adaptations) to derive a merge plan.
>
> Shai



-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
I did look at it, but I didn't find that it answers this particular need
(ending with a segment no bigger than X). Perhaps by tweaking several
parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
something, but it's not very clear what the right combination is.

Which is related to one of the points -- is it not more intuitive for an app
to set this threshold (if it needs any thresholds), than tweaking all of
those parameters? If so, then we only need two thresholds (size +
mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
(perhaps w/ some adaptations) to derive a merge plan.

Shai

On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot <ea...@gmail.com> wrote:

> Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

Re: MergePolicy Thresholds

Posted by Earwin Burrfoot <ea...@gmail.com>.
Have you checked BalancedSegmentMergePolicy? It has some more knobs :)




-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785


Re: MergePolicy Thresholds

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks Tom!

Sounds like great fun working with such massive data sets :)

Mike

http://blog.mikemccandless.com

On Fri, May 20, 2011 at 7:03 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hi Mike and Shai,
>
>
>
> I was able to index a few documents with the tieredMergePolicy, but I was
> hoping to build a large test index of about 700,000 documents to compare the
> performance against our previous runs. I was hoping I would be able to
> report on my results in time for the Lucene Revolution conference.
> Unfortunately, there was a power outage at our data center last week, which
> resulted in a node failure in one of our storage nodes. Node rebalancing
> for a cluster of 500 terabytes takes quite a while and totally messes up
> performance measurements. (Our 6-8 terabytes of large scale search indexes
> share storage with the repository that holds the 480+ terabytes of page
> images and metadata for the 8 million+ books.) Hopefully I will be able to
> run the tests when I get back.
>
>
>
> Tom


RE: MergePolicy Thresholds

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Mike and Shai,

I was able to index a few documents with the tieredMergePolicy, but I was hoping to build a large test index of about 700,000 documents to compare the performance against our previous runs. I was hoping I would be able to report on my results in time for the Lucene Revolution conference. Unfortunately, there was a power outage at our data center last week, which resulted in a node failure in one of our storage nodes. Node rebalancing for a cluster of 500 terabytes takes quite a while and totally messes up performance measurements. (Our 6-8 terabytes of large scale search indexes share storage with the repository that holds the 480+ terabytes of page images and metadata for the 8 million+ books.) Hopefully I will be able to run the tests when I get back.

Tom

From: Burton-West, Tom [mailto:tburtonw@umich.edu]
Sent: Monday, May 09, 2011 4:10 PM
To: dev@lucene.apache.org
Subject: RE: MergePolicy Thresholds

Thanks again Shai and Mike.

I am in the process of downloading and building r1099998. I should be able to build a test index sometime this week. I'll make some guesses about which parameters to use based on our previous tests.

Tom

From: Shai Erera [mailto:serera@gmail.com]
Sent: Saturday, May 07, 2011 11:33 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai

On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom <tb...@umich.edu> wrote:

Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera <se...@gmail.com> wrote:
> I uploaded a patch to LUCENE-1076.
>
> Tom, apparently the patch I've attached before cannot be used, because there
> are dependencies (in earlier commits on LUCENE-1076) that need to be
> back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
> this new MP.
>
> Shai
>
> On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> That'd be great, thanks :)
>>
>> Yes, let's iterate on the issue!  But: it should still be open, I hope
>> (I didn't mean to close it yet, since it's not back ported)...
>>
>> Mike
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
>> > Mike, if you want, I can back-port it, as I've already started this when
>> > preparing the patch.
>> >
>> > I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok
>> > on 3x too? It'll be a backwards-incompatible change.
>> >
>> > Maybe we should iterate on the issue? I can reopen.
>> >
>> > Shai
>> >
>> > On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Looks good Shai!
>> >>
>> >> Comments below too:
>> >>
>> >> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Hi
>> >> >
>> >> > I looked into porting it to 3x, and prepared the attached patch. It
>> >> > only contains the new TieredMP and Test, as well as the necessary
>> >> > changes to LuceneTestCase and IndexWriter. I guess you can start with
>> >> > it (even just the MP and IW changes) to test it on your indexes.
>> >> >
>> >> > Mike, I saw that there were many more changes, as part of
>> >> > LUCENE-1076, done to the code. In particular, this MP is now the
>> >> > default (on trunk), so I guess many changes (to tests) were needed
>> >> > because of that. Do you remember whether, apart from the changes I've
>> >> > included in the patch, there were other important changes w.r.t. this
>> >> > code?
>> >>
>> >> The only other changes I can think of were some verbosity improvements
>> >> to IndexWriter, to support the python script that can make a merge
>> >> movie from an infoStream output; but that can wait for when I
>> >> back-port to 3.x...
>> >>
>> >> > As we won't change the default MP on 3x, I'm guessing I don't need to
>> >> > port all the changes to 3x.
>> >>
>> >> Right, I think.
>> >>
>> >> Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: MergePolicy Thresholds

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks again Shai and Mike.

I'm in the process of downloading and building r1099998, and should be able to build a test index sometime this week. I'll make some guesses about which parameters to use, based on our previous tests.

Tom

From: Shai Erera [mailto:serera@gmail.com]
Sent: Saturday, May 07, 2011 11:33 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai

On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom <tb...@umich.edu> wrote:
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera <se...@gmail.com> wrote:
> I uploaded a patch to LUCENE-1076.
>
> Tom, apparently the patch I've attached before cannot be used, because there
> are dependencies (in earlier commits on LUCENE-1076) that need to be
> back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
> this new MP.
>
> Shai
>
> On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> That'd be great, thanks :)
>>
>> Yes, let's iterate on the issue!  But: it should still be open, I hope
>> (I didn't mean to close it yet, since it's not back ported)...
>>
>> Mike
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
>> > Mike, if you want, I can back-port it, as I've already started this when
>> > preparing the patch.
>> >
>> > I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok
>> > on 3x too? It'll be a backwards change.
>> >
>> > Maybe we should iterate on the issue? I can reopen.
>> >
>> > Shai
>> >
>> > On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Looks good Shai!
>> >>
>> >> Comments below too:
>> >>
>> >> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Hi
>> >> >
>> >> > I looked into porting it to 3x, and prepared the attached patch. It
>> >> > only contains the new TieredMP and Test, as well as the necessary
>> >> > changes to LuceneTestCase and IndexWriter. I guess you can start with
>> >> > it (even just the MP and IW changes) to test it on your indexes.
>> >> >
>> >> > Mike, I saw that there were many more changes, as part of
>> >> > LUCENE-1076, done to the code. In particular, this MP is now the
>> >> > default (on trunk), so I guess many changes (to tests) were needed
>> >> > because of that. Do you remember, if apart from the changes I've
>> >> > included in the patch, other important changes w.r.t. this code?
>> >>
>> >> The only other changes I can think of were some verbosity improvements
>> >> to IndexWriter, to support the python script that can make a merge
>> >> movie from an infoStream output; but that can wait for when I
>> >> back-port to 3.x...
>> >>
>> >> > As we won't change the default MP on 3x, I'm guessing I don't need to
>> >> > port all the changes to 3x.
>> >>
>> >> Right, I think.
>> >>
>> >> Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai

On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom <tb...@umich.edu> wrote:

> Thanks Shai and Mike!
>
> I'll keep an eye on LUCENE-1076.
>
> Tom
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Tuesday, May 03, 2011 11:15 AM
> To: dev@lucene.apache.org
> Subject: Re: MergePolicy Thresholds
>
> Thanks Shai!
>
> I'm way behind on my 3.x backports -- I'll try to do this soon.
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Tue, May 3, 2011 at 8:10 AM, Shai Erera <se...@gmail.com> wrote:
> > I uploaded a patch to LUCENE-1076.
> >
> > Tom, apparently the patch I've attached before cannot be used, because
> > there are dependencies (in earlier commits on LUCENE-1076) that need to
> > be back-ported as well. So stay tuned on LUCENE-1076 for when it is safe
> > to use this new MP.
> >
> > Shai
> >
> > On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> That'd be great, thanks :)
> >>
> >> Yes, let's iterate on the issue!  But: it should still be open, I hope
> >> (I didn't mean to close it yet, since it's not back ported)...
> >>
> >> Mike
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
> >> > Mike, if you want, I can back-port it, as I've already started this
> >> > when preparing the patch.
> >> >
> >> > I noticed that you added a "throws IOE" to IW.setInfoStream -- is it
> >> > ok on 3x too? It'll be a backwards change.
> >> >
> >> > Maybe we should iterate on the issue? I can reopen.
> >> >
> >> > Shai
> >> >
> >> > On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
> >> > <lu...@mikemccandless.com> wrote:
> >> >>
> >> >> Looks good Shai!
> >> >>
> >> >> Comments below too:
> >> >>
> >> >> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
> >> >> > Hi
> >> >> >
> >> >> > I looked into porting it to 3x, and prepared the attached patch. It
> >> >> > only contains the new TieredMP and Test, as well as the necessary
> >> >> > changes to LuceneTestCase and IndexWriter. I guess you can start
> >> >> > with it (even just the MP and IW changes) to test it on your
> >> >> > indexes.
> >> >> >
> >> >> > Mike, I saw that there were many more changes, as part of
> >> >> > LUCENE-1076, done to the code. In particular, this MP is now the
> >> >> > default (on trunk), so I guess many changes (to tests) were needed
> >> >> > because of that. Do you remember, if apart from the changes I've
> >> >> > included in the patch, other important changes w.r.t. this code?
> >> >>
> >> >> The only other changes I can think of were some verbosity
> >> >> improvements to IndexWriter, to support the python script that can
> >> >> make a merge movie from an infoStream output; but that can wait for
> >> >> when I back-port to 3.x...
> >> >>
> >> >> > As we won't change the default MP on 3x, I'm guessing I don't need
> >> >> > to port all the changes to 3x.
> >> >>
> >> >> Right, I think.
> >> >>
> >> >> Mike
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>

RE: MergePolicy Thresholds

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera <se...@gmail.com> wrote:
> I uploaded a patch to LUCENE-1076.
>
> Tom, apparently the patch I've attached before cannot be used, because there
> are dependencies (in earlier commits on LUCENE-1076) that need to be
> back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
> this new MP.
>
> Shai
>
> On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> That'd be great, thanks :)
>>
>> Yes, let's iterate on the issue!  But: it should still be open, I hope
>> (I didn't mean to close it yet, since it's not back ported)...
>>
>> Mike
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
>> > Mike, if you want, I can back-port it, as I've already started this when
>> > preparing the patch.
>> >
>> > I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok
>> > on 3x too? It'll be a backwards change.
>> >
>> > Maybe we should iterate on the issue? I can reopen.
>> >
>> > Shai
>> >
>> > On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Looks good Shai!
>> >>
>> >> Comments below too:
>> >>
>> >> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Hi
>> >> >
>> >> > I looked into porting it to 3x, and prepared the attached patch. It
>> >> > only contains the new TieredMP and Test, as well as the necessary
>> >> > changes to LuceneTestCase and IndexWriter. I guess you can start with
>> >> > it (even just the MP and IW changes) to test it on your indexes.
>> >> >
>> >> > Mike, I saw that there were many more changes, as part of
>> >> > LUCENE-1076, done to the code. In particular, this MP is now the
>> >> > default (on trunk), so I guess many changes (to tests) were needed
>> >> > because of that. Do you remember, if apart from the changes I've
>> >> > included in the patch, other important changes w.r.t. this code?
>> >>
>> >> The only other changes I can think of were some verbosity improvements
>> >> to IndexWriter, to support the python script that can make a merge
>> >> movie from an infoStream output; but that can wait for when I
>> >> back-port to 3.x...
>> >>
>> >> > As we won't change the default MP on 3x, I'm guessing I don't need to
>> >> > port all the changes to 3x.
>> >>
>> >> Right, I think.
>> >>
>> >> Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera <se...@gmail.com> wrote:
> I uploaded a patch to LUCENE-1076.
>
> Tom, apparently the patch I've attached before cannot be used, because there
> are dependencies (in earlier commits on LUCENE-1076) that need to be
> back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
> this new MP.
>
> Shai
>
> On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> That'd be great, thanks :)
>>
>> Yes, let's iterate on the issue!  But: it should still be open, I hope
>> (I didn't mean to close it yet, since it's not back ported)...
>>
>> Mike
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
>> > Mike, if you want, I can back-port it, as I've already started this when
>> > preparing the patch.
>> >
>> > I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok
>> > on 3x too? It'll be a backwards change.
>> >
>> > Maybe we should iterate on the issue? I can reopen.
>> >
>> > Shai
>> >
>> > On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Looks good Shai!
>> >>
>> >> Comments below too:
>> >>
>> >> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Hi
>> >> >
>> >> > I looked into porting it to 3x, and prepared the attached patch. It
>> >> > only contains the new TieredMP and Test, as well as the necessary
>> >> > changes to LuceneTestCase and IndexWriter. I guess you can start with
>> >> > it (even just the MP and IW changes) to test it on your indexes.
>> >> >
>> >> > Mike, I saw that there were many more changes, as part of
>> >> > LUCENE-1076, done to the code. In particular, this MP is now the
>> >> > default (on trunk), so I guess many changes (to tests) were needed
>> >> > because of that. Do you remember, if apart from the changes I've
>> >> > included in the patch, other important changes w.r.t. this code?
>> >>
>> >> The only other changes I can think of were some verbosity improvements
>> >> to IndexWriter, to support the python script that can make a merge
>> >> movie from an infoStream output; but that can wait for when I
>> >> back-port to 3.x...
>> >>
>> >> > As we won't change the default MP on 3x, I'm guessing I don't need to
>> >> > port all the changes to 3x.
>> >>
>> >> Right, I think.
>> >>
>> >> Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
I uploaded a patch to LUCENE-1076.

Tom, apparently the patch I've attached before cannot be used, because there
are dependencies (in earlier commits on LUCENE-1076) that need to be
back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
this new MP.

Shai

On Tue, May 3, 2011 at 1:00 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> That'd be great, thanks :)
>
> Yes, let's iterate on the issue!  But: it should still be open, I hope
> (I didn't mean to close it yet, since it's not back ported)...
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
> > Mike, if you want, I can back-port it, as I've already started this when
> > preparing the patch.
> >
> > I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok
> > on 3x too? It'll be a backwards change.
> >
> > Maybe we should iterate on the issue? I can reopen.
> >
> > Shai
> >
> > On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Looks good Shai!
> >>
> >> Comments below too:
> >>
> >> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
> >> > Hi
> >> >
> >> > I looked into porting it to 3x, and prepared the attached patch. It
> >> > only contains the new TieredMP and Test, as well as the necessary
> >> > changes to LuceneTestCase and IndexWriter. I guess you can start with
> >> > it (even just the MP and IW changes) to test it on your indexes.
> >> >
> >> > Mike, I saw that there were many more changes, as part of LUCENE-1076,
> >> > done to the code. In particular, this MP is now the default (on
> >> > trunk), so I guess many changes (to tests) were needed because of
> >> > that. Do you remember, if apart from the changes I've included in the
> >> > patch, other important changes w.r.t. this code?
> >>
> >> The only other changes I can think of were some verbosity improvements
> >> to IndexWriter, to support the python script that can make a merge
> >> movie from an infoStream output; but that can wait for when I
> >> back-port to 3.x...
> >>
> >> > As we won't change the default MP on 3x, I'm guessing I don't need to
> >> > port all the changes to 3x.
> >>
> >> Right, I think.
> >>
> >> Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: MergePolicy Thresholds

Posted by Michael McCandless <lu...@mikemccandless.com>.
That'd be great, thanks :)

Yes, let's iterate on the issue!  But: it should still be open, I hope
(I didn't mean to close it yet, since it's not back ported)...

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 5:51 AM, Shai Erera <se...@gmail.com> wrote:
> Mike, if you want, I can back-port it, as I've already started this when
> preparing the patch.
>
> I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok on
> 3x too? It'll be a backwards change.
>
> Maybe we should iterate on the issue? I can reopen.
>
> Shai
>
> On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Looks good Shai!
>>
>> Comments below too:
>>
>> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
>> > Hi
>> >
>> > I looked into porting it to 3x, and prepared the attached patch. It only
>> > contains the new TieredMP and Test, as well as the necessary changes to
>> > LuceneTestCase and IndexWriter. I guess you can start with it (even just
>> > the MP and IW changes) to test it on your indexes.
>> >
>> > Mike, I saw that there were many more changes, as part of LUCENE-1076,
>> > done to the code. In particular, this MP is now the default (on trunk),
>> > so I guess many changes (to tests) were needed because of that. Do you
>> > remember, if apart from the changes I've included in the patch, other
>> > important changes w.r.t. this code?
>>
>> The only other changes I can think of were some verbosity improvements
>> to IndexWriter, to support the python script that can make a merge
>> movie from an infoStream output; but that can wait for when I
>> back-port to 3.x...
>>
>> > As we won't change the default MP on 3x, I'm guessing I don't need to
>> > port all the changes to 3x.
>>
>> Right, I think.
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
Mike, if you want, I can back-port it, as I've already started this when
preparing the patch.

I noticed that you added a "throws IOE" to IW.setInfoStream -- is it ok on
3x too? It'll be a backwards change.

Maybe we should iterate on the issue? I can reopen.

Shai

On Tue, May 3, 2011 at 12:36 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Looks good Shai!
>
> Comments below too:
>
> On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > I looked into porting it to 3x, and prepared the attached patch. It only
> > contains the new TieredMP and Test, as well as the necessary changes to
> > LuceneTestCase and IndexWriter. I guess you can start with it (even just
> > the MP and IW changes) to test it on your indexes.
> >
> > Mike, I saw that there were many more changes, as part of LUCENE-1076,
> > done to the code. In particular, this MP is now the default (on trunk),
> > so I guess many changes (to tests) were needed because of that. Do you
> > remember, if apart from the changes I've included in the patch, other
> > important changes w.r.t. this code?
>
> The only other changes I can think of were some verbosity improvements
> to IndexWriter, to support the python script that can make a merge
> movie from an infoStream output; but that can wait for when I
> back-port to 3.x...
>
> > As we won't change the default MP on 3x, I'm guessing I don't need to
> > port all the changes to 3x.
>
> Right, I think.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: MergePolicy Thresholds

Posted by Michael McCandless <lu...@mikemccandless.com>.
Looks good Shai!

Comments below too:

On Tue, May 3, 2011 at 5:29 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> I looked into porting it to 3x, and prepared the attached patch. It only
> contains the new TieredMP and Test, as well as the necessary changes to
> LuceneTestCase and IndexWriter. I guess you can start with it (even just the
> MP and IW changes) to test it on your indexes.
>
> Mike, I saw that there were many more changes, as part of LUCENE-1076, done
> to the code. In particular, this MP is now the default (on trunk), so I
> guess many changes (to tests) were needed because of that. Do you remember,
> if apart from the changes I've included in the patch, other important
> changes w.r.t. this code?

The only other changes I can think of were some verbosity improvements
to IndexWriter, to support the python script that can make a merge
movie from an infoStream output; but that can wait for when I
back-port to 3.x...

> As we won't change the default MP on 3x, I'm guessing I don't need to port
> all the changes to 3x.

Right, I think.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
Hi

I looked into porting it to 3x, and prepared the attached patch. It only
contains the new TieredMP and Test, as well as the necessary changes to
LuceneTestCase and IndexWriter. I guess you can start with it (even just the
MP and IW changes) to test it on your indexes.

Mike, I saw that there were many more changes, as part of LUCENE-1076, done
to the code. In particular, this MP is now the default (on trunk), so I
guess many changes (to tests) were needed because of that. Do you remember,
if apart from the changes I've included in the patch, other important
changes w.r.t. this code?

As we won't change the default MP on 3x, I'm guessing I don't need to port
all the changes to 3x.

Shai

On Mon, May 2, 2011 at 9:41 PM, Burton-West, Tom <tb...@umich.edu> wrote:

> Hi Shai and Mike,
>
> Testing the TieredMP on our large indexes has been on my todo list since I
> read Mike's blog post:
>
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
> If you port it to the 3.x branch, Shai, I'll be more than happy to test it
> with our very large (300GB+) indexes. Besides being able to set the max
> merged segment size, I'm especially interested in using the
> maxSegmentsPerTier parameter.
>
> From Mike's blog post:
> " ...maxSegmentsPerTier that lets you set the allowed width (number of
> segments) of each stair in the staircase. This is nice because it decouples
> how many segments to merge at a time from how wide the staircase can be."
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Monday, May 02, 2011 2:19 PM
> To: dev@lucene.apache.org
> Subject: Re: MergePolicy Thresholds
>
> I think it should be an easy port...
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Mon, May 2, 2011 at 2:16 PM, Shai Erera <se...@gmail.com> wrote:
> > Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
> > way, or do you think it can easily be ported to 3x?
> > Shai
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

RE: MergePolicy Thresholds

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Shai and Mike,

Testing the TieredMP on our large indexes has been on my todo list since I read Mike's blog post:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

If you port it to the 3.x branch, Shai, I'll be more than happy to test it with our very large (300GB+) indexes. Besides being able to set the max merged segment size, I'm especially interested in using the maxSegmentsPerTier parameter.

From Mike's blog post:
" ...maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be."
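
A minimal sketch of how the two settings mentioned above might be applied, assuming a Lucene release that ships TieredMergePolicy (3.2 or later, going by the backport plan in this thread). The setters used here (setMaxMergedSegmentMB, setSegmentsPerTier, setMaxMergeAtOnce) are the ones found in the 3.x/4.x-era TieredMergePolicy API, and the blog's maxSegmentsPerTier appears to correspond to setSegmentsPerTier; check both against the Javadoc of the release you build. The class name TieredMergeSettings and the 20GB / 10 / 10 values are illustrative assumptions, not recommendations.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

/** Hypothetical helper; the class name and the values are illustrative only. */
public final class TieredMergeSettings {

  /** Installs a TieredMergePolicy on an existing IndexWriterConfig. */
  public static IndexWriterConfig apply(IndexWriterConfig config) {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();

    // Target size for merged segments, in MB (20GB here). Per the thread this
    // is only an estimate that the policy tries to reach, not a hard limit.
    mergePolicy.setMaxMergedSegmentMB(20 * 1024.0);

    // Allowed width (number of segments) of each tier of the staircase.
    mergePolicy.setSegmentsPerTier(10.0);

    // How many segments are merged at a time, set independently of the width.
    mergePolicy.setMaxMergeAtOnce(10);

    config.setMergePolicy(mergePolicy);
    return config;
  }
}
```

The caller builds the IndexWriterConfig as usual for its Lucene version and passes it through apply() before creating the IndexWriter.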

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Monday, May 02, 2011 2:19 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

I think it should be an easy port...

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 2:16 PM, Shai Erera <se...@gmail.com> wrote:
> Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
> way, or do you think it can easily be ported to 3x?
> Shai
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Michael McCandless <lu...@mikemccandless.com>.
I think it should be an easy port...

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 2:16 PM, Shai Erera <se...@gmail.com> wrote:
> Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
> way, or do you think it can easily be ported to 3x?
> Shai
>
> On Mon, May 2, 2011 at 6:34 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Actually the new TieredMergePolicy (only on trunk currently but I plan
>> to backport for 3.2) lets you set the max merged segment size
>> (maxMergedSegmentMB).
>>
>> It's only an "estimate", but if it's set, it tries to pick a merge
>> reaching around that target size.
>>
>> Mike
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, May 2, 2011 at 9:03 AM, Shai Erera <se...@gmail.com> wrote:
>> > Hi
>> >
>> > Today, LogMP allows you to set different thresholds for segments sizes,
>> > thereby allowing you to control the largest segment that will be
>> > considered for merge + the largest segment your index will hold (=~
>> > threshold * mergeFactor).
>> >
>> > So, if you want to end up w/ say 20GB segments, you can set
>> > maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
>> >
>> > However, this often does not achieve your desired goal -- if the index
>> > contains 5 and 7 GB segments, they will never be merged b/c they are
>> > bigger than the threshold. I am willing to spend the CPU and IO
>> > resources to end up w/ 20 GB segments, whether I'm merging 10 segments
>> > together or only 2. After I reach a 20GB segment, it can rest
>> > peacefully, at least until I increase the threshold.
>> >
>> > So I wonder, first, if this threshold (i.e., largest segment size you
>> > would like to end up with) is more natural to set than the current
>> > thresholds, from the application level? I.e., wouldn't it be a simpler
>> > threshold to set instead of doing weird calculus that depend on
>> > maxMergeMB(ForOptimize) and mergeFactor?
>> >
>> > Second, should this be an addition to LogMP, or a different
>> > type of MP. One that adheres to only those two factors (perhaps the
>> > segSize threshold should be allowed to set differently for optimize and
>> > regular merges). It can pick segments for merge such that it maximizes
>> > the result segment size (i.e., don't necessarily merge in sequential
>> > order), but not more than mergeFactor.
>> >
>> > I guess, if we think that maxResultSegmentSizeMB is more intuitive than
>> > the current thresholds, application-wise, then this change should go
>> > into LogMP. Otherwise, it feels like a different MP is needed, because
>> > LogMP is already complicated and another threshold would confuse things.
>> >
>> > What do you think of this? Am I trying to optimize too much? :)
>> >
>> > Shai
>> >
>> >
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: MergePolicy Thresholds

Posted by Shai Erera <se...@gmail.com>.
Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
way, or do you think it can easily be ported to 3x?

Shai

On Mon, May 2, 2011 at 6:34 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Actually the new TieredMergePolicy (only on trunk currently but I plan
> to backport for 3.2) lets you set the max merged segment size
> (maxMergedSegmentMB).
>
> It's only an "estimate", but if it's set, it tries to pick a merge
> reaching around that target size.
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Mon, May 2, 2011 at 9:03 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > Today, LogMP allows you to set different thresholds for segments sizes,
> > thereby allowing you to control the largest segment that will be
> > considered for merge + the largest segment your index will hold (=~
> > threshold * mergeFactor).
> >
> > So, if you want to end up w/ say 20GB segments, you can set
> > maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
> >
> > However, this often does not achieve your desired goal -- if the index
> > contains 5 and 7 GB segments, they will never be merged b/c they are
> > bigger than the threshold. I am willing to spend the CPU and IO resources
> > to end up w/ 20 GB segments, whether I'm merging 10 segments together or
> > only 2. After I reach a 20GB segment, it can rest peacefully, at least
> > until I increase the threshold.
> >
> > So I wonder, first, if this threshold (i.e., largest segment size you
> > would like to end up with) is more natural to set than the current
> > thresholds, from the application level? I.e., wouldn't it be a simpler
> > threshold to set instead of doing weird calculus that depends on
> > maxMergeMB(ForOptimize) and mergeFactor?
> >
> > Second, should this be an addition to LogMP, or a different
> > type of MP. One that adheres to only those two factors (perhaps the
> > segSize threshold should be allowed to set differently for optimize and
> > regular merges). It can pick segments for merge such that it maximizes
> > the result segment size (i.e., don't necessarily merge in sequential
> > order), but not more than mergeFactor.
> >
> > I guess, if we think that maxResultSegmentSizeMB is more intuitive than
> > the current thresholds, application-wise, then this change should go
> > into LogMP. Otherwise, it feels like a different MP is needed, because
> > LogMP is already complicated and another threshold would confuse things.
> >
> > What do you think of this? Am I trying to optimize too much? :)
> >
> > Shai
> >
> >
>
>

Re: MergePolicy Thresholds

Posted by Michael McCandless <lu...@mikemccandless.com>.
Actually the new TieredMergePolicy (only on trunk currently but I plan
to backport for 3.2) lets you set the max merged segment size
(maxMergedSegmentMB).

It's only an "estimate", but if it's set, it tries to pick a merge
reaching around that target size.
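
As a rough illustration of the difference discussed in this thread, the
sketch below contrasts a per-segment threshold (segments bigger than
maxMergeMB are left alone) with a target result size. It is not Lucene code
and not TieredMergePolicy's actual selection logic: the class and method
names are hypothetical, sizes are in MB, and the greedy largest-first choice
is just one simple way to approach a target size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and selection strategy are assumptions for illustration only.
class MergePickSketch {

    /** Threshold rule as described in the thread: a segment bigger than maxMergeMB is never merged. */
    static List<Long> eligibleByThreshold(long[] segmentSizesMB, long maxMergeMB) {
        List<Long> eligible = new ArrayList<>();
        for (long size : segmentSizesMB) {
            if (size <= maxMergeMB) {
                eligible.add(size);
            }
        }
        return eligible;
    }

    /**
     * Target-size rule: walk the segments from largest to smallest and take each one
     * that still fits under maxResultSegmentSizeMB, stopping at mergeFactor segments.
     * Returns an empty list when fewer than two segments fit (nothing to merge).
     */
    static List<Long> pickByTargetSize(long[] segmentSizesMB, long maxResultSegmentSizeMB, int mergeFactor) {
        long[] sorted = segmentSizesMB.clone();
        Arrays.sort(sorted); // ascending; iterated in reverse below
        List<Long> picked = new ArrayList<>();
        long total = 0;
        for (int i = sorted.length - 1; i >= 0 && picked.size() < mergeFactor; i--) {
            if (total + sorted[i] <= maxResultSegmentSizeMB) {
                picked.add(sorted[i]);
                total += sorted[i];
            }
        }
        return picked.size() >= 2 ? picked : new ArrayList<>();
    }

    public static void main(String[] args) {
        // The 5 GB and 7 GB segments from the example, plus two small ones.
        long[] sizes = {5120, 7168, 1024, 512};
        System.out.println("threshold rule (2 GB): " + eligibleByThreshold(sizes, 2048));
        System.out.println("target rule (20 GB):   " + pickByTargetSize(sizes, 20480, 10));
    }
}
```

With a 2 GB threshold the 5 GB and 7 GB segments are never candidates, while
a 20 GB target lets them be merged together with the smaller segments.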

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 9:03 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> Today, LogMP allows you to set different thresholds for segments sizes,
> thereby allowing you to control the largest segment that will be
> considered for merge + the largest segment your index will hold (=~
> threshold * mergeFactor).
>
> So, if you want to end up w/ say 20GB segments, you can set
> maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
>
> However, this often does not achieve your desired goal -- if the index
> contains 5 and 7 GB segments, they will never be merged b/c they are
> bigger than the threshold. I am willing to spend the CPU and IO resources
> to end up w/ 20 GB segments, whether I'm merging 10 segments together or
> only 2. After I reach a 20GB segment, it can rest peacefully, at least
> until I increase the threshold.
>
> So I wonder, first, if this threshold (i.e., largest segment size you
> would like to end up with) is more natural to set than the current
> thresholds, from the application level? I.e., wouldn't it be a simpler
> threshold to set instead of doing weird calculus that depends on
> maxMergeMB(ForOptimize) and mergeFactor?
>
> Second, should this be an addition to LogMP, or a different
> type of MP. One that adheres to only those two factors (perhaps the
> segSize threshold should be allowed to set differently for optimize and
> regular merges). It can pick segments for merge such that it maximizes
> the result segment size (i.e., don't necessarily merge in sequential
> order), but not more than mergeFactor.
>
> I guess, if we think that maxResultSegmentSizeMB is more intuitive than
> the current thresholds, application-wise, then this change should go
> into LogMP. Otherwise, it feels like a different MP is needed, because
> LogMP is already complicated and another threshold would confuse things.
>
> What do you think of this? Am I trying to optimize too much? :)
>
> Shai
>
>
