You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Ning Li <ni...@gmail.com> on 2007/05/02 15:56:50 UTC

Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

On 3/31/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor.  With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment.  Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.

Hi Mike,

When a 20th level 0 segment triggers a 20th level 1 segment which
triggers a 20th level 2 segment... we are still optimizing, aren't we?
Am I missing something here?

Regards,
Ning

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Yonik Seeley" <yo...@apache.org> wrote:
> On 5/3/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> > I like your idea to keep "delete count per segment" in the segments
> > file.  This information is certainly useful to the merge policy
> > because it should proportionally reducde a segments size according to
> > what %tg of its docs are deleted, and, it should favor merging
> > segments with high # deletes to free up the storage.
> 
> Looking at numDocs instead of maxDoc would be a simple way.

But we [currently] need a SegmentReader to get that information right?  I think
Ning was proposing that we store this (numDocs) in the segments_N file to save
having to open SegmentReader for each segment we may want to merge.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Yonik Seeley <yo...@apache.org>.
On 5/3/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> I like your idea to keep "delete count per segment" in the segments
> file.  This information is certainly useful to the merge policy
> because it should proportionally reducde a segments size according to
> what %tg of its docs are deleted, and, it should favor merging
> segments with high # deletes to free up the storage.

Looking at numDocs instead of maxDoc would be a simple way.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Ning Li" <ni...@gmail.com> wrote:

> > It would merge based on size (not # docs), would be free to merge
> > adjacent segments (not just rightmost segments), and would merge N
> > (configurable) at a time.  The part that's still unclear is how it
> > chooses when to "trigger" a merge and how specifically it picks which
> > N segments to merge (maybe: the series of N adjacent segments that are
> > "most similar" in size, but favoring smaller segments over larger
> > ones).
> 
> Those two are very good questions. It's a challenge to make it work in
> all case. One example is the sandwich case, where two large segments
> sandwich a small one. I'll think about it... It'd be even better if we
> can take deletes into consideration: it's more beneficial to merge a
> segment with more deletes. Right now, we have to open an IndexReader
> to get the number of deletes. We could store that in segments file if
> we decide IndexWriter/MergePolicy will need that...

Yes the sandwich case would be challenging, though, how would you get
to the sandwich case in the first place?  I guess if RAM had flushed
that way; or if many deletes accumulated on the middle one.  But I
don't think merging would tend to produce sandwich cases itself (since
it would have merged that middle one).

I like your idea to keep "delete count per segment" in the segments
file.  This information is certainly useful to the merge policy
because it should proportionally reducde a segments size according to
what %tg of its docs are deleted, and, it should favor merging
segments with high # deletes to free up the storage.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Ning Li <ni...@gmail.com>.
Steve, Mike,

Thanks for the explanation! I meant cascading but wrote optimizing. So
it still cascades merges.

> It would merge based on size (not # docs), would be free to merge
> adjacent segments (not just rightmost segments), and would merge N
> (configurable) at a time.  The part that's still unclear is how it
> chooses when to "trigger" a merge and how specifically it picks which
> N segments to merge (maybe: the series of N adjacent segments that are
> "most similar" in size, but favoring smaller segments over larger
> ones).

Those two are very good questions. It's a challenge to make it work in
all case. One example is the sandwich case, where two large segments
sandwich a small one. I'll think about it... It'd be even better if we
can take deletes into consideration: it's more beneficial to merge a
segment with more deletes. Right now, we have to open an IndexReader
to get the number of deletes. We could store that in segments file if
we decide IndexWriter/MergePolicy will need that...

Ning

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <ma...@mikemccandless.com>.
"Yonik Seeley" <yo...@apache.org> wrote:
> On 5/2/07, Michael McCandless <ma...@mikemccandless.com> wrote:
> > It would merge based on size (not # docs), would be free to merge
> > adjacent segments (not just rightmost segments), and would merge N
> > (configurable) at a time.
> 
> Hopefully it will always be easy to switch the meaning of "size" so it
> could be in bytes or number of documents.  Just keeping in mind the
> ParallelReader...

Good point.  I had been thinking we would keep the current merge
policy as a choice that users of ParallelReader could use.

But we could in addition give this new merge policy a "switch" as
whether size is measured by #docs vs by #bytes.  The benefit for users
of ParallelReader would then be that the policy takes deletes into
account and has freedom to merge adjacent (not just rightmost)
segments vs the current merge policy.  This seems like the way to go.

Mike
-- 
  Michael McCandless
  mail@mikemccandless.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Yonik Seeley <yo...@apache.org>.
On 5/2/07, Michael McCandless <ma...@mikemccandless.com> wrote:
> It would merge based on size (not # docs), would be free to merge
> adjacent segments (not just rightmost segments), and would merge N
> (configurable) at a time.

Hopefully it will always be easy to switch the meaning of "size" so it
could be in bytes or number of documents.  Just keeping in mind the
ParallelReader...

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <ma...@mikemccandless.com>.
"Ning Li" <ni...@gmail.com> wrote:
> On 3/31/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> > Create merge policy that doesn't periodically inadvertently optimize
> > --------------------------------------------------------------------
> > So we could make a small change to the policy by only merging the
> > first mergeFactor segments once we hit 2X the merge factor.  With
> > mergeFactor=10, when we have created the 20th level 0 (just flushed)
> > segment, we merge the first 10 into a level 1 segment.  Then on
> > creating another 10 level 0 segments, we merge the second set of 10
> > level 0 segments into a level 1 segment, etc.
> 
> Hi Mike,
> 
> When a 20th level 0 segment triggers a 20th level 1 segment which
> triggers a 20th level 2 segment... we are still optimizing, aren't we?
> Am I missing something here?

The merge would "cascade" in this case, but, would not optimize (you
will have > 1 segments in the end).

Each time you cascade you only merge the first 10 at each level, so
after cascading you would have 1 level 3 segment, 10 level 2 segments,
10 level 1 segments and 10 level 0 segments.

I'm actually using this merge policy in my patch for LUCENE-843 when
merging the flushed "partial" segments.  This is only used when
IndexWriter is opened with autoCommit=false, and, you add lots
and lots of documents (so RAM flushes many times).

But, I like the proposed merge policy at the end of LUCENE-845 even
better for Lucene's normal merges.

It would merge based on size (not # docs), would be free to merge
adjacent segments (not just rightmost segments), and would merge N
(configurable) at a time.  The part that's still unclear is how it
chooses when to "trigger" a merge and how specifically it picks which
N segments to merge (maybe: the series of N adjacent segments that are
"most similar" in size, but favoring smaller segments over larger
ones).

Mike
-- 
  Michael McCandless
  mail@mikemccandless.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Steven Parkes <st...@esseff.org>.
As I understand it (correct me if I'm wrong, Mike), there will be a
cascading merge in this case, but not an optimize, because you don't
merge all (twenty) segments at each level, just half of them (ten).

-----Original Message-----
From: Ning Li [mailto:ning.li.li@gmail.com] 
Sent: Wednesday, May 02, 2007 6:57 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Created: (LUCENE-854) Create merge policy that
doesn't periodically inadvertently optimize

On 3/31/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor.  With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment.  Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.

Hi Mike,

When a 20th level 0 segment triggers a 20th level 1 segment which
triggers a 20th level 2 segment... we are still optimizing, aren't we?
Am I missing something here?

Regards,
Ning

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org