Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/03/31 14:15:25 UTC

[jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Create merge policy that doesn't periodically inadvertently optimize
--------------------------------------------------------------------

                 Key: LUCENE-854
                 URL: https://issues.apache.org/jira/browse/LUCENE-854
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
    Affects Versions: 2.2
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor
             Fix For: 2.2


The current merge policy, at every maxBufferedDocs *
power-of-mergeFactor docs added, will do a fully cascaded merge, which
is the same as an optimize.

I think this is not good because at that "optimization point", the
particular addDocument call is [surprisingly] very expensive.  While,
amortized over all addDocument calls, the cost is low, the cost is
paid "up front" and in a very "bunched up" manner.

I think of this as "pay it forward": you are paying the full cost of
an optimize right now on the expectation / hope that you will be
adding a great many more docs.  But, if you don't add that many more
docs, then, the amortized cost for your index is in fact far higher
than it should have been.  Better to "pay as you go" instead.
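
For a concrete (illustrative) example: with maxBufferedDocs=10 and
mergeFactor=10, the fully cascaded merges land at 100, 1,000, 10,000,
... docs.  If you stop adding shortly after 10,000 docs, the
addDocument call at doc 10,000 just rewrote the entire index into a
single segment, and the many future adds that would have justified
that cost never arrive.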

So we could make a small change to the policy by only merging the
first mergeFactor segments once we hit 2X the merge factor.  With
mergeFactor=10, when we have created the 20th level 0 (just flushed)
segment, we merge the first 10 into a level 1 segment.  Then on
creating another 10 level 0 segments, we merge the second set of 10
level 0 segments into a level 1 segment, etc.
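
Here is a rough, self-contained simulation of that trigger rule (just
a sketch: the class, the "levels" list, and the one-segment-per-flush
simplification are invented for illustration, not Lucene's actual
MergePolicy code):

  import java.util.ArrayList;
  import java.util.List;

  // Simulates the proposed policy: each entry in "levels" counts the
  // segments at that level; every flush adds a level-0 segment; when
  // a level reaches 2 * mergeFactor segments, the oldest mergeFactor
  // of them are merged into a single segment one level up.
  public class TwoXMergePolicySim {
    public static void main(String[] args) {
      final int mergeFactor = 10;
      List<Integer> levels = new ArrayList<>();
      levels.add(0);
      for (int flush = 1; flush <= 2110; flush++) {
        levels.set(0, levels.get(0) + 1);          // new level-0 segment
        for (int level = 0; level < levels.size(); level++) {
          if (levels.get(level) >= 2 * mergeFactor) {
            levels.set(level, levels.get(level) - mergeFactor);
            if (level + 1 == levels.size()) levels.add(0);
            levels.set(level + 1, levels.get(level + 1) + 1);
          }
        }
      }
      // Prints [10, 10, 10, 1]: merges cascade, but the index is
      // never collapsed to one segment (no inadvertent optimize).
      System.out.println(levels);
    }
  }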

With this new merge policy, an index that's a bit bigger than a
current "optimization point" would then have a lower amortized cost
per document.  Plus the merge cost is less "bunched up" and less "pay
it forward": instead you pay for what you are actually using.

We can start by creating this merge policy (probably combined with
the "by size not by doc count" segment level computation from
LUCENE-845) and then later decide whether we should make it the
default merge policy.




Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Yonik Seeley" <yo...@apache.org> wrote:
> On 5/3/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> > I like your idea to keep "delete count per segment" in the segments
> > file.  This information is certainly useful to the merge policy
> > because it should proportionally reduce a segment's size according to
> > what percentage of its docs are deleted, and it should favor merging
> > segments with many deletes to free up the storage.
> 
> Looking at numDocs instead of maxDoc would be a simple way.

But we [currently] need a SegmentReader to get that information, right?  I
think Ning was proposing that we store this (numDocs) in the segments_N file
to save having to open a SegmentReader for each segment we may want to merge.

Mike



Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Yonik Seeley <yo...@apache.org>.
On 5/3/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> I like your idea to keep "delete count per segment" in the segments
> file.  This information is certainly useful to the merge policy
> because it should proportionally reduce a segment's size according to
> what percentage of its docs are deleted, and it should favor merging
> segments with many deletes to free up the storage.

Looking at numDocs instead of maxDoc would be a simple way.

-Yonik



Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Ning Li" <ni...@gmail.com> wrote:

> > It would merge based on size (not # docs), would be free to merge
> > adjacent segments (not just rightmost segments), and would merge N
> > (configurable) at a time.  The part that's still unclear is how it
> > chooses when to "trigger" a merge and how specifically it picks which
> > N segments to merge (maybe: the series of N adjacent segments that are
> > "most similar" in size, but favoring smaller segments over larger
> > ones).
> 
> Those two are very good questions. It's a challenge to make it work in
> all cases. One example is the sandwich case, where two large segments
> sandwich a small one. I'll think about it... It'd be even better if we
> can take deletes into consideration: it's more beneficial to merge a
> segment with more deletes. Right now, we have to open an IndexReader
> to get the number of deletes. We could store that in the segments file if
> we decide IndexWriter/MergePolicy will need that...

Yes, the sandwich case would be challenging, though how would you get
to the sandwich case in the first place?  I guess if RAM had flushed
that way, or if many deletes accumulated on the middle one.  But I
don't think merging would tend to produce sandwich cases itself (since
it would have merged that middle one).

I like your idea to keep "delete count per segment" in the segments
file.  This information is certainly useful to the merge policy
because it should proportionally reduce a segment's size according to
what percentage of its docs are deleted, and it should favor merging
segments with many deletes to free up the storage.
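
For illustration, the proportional discount could look something like
this (a hypothetical helper with made-up field names, not Lucene's
actual SegmentInfo API):

  // Hypothetical helper (not Lucene's API): discount a segment's size
  // by the fraction of its docs that are deleted, so a half-deleted
  // segment looks half as big to the merge policy and gets merged
  // away sooner, freeing the storage held by its deletes.
  class SegmentSizing {
    static long effectiveSize(long sizeInBytes, int delCount, int maxDoc) {
      if (maxDoc <= 0) return 0;
      double liveFraction = 1.0 - ((double) delCount / maxDoc);
      return (long) (sizeInBytes * liveFraction);
    }
  }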

Mike



Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Ning Li <ni...@gmail.com>.
Steve, Mike,

Thanks for the explanation! I meant cascading but wrote optimizing. So
it still cascades merges.

> It would merge based on size (not # docs), would be free to merge
> adjacent segments (not just rightmost segments), and would merge N
> (configurable) at a time.  The part that's still unclear is how it
> chooses when to "trigger" a merge and how specifically it picks which
> N segments to merge (maybe: the series of N adjacent segments that are
> "most similar" in size, but favoring smaller segments over larger
> ones).

Those two are very good questions. It's a challenge to make it work in
all cases. One example is the sandwich case, where two large segments
sandwich a small one. I'll think about it... It'd be even better if we
can take deletes into consideration: it's more beneficial to merge a
segment with more deletes. Right now, we have to open an IndexReader
to get the number of deletes. We could store that in the segments file if
we decide IndexWriter/MergePolicy will need that...

Ning



Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <ma...@mikemccandless.com>.
"Yonik Seeley" <yo...@apache.org> wrote:
> On 5/2/07, Michael McCandless <ma...@mikemccandless.com> wrote:
> > It would merge based on size (not # docs), would be free to merge
> > adjacent segments (not just rightmost segments), and would merge N
> > (configurable) at a time.
> 
> Hopefully it will always be easy to switch the meaning of "size" so it
> could be in bytes or number of documents.  Just keeping in mind the
> ParallelReader...

Good point.  I had been thinking we would keep the current merge
policy as a choice that users of ParallelReader could use.

But we could, in addition, give this new merge policy a "switch" for
whether size is measured by #docs vs by #bytes.  The benefit for users
of ParallelReader would then be that, unlike the current merge policy,
this one takes deletes into account and has the freedom to merge
adjacent (not just rightmost) segments.  This seems like the way to go.
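
For illustration only (made-up names, not the real MergePolicy API),
such a switch could be as simple as:

  // Hypothetical switch: measure "size" in bytes or in doc count.
  // ParallelReader users would pick DOCS so that parallel indexes
  // keep identical doc numbering after merges.
  class SegmentSize {
    enum Unit { BYTES, DOCS }

    static long sizeOf(long sizeInBytes, int numDocs, Unit unit) {
      return unit == Unit.BYTES ? sizeInBytes : numDocs;
    }
  }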

Mike
-- 
  Michael McCandless
  mail@mikemccandless.com




Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Yonik Seeley <yo...@apache.org>.
On 5/2/07, Michael McCandless <ma...@mikemccandless.com> wrote:
> It would merge based on size (not # docs), would be free to merge
> adjacent segments (not just rightmost segments), and would merge N
> (configurable) at a time.

Hopefully it will always be easy to switch the meaning of "size" so it
could be in bytes or number of documents.  Just keeping in mind the
ParallelReader...

-Yonik



Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Michael McCandless <ma...@mikemccandless.com>.
"Ning Li" <ni...@gmail.com> wrote:
> On 3/31/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> > Create merge policy that doesn't periodically inadvertently optimize
> > --------------------------------------------------------------------
> > So we could make a small change to the policy by only merging the
> > first mergeFactor segments once we hit 2X the merge factor.  With
> > mergeFactor=10, when we have created the 20th level 0 (just flushed)
> > segment, we merge the first 10 into a level 1 segment.  Then on
> > creating another 10 level 0 segments, we merge the second set of 10
> > level 0 segments into a level 1 segment, etc.
> 
> Hi Mike,
> 
> When a 20th level 0 segment triggers a 20th level 1 segment which
> triggers a 20th level 2 segment... we are still optimizing, aren't we?
> Am I missing something here?

The merge would "cascade" in this case, but would not optimize (you
will have more than one segment in the end).

Each time you cascade you only merge the first 10 at each level, so
after cascading you would have 1 level 3 segment, 10 level 2 segments,
10 level 1 segments and 10 level 0 segments.
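
(By my count, with mergeFactor=10 and one level 0 segment per flush,
you first reach that 10/10/10/1 state after 2,110 flushes; a quick
simulation like the sketch under the issue description confirms it.)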

I'm actually using this merge policy in my patch for LUCENE-843 when
merging the flushed "partial" segments.  This is only used when
IndexWriter is opened with autoCommit=false, and, you add lots
and lots of documents (so RAM flushes many times).

But, I like the proposed merge policy at the end of LUCENE-845 even
better for Lucene's normal merges.

It would merge based on size (not # docs), would be free to merge
adjacent segments (not just rightmost segments), and would merge N
(configurable) at a time.  The part that's still unclear is how it
chooses when to "trigger" a merge and how specifically it picks which
N segments to merge (maybe: the series of N adjacent segments that are
"most similar" in size, but favoring smaller segments over larger
ones).
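
Just to sketch one possible heuristic for that selection (the scoring
below is invented for illustration, nothing settled):

  class MergeWindowPicker {
    // Hypothetical heuristic: scan every window of n adjacent
    // segments and pick the one whose sizes are most similar
    // (smallest max/min spread), weighting the score by total size
    // so that windows of smaller segments win ties.
    static int pickMergeStart(long[] sizes, int n) {
      int best = -1;
      double bestScore = Double.MAX_VALUE;
      for (int i = 0; i + n <= sizes.length; i++) {
        long min = Long.MAX_VALUE, max = 0, total = 0;
        for (int j = i; j < i + n; j++) {
          min = Math.min(min, sizes[j]);
          max = Math.max(max, sizes[j]);
          total += sizes[j];
        }
        double spread = (double) max / Math.max(1L, min);
        double score = spread * Math.log1p((double) total);
        if (score < bestScore) { bestScore = score; best = i; }
      }
      return best;  // start index of the chosen window, or -1 if none
    }
  }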

Mike
-- 
  Michael McCandless
  mail@mikemccandless.com




RE: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Steven Parkes <st...@esseff.org>.
As I understand it (correct me if I'm wrong, Mike), there will be a
cascading merge in this case, but not an optimize, because you don't
merge all (twenty) segments at each level, just half of them (ten).

-----Original Message-----
From: Ning Li [mailto:ning.li.li@gmail.com] 
Sent: Wednesday, May 02, 2007 6:57 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Created: (LUCENE-854) Create merge policy that
doesn't periodically inadvertently optimize

On 3/31/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor.  With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment.  Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.

Hi Mike,

When a 20th level 0 segment triggers a 20th level 1 segment which
triggers a 20th level 2 segment... we are still optimizing, aren't we?
Am I missing something here?

Regards,
Ning





Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by Ning Li <ni...@gmail.com>.
On 3/31/07, Michael McCandless (JIRA) <ji...@apache.org> wrote:
> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor.  With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment.  Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.

Hi Mike,

When a 20th level 0 segment triggers a 20th level 1 segment which
triggers a 20th level 2 segment... we are still optimizing, aren't we?
Am I missing something here?

Regards,
Ning



[jira] Updated: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-854:
--------------------------------------

    Fix Version/s:     (was: 2.2)


