Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2010/10/13 18:52:42 UTC

[jira] Created: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Factor maxMergeSize into findMergesForOptimize in LogMergePolicy
----------------------------------------------------------------

                 Key: LUCENE-2701
                 URL: https://issues.apache.org/jira/browse/LUCENE-2701
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Shai Erera
            Assignee: Shai Erera
             Fix For: 3.1, 4.0


LogMergePolicy allows you to specify a maxMergeSize in MB, which is taken into consideration in regular merges, yet ignored by findMergesForOptimize. I think it'd be good if we take that into consideration even when optimizing. This will allow the caller to specify two constraints: maxNumSegments and maxMergeMB. Obviously it may not be possible to satisfy both, so we will guarantee that if any segment is above the threshold, the threshold constraint takes precedence, and you may therefore end up with fewer than maxNumSegments segments (if it's not 1) after optimize. Otherwise, maxNumSegments is taken into consideration.
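
To make the precedence rule above concrete, here is a minimal sketch; it is not the patch and not Lucene code. It assumes segments are modeled only by their sizes in bytes, and the names SizeBoundedOptimizeSketch and findMerges are hypothetical. The maxNumSegments fallback is deliberately simplified to a single merge of the trailing segments, which is an assumption made for brevity and not how LogMergePolicy actually picks merges.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a size-bounded optimize; not Lucene code. */
final class SizeBoundedOptimizeSketch {

  /** Returns merges as half-open [start, end) index ranges over segmentSizes. */
  static List<int[]> findMerges(long[] segmentSizes, long maxMergeSize, int maxNumSegments) {
    boolean anyTooLarge = false;
    for (long size : segmentSizes) {
      if (size > maxMergeSize) {
        anyTooLarge = true;
        break;
      }
    }

    List<int[]> merges = new ArrayList<>();
    if (anyTooLarge) {
      // Threshold constraint takes precedence: oversized segments are left
      // alone, and each run of two or more adjacent small segments is merged.
      int runStart = 0;
      for (int i = 0; i <= segmentSizes.length; i++) {
        boolean endOfRun = i == segmentSizes.length || segmentSizes[i] > maxMergeSize;
        if (endOfRun) {
          if (i - runStart >= 2) {
            merges.add(new int[] {runStart, i});
          }
          runStart = i + 1;
        }
      }
    } else if (segmentSizes.length > maxNumSegments) {
      // Simplified fallback (assumption): merge the trailing segments into one
      // so that exactly maxNumSegments segments remain.
      merges.add(new int[] {maxNumSegments - 1, segmentSizes.length});
    }
    return merges;
  }

  public static void main(String[] args) {
    long[] sizes = {10, 20, 500, 30, 40, 50};
    for (int[] range : findMerges(sizes, 100, 5)) {
      System.out.println("[" + range[0] + ", " + range[1] + ")");
    }
  }
}
```

In this sketch, six segments with one oversized segment in the middle and maxNumSegments = 5 yield two merges, leaving three segments, which is fewer than maxNumSegments. With maxNumSegments = 1 the same two merges are returned, so more than one segment remains.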

As part of this change, I plan to change some methods to protected (from private), and some members as well. I realized that anyone who wishes to implement their own LMP extension needs to either put it under o.a.l.index or copy some code over to their implementation.

I'll attach a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2701:
-------------------------------

    Attachment: LUCENE-2701.patch

You're right about the code: the 'else if' covers the case where there is a single non-optimized segment to the right. I added a comment and combined the two into one OR-ed if. I also added a test case.

OneMerge.totalSizeInBytes: no one calls it now, but I would like to write an MP that does, in order to remove merges that exceed a specified total size. It's just a service method, so that you don't need to write it on your own. I renamed it to totalBytesSize. Along the way I added totalNumDocs, which does the same for the number of docs.
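
As an illustration of the kind of merge policy mentioned above, here is a minimal sketch, assuming hypothetical stand-in types. Segment, Merge, and dropOversized are not Lucene classes or methods; only the method names totalBytesSize and totalNumDocs are taken from this comment, and their real signatures on OneMerge may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; these types are stand-ins, not Lucene classes. */
final class MergeSizeFilterSketch {

  /** Stand-in for a segment: only its byte size and document count. */
  static final class Segment {
    final long sizeInBytes;
    final int numDocs;

    Segment(long sizeInBytes, int numDocs) {
      this.sizeInBytes = sizeInBytes;
      this.numDocs = numDocs;
    }
  }

  /** Stand-in for a single merge over a fixed list of segments. */
  static final class Merge {
    final List<Segment> segments;

    Merge(Segment... segments) {
      this.segments = Collections.unmodifiableList(Arrays.asList(segments));
    }

    /** Sum of the byte sizes of all segments in this merge. */
    long totalBytesSize() {
      long total = 0;
      for (Segment s : segments) {
        total += s.sizeInBytes;
      }
      return total;
    }

    /** Sum of the document counts of all segments in this merge. */
    int totalNumDocs() {
      int total = 0;
      for (Segment s : segments) {
        total += s.numDocs;
      }
      return total;
    }
  }

  /** Keeps only the candidate merges whose total size is within the limit. */
  static List<Merge> dropOversized(List<Merge> candidates, long maxTotalBytes) {
    List<Merge> kept = new ArrayList<>();
    for (Merge m : candidates) {
      if (m.totalBytesSize() <= maxTotalBytes) {
        kept.add(m);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Merge small = new Merge(new Segment(100, 10), new Segment(200, 20));
    Merge big = new Merge(new Segment(5000, 1), new Segment(6000, 2));
    List<Merge> kept = dropOversized(Arrays.asList(small, big), 1000);
    System.out.println("kept " + kept.size() + " of 2 candidate merges");
  }
}
```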

bq. Maybe note somewhere that now optimize (when there's a maxMergeDocs/MB constraint) is able to merge fewer than mergeFactor segments at a time?

Wasn't it able to do so even before? E.g. if maxNumSegments < numSegments < mergeFactor?



[jira] Resolved: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera resolved LUCENE-2701.
--------------------------------

    Resolution: Fixed

Committed revision 1025544 (3x).
Committed revision 1025577 (trunk).



[jira] Updated: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2701:
-------------------------------

    Attachment: LUCENE-2701.patch

Added support for maxMergeDocs as well. Also, I created a test class for size bounded optimize and added several test cases.

I think it's ready to commit, but I'll wait a few days for some reviews.



[jira] Updated: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2701:
-------------------------------

    Attachment: LUCENE-2701.patch

Patch adds maxMergeMB handling to optimize as well. If there are no segments exceeding the threshold, then only maxNumSegments constraint is taken into account. Basically I've created two private methods findMergesForOptimizeMaxMergeSize and findMergesForOptimizeMaxNumSegments (the original logic). findMergesForOptimize calls the relevant one.

I've also changed some members to protected and methods as well, for really easy extension of LMP. As a result, I removed two methods from BalancedSegmentsMP that were copied over from LMP.

I took the opportunity to change OneMerge.segments and useCompoundFile to public - they are final so no risk of changing from the outside. But otherwise, if you would like to write a MP which queries the OneMerge objects, you can't. I added totalSize() to return the total size in bytes of that merge.

Test + CHANGES entry as well.



[jira] Commented: (LUCENE-2701) Factor maxMergeSize into findMergesForOptimize in LogMergePolicy

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921329#action_12921329 ] 

Michael McCandless commented on LUCENE-2701:
--------------------------------------------

Patch looks good!

Maybe rename OneMerge.totalSize -> totalSizeInBytes?  Hmm does anyone
actually call this new method?

Maybe note somewhere that now optimize (when there's a maxMergeDocs/MB
constraint) is able to merge fewer than mergeFactor segments at a
time?

This code is a bit confusing:

{noformat}
       if (last - start - 1 > 1) {
         // there is more than 1 segment to the right of this one.
         spec.add(new OneMerge(infos.range(start + 1, last), useCompoundFile));
       } else if (start != last - 1 && !isOptimized(infos.info(start + 1))) {
         spec.add(new OneMerge(infos.range(start + 1, last), useCompoundFile));
       }
{noformat}

Both if clauses are doing the same thing right?  (Ie merging the chunk
of segs to the right). Maybe put a comment explaining the 2nd one?  (I
think it's for the case where there's 1 segment to our right but it's
not optimized, eg the CFS differs?).  Or maybe consolidate into a single
if?
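
One way the two branches quoted above could be consolidated is sketched below, assuming the isOptimized(...) result is passed in as a plain boolean. CombinedConditionSketch and its method names are hypothetical and are not taken from the patch; the sketch only shows that a single OR-ed condition selects the same cases as the if / else-if pair.

```java
/** Hypothetical sketch comparing the two-branch test with a single OR-ed test. */
final class CombinedConditionSketch {

  /** Mirrors the quoted if / else-if: true when the segments to the right get merged. */
  static boolean twoBranch(int start, int last, boolean nextIsOptimized) {
    if (last - start - 1 > 1) {
      // More than one segment to the right of start.
      return true;
    } else if (start != last - 1 && !nextIsOptimized) {
      // Exactly one segment to the right, and it is not optimized.
      return true;
    }
    return false;
  }

  /** The same decision expressed as one OR-ed condition. */
  static boolean combined(int start, int last, boolean nextIsOptimized) {
    return last - start - 1 > 1 || (start != last - 1 && !nextIsOptimized);
  }

  public static void main(String[] args) {
    // One segment to the right of start (index 3).
    System.out.println(combined(2, 4, false));
    System.out.println(combined(2, 4, true));
  }
}
```

Because both branches add the same merge, the consolidated form keeps the behavior and leaves a single place to document the case of one non-optimized segment to the right.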

