You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Yasuhiro Matsuda (JIRA)" <ji...@apache.org> on 2009/05/12 18:33:45 UTC

[jira] Created: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

LogMergePolicy should use the number of deleted docs when deciding which segments to merge
------------------------------------------------------------------------------------------

                 Key: LUCENE-1634
                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Yasuhiro Matsuda


I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.

I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.

This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1634:
------------------------------------------

    Assignee: Michael McCandless

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1634.
----------------------------------------

    Resolution: Fixed

Thanks Yasuhiro!

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch, LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709015#action_12709015 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

bq. So to implement your own MergePolicy, you have to resort back to sneaking the class into the package.

Right, this is currently necessary for a custom MergePolicy/Scheduler.  It's been discussed before:

  http://www.nabble.com/MergePolicy-public-but-SegmentInfos-package-protected--tt22687527.html

I suppose since merge selection needs so little info about a segment, we could make a public thine wrapper/veneer that exposes a limited number of things.  Or maybe we go whole hog and simply make SegmentInfos/SegmentInfo public.

bq.  it is hidden because these methods are package protected

If we could javadoc certain package protected classes, that'd give you the javadocs at least.  Or we should simply make these methods public?

bq. Not only seg/getUseCompoundFile is no longer applicable if LogMergePolicy is not used, also popular methods such as set/getMergeFactor etc. are only applicable to LogMergePolicy.

But the notion of a mergeFactor is very much a LogMergePolicy specific thing.  Other merge policies might not limit themselves to always merging mergeFactor segments at once.  These are convenience methods on IndexWriter (that simply forward the request to the MergePolicy).

bq. my guess is that set/getCompoundFile should be applicable to all implementations of MergePolicy

I think that'd make sense.  (I can't remember exactly why, but way back when, I think there was some reason for not doing so...)

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "John Wang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708986#action_12708986 ] 

John Wang commented on LUCENE-1634:
-----------------------------------

This is actually referring to the optimize(int) call, which selectively merges segments to make sure the total segments in the index is less or equal to the specified number.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1634:
---------------------------------------

         Priority: Minor  (was: Major)
    Fix Version/s: 2.9

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708993#action_12708993 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

bq. As the API current stands, I think the behavior is to assume a subclass of LogMergePolicy.

Hmm -- that's not intended.  What goes wrong if you use a custom MergePolicy that's not a LogMergePolicy?

(There are utility methods on IndexWriter, eg set/getUseCompoundFile, etc., that will only work if your MergePolicy is a LogMergePolicy, but that shouldn't prevent you from using a non-LogMergePolicy as long as you don't use these utility methods).

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "John Wang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709003#action_12709003 ] 

John Wang commented on LUCENE-1634:
-----------------------------------

>> So let's proceed with this patch, once you've added setter/getter.
Can you please elaborate on this? Add setter/getter on what? The number of target segment is already an input parameter? do you mean some sort of normalization factor on the how much to "punish" segments with high deleted docs?

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708991#action_12708991 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

bq. This is actually referring to the optimize(int) call

Ahh, woops -- sorry, I missed that the first time around.  But I don't think the patch addresses how optimize(int) selects its merges?

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Yasuhiro Matsuda (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yasuhiro Matsuda updated LUCENE-1634:
-------------------------------------

    Attachment: LUCENE-1634.patch

I posted a patch.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Yasuhiro Matsuda (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yasuhiro Matsuda updated LUCENE-1634:
-------------------------------------

    Attachment: LUCENE-1634.patch

I added setCalibrateSizeByDeletes/getCalibrateSizeByDeletes. Sorry that the method names are a bit too wordy.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch, LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708898#action_12708898 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

Actually, optimize() always merges all segments down to 1,
irrespective of deletes.  I think you're referring to "normal" merges?

I think this approach is reasonable and a good step forward.  Can you
make this behaviour get/settable?  I think we should default to the
old behaviour until 3.0, and then switch it to default to the new one
at 3.0 (to preserve back compat).

Thinking more about this... I think we can further improve how
Log*MergePolicy takes deletes into account.  Ie, why not explicitly
measure the deletions and bias merge selection to favor merging away
segments that have the most deletions?

This might require relaxing the merge policy so that it's allowed to
pick fewer than mergeFactor segments to merge at once; perhaps it's
given a min/max mergeFactor.

Likely such a change should be a new merge policy...


> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709441#action_12709441 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

Patch looks good, thanks Yasuhiro.  I plan to commit in a day or two.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch, LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "John Wang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709002#action_12709002 ] 

John Wang commented on LUCENE-1634:
-----------------------------------

RE: implementing custom MergePolicy
Let me describe in detail on problems of implementing a custom MergePolicy:

1) In IndexWriter code, such methods on MergePolicy is called, e.g. findMergesForOptimize. I believe that is the contract for implementing your own MergePolicy. However, it is "hidden" by the javadoc in terms of documentation, and furthermore, it is hidden because these methods are package protected. So to implement your own MergePolicy, you have to resort back to sneaking the class into the package.

2) Not only seg/getUseCompoundFile is no longer applicable if LogMergePolicy is not used, also popular methods such as set/getMergeFactor etc. are only applicable to LogMergePolicy. (Just to clarify, useCompoundFile is a package-level protected method on the base MergePolicy class, so my guess is that set/getCompoundFile should be applicable to all implementations of MergePolicy.

This brings up another issue about the practice of having to "sneak" classes into a package. We are looking at making our Lucene code, OSGI compliant, and this becomes an issue because we cannot have multiple "bundles" exporting the same package. Which means, I would have to repackage lucene to include my classes that I have snuck into some lucene packages. I would like to use a standard distribution of  a lucene jar (as suggested/echoed by some luceners).


> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709004#action_12709004 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

I mean a setter/getter to turn on/off "taking deletions into account" in Log*MergePolicy.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "John Wang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709024#action_12709024 ] 

John Wang commented on LUCENE-1634:
-----------------------------------

>>I mean a setter/getter to turn on/off "taking deletions into account" in Log*MergePolicy.
Makes sense. What do you suggest the default behavior to be?
Also, do you think setter/getter is the right approach, since this is very much hidden to the API, e.g. one would have to do this:

LogMergePolicy policy = (LogMergePolicy)idxWriter.getMergePolicy();
policy.setTurnOnSegmentCalcWithDeletes(true);

DO you think instead, we can just add static setter/getter on LogMergePolicy class?


> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "John Wang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708995#action_12708995 ] 

John Wang commented on LUCENE-1634:
-----------------------------------

The current lucene implementation, optimize(int) selects segments to merge based on the file size of the segment file: say the index has 10 segments, and optmize(6) is called,  Lucene finds 4 smallest segments by number of bytes in the segment files. 

This selection criteria is flawed because you can have a very large segment in terms of bytes but very small in terms of numDocs (if many deleted docs). Having these segment files around impacts performance considerably. 

This is what this patch is trying to fix this in a non-intrusive manner by extending the LogMergePolicy and by normalizing the calculation of the segment size including the delete count.


> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "John Wang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708989#action_12708989 ] 

John Wang commented on LUCENE-1634:
-----------------------------------

Comment on implementing a custom merge policy:
As the API current stands, I think the behavior is to assume a subclass of LogMergePolicy. And one cannot subclass LogMergePolicy without injecting the class into the org.apache.lucene.index package, (because the api signature: size(org.apache.lucene.index.SegmentInfo info), SegmentInfo is not an exposed API.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709080#action_12709080 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

bq. What do you suggest the default behavior to be?

I think default to no change, and in 3.0 we flip it?

bq. DO you think instead, we can just add static setter/getter on LogMergePolicy class?

I'd lean towards non-static setter/getters.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708999#action_12708999 ] 

Michael McCandless commented on LUCENE-1634:
--------------------------------------------

bq. say the index has 10 segments, and optmize(6) is called, Lucene finds 4 smallest segments by number of bytes in the segment files.

Oh yeah, you're right: for the last (partial) merge, it uses the size() of the segments to pick the merge that's minimal total size.

So let's proceed with this patch, once you've added setter/getter.

I'd still love to see a merge policy that does a better job explicitly "targeting" segments with lots of deletions.

> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments with a lot of deletes even when most of the docs are deleted. And the existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand at a time.  I ran optimize(20) occasionally. What saw were large segments with most of docs deleted. Although these segments did not have valid docs they remained in the directory for a very long time until more segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments but does not take the number of deleted documents into consideration when it decides which segments to merge. So, a simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org