Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2014/01/06 19:24:45 UTC

MergePolicy for append-only indices?

Hi,
(cross-posting to both Solr and Lucene user lists because while this is a
Lucene-level question, I suspect a lot of people who know about this or are
interested in this subject are actually on the Solr list)

I have a large append-only index and I looked at merge policies hoping to
identify one that is naturally more suitable for indices without any
updates and deletions, just adds.

I've read
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/TieredMergePolicy.html
and the javadocs for its cousins, but it doesn't look like any of them is
better suited to an append-only index than the others, so Tiered MP, having
the most knobs, is probably the best one to use.

I was wondering whether I'm missing something: is one of the MPs in fact
better for append-only indices, or can anyone suggest how to write a
custom MP that's specialized for append-only indices?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

Re: MergePolicy for append-only indices?

Posted by Otis Gospodnetic <ot...@gmail.com>.
Thanks Mike(s) & Co.
Added https://issues.apache.org/jira/browse/LUCENE-5419

Sounds like a killer feature :)

Otis



On Wed, Jan 8, 2014 at 4:17 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:

> On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov
> <ms...@safaribooksonline.com> wrote:
> > I think the key optimization when there are no deletions is that you don't
> > need to renumber documents and can bulk-copy blocks of contiguous documents,
> > and that is independent of merge policy. I think :)
>
> Merging of term vectors and stored fields will always use bulk-copy
> for contiguous chunks of non-deleted docs, so for the append-only case
> these will be the max chunk size and be efficient.
>
> We have no codec that implements bulk merging for postings, which
> would be interesting to pursue: in the append-only case it's possible,
> and merging of postings is normally by far the most time-consuming
> step of a merge.
>
> Also, no RAM will be used holding the doc mapping, since the docIDs
> don't change.
>
> These benefits are independent of the MergePolicy.
>
> I think TieredMergePolicy will work fine for append-only; I'm not sure
> how you'd improve on its approach.  Because it can merge non-adjacent
> segments, it will in general renumber the docs, so if that's a problem,
> apps should use LogByteSizeMP, which only merges adjacent segments.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: MergePolicy for append-only indices?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> I think the key optimization when there are no deletions is that you don't
> need to renumber documents and can bulk-copy blocks of contiguous documents,
> and that is independent of merge policy. I think :)

Merging of term vectors and stored fields will always use bulk-copy
for contiguous chunks of non-deleted docs, so for the append-only case
these will be the max chunk size and be efficient.
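
To make the "contiguous chunks of non-deleted docs" point concrete, here is a
minimal sketch in plain Java. It is not Lucene code: the class LiveRuns and
the boolean[] standing in for a segment's live-docs bits are made up for
illustration. It counts the runs of live docs a bulk copy could work on, and
shows that a segment with no deletions is a single run spanning the whole
segment.

```java
// Hypothetical illustration only; not part of Lucene.
class LiveRuns {
    /** Counts maximal runs of consecutive live (non-deleted) docs. */
    static int countRuns(boolean[] live) {
        int runs = 0;
        for (int i = 0; i < live.length; i++) {
            // A run starts at a live doc that is first or follows a deleted doc.
            if (live[i] && (i == 0 || !live[i - 1])) {
                runs++;
            }
        }
        return runs;
    }

    /** Length of the longest run of consecutive live docs. */
    static int longestRun(boolean[] live) {
        int best = 0;
        int current = 0;
        for (boolean isLive : live) {
            current = isLive ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    public static void main(String[] args) {
        boolean[] appendOnly = {true, true, true, true, true, true};
        boolean[] withDeletes = {true, true, false, true, false, true};
        System.out.println(countRuns(appendOnly) + " run(s), longest " + longestRun(appendOnly));
        System.out.println(countRuns(withDeletes) + " run(s), longest " + longestRun(withDeletes));
    }
}
```

With deletions scattered through a segment the runs get shorter and more
numerous, which is why the append-only case is the friendliest one for
bulk copying.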

We have no codec that implements bulk merging for postings, which
would be interesting to pursue: in the append-only case it's possible,
and merging of postings is normally by far the most time-consuming
step of a merge.

Also, no RAM will be used holding the doc mapping, since the docIDs
don't change.
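
One way to read the doc-mapping remark, sketched in plain Java under stated
assumptions (DocMapSketch, buildMap and remapNoDeletes are hypothetical names,
not Lucene APIs): when a segment has deletions, a merge needs a per-document
old-to-new table, but when nothing is deleted every doc just shifts by the
segment's starting offset in the merged segment, so no table has to be held
in RAM.

```java
// Hypothetical illustration only; not part of Lucene.
class DocMapSketch {
    /**
     * Builds an old->new docID table for one segment being merged.
     * docBase is where this segment's docs start in the merged segment;
     * deleted docs map to -1.
     */
    static int[] buildMap(boolean[] live, int docBase) {
        int[] map = new int[live.length];
        int next = docBase;
        for (int i = 0; i < live.length; i++) {
            map[i] = live[i] ? next++ : -1;
        }
        return map;
    }

    /** With no deletions the mapping is a plain offset, so no table is needed. */
    static int remapNoDeletes(int oldDocId, int docBase) {
        return docBase + oldDocId;
    }

    public static void main(String[] args) {
        boolean[] live = {true, false, true, true};
        int[] map = buildMap(live, 100);
        for (int i = 0; i < map.length; i++) {
            System.out.println(i + " -> " + map[i]);
        }
    }
}
```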

These benefits are independent of the MergePolicy.

I think TieredMergePolicy will work fine for append-only; I'm not sure
how you'd improve on its approach.  Because it can merge non-adjacent
segments, it will in general renumber the docs, so if that's a problem,
apps should use LogByteSizeMP, which only merges adjacent segments.
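
The renumbering point can be illustrated with a toy model in plain Java. This
is not how Lucene stores segments: MergeOrderSketch is a hypothetical class,
each int[] is a segment listing docs by the order they were added, and putting
the merged segment in the slot of the first merged one is an assumption made
for the sketch. Merging adjacent segments keeps the index in insertion order;
merging non-adjacent ones does not.

```java
// Hypothetical illustration only; not part of Lucene.
class MergeOrderSketch {
    /**
     * Merges the segments at the given ascending indices. The merged
     * segment takes the slot of the first picked segment (an assumption
     * of this sketch).
     */
    static int[][] merge(int[][] segments, int... picked) {
        int mergedLen = 0;
        for (int p : picked) {
            mergedLen += segments[p].length;
        }
        int[] merged = new int[mergedLen];
        int pos = 0;
        for (int p : picked) {
            System.arraycopy(segments[p], 0, merged, pos, segments[p].length);
            pos += segments[p].length;
        }
        int[][] out = new int[segments.length - picked.length + 1][];
        int o = 0;
        for (int i = 0; i < segments.length; i++) {
            if (i == picked[0]) {
                out[o++] = merged;
            } else if (!contains(picked, i)) {
                out[o++] = segments[i];
            }
        }
        return out;
    }

    static boolean contains(int[] values, int v) {
        for (int x : values) {
            if (x == v) {
                return true;
            }
        }
        return false;
    }

    /** True if reading the index front to back yields docs in the order they were added. */
    static boolean inInsertionOrder(int[][] segments) {
        int prev = -1;
        for (int[] seg : segments) {
            for (int doc : seg) {
                if (doc < prev) {
                    return false;
                }
                prev = doc;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[][] segments = {{0, 1}, {2, 3}, {4, 5}, {6}};
        System.out.println("adjacent merge keeps order: " + inInsertionOrder(merge(segments, 1, 2)));
        System.out.println("non-adjacent merge keeps order: " + inInsertionOrder(merge(segments, 0, 2)));
    }
}
```

If an application relies on docID order matching insertion order, this is the
property it would lose under a policy that is free to pick non-adjacent
segments.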

Mike McCandless

http://blog.mikemccandless.com

Re: MergePolicy for append-only indices?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I think the key optimization when there are no deletions is that you 
don't need to renumber documents and can bulk-copy blocks of contiguous 
documents, and that is independent of merge policy. I think :)

-Mike

On 01/06/2014 01:54 PM, Shawn Heisey wrote:
> On 1/6/2014 11:24 AM, Otis Gospodnetic wrote:
>> (cross-posting to both Solr and Lucene user lists because while this is a
>> Lucene-level question, I suspect a lot of people who know about this or are
>> interested in this subject are actually on the Solr list)
>>
>> I have a large append-only index and I looked at merge policies hoping to
>> identify one that is naturally more suitable for indices without any
>> updates and deletions, just adds.
>>
>> I've read
>> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/TieredMergePolicy.html
>> and the javadocs for its cousins, but it doesn't look like any of them is
>> more suited for append-only index than the other ones and Tiered MP having
>> more knobs is probably the best one to use.....
>>
>> I was wondering if I was missing something, if one of the MPs is in fact
>> better for append-only indices OR if one can suggest how one could write a
>> custom MP that's specialized for append-only indices.
>
> The Tiered policy was made default for Solr back in the 3.x days. 
> Defaults in both Solr and Lucene don't normally change without some 
> serious thought about the repercussions.
>
> As for what's best for different kinds of indexes (add-only vs 
> update/delete) ... unless there are *enormous* numbers of deletions 
> (whether from updates or pure delete requests), I don't think that 
> affects the decision very much.  The Tiered policy seems like it's 
> probably the best choice either way.  I assume you've seen the 
> following blog post?
>
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
> Thanks,
> Shawn
>


Re: MergePolicy for append-only indices?

Posted by Shawn Heisey <so...@elyograg.org>.
On 1/6/2014 11:24 AM, Otis Gospodnetic wrote:
> (cross-posting to both Solr and Lucene user lists because while this is a
> Lucene-level question, I suspect a lot of people who know about this or are
> interested in this subject are actually on the Solr list)
>
> I have a large append-only index and I looked at merge policies hoping to
> identify one that is naturally more suitable for indices without any
> updates and deletions, just adds.
>
> I've read
> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/TieredMergePolicy.html
> and the javadocs for its cousins, but it doesn't look like any of them is
> more suited for append-only index than the other ones and Tiered MP having
> more knobs is probably the best one to use.....
>
> I was wondering if I was missing something, if one of the MPs is in fact
> better for append-only indices OR if one can suggest how one could write a
> custom MP that's specialized for append-only indices.

The Tiered policy was made the default for Solr back in the 3.x days.
Defaults in both Solr and Lucene don't normally change without some
serious thought about the repercussions.

As for what's best for different kinds of indexes (add-only vs 
update/delete) ... unless there are *enormous* numbers of deletions 
(whether from updates or pure delete requests), I don't think that 
affects the decision very much.  The Tiered policy seems like it's 
probably the best choice either way.  I assume you've seen the following 
blog post?

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Thanks,
Shawn