Posted to dev@lucene.apache.org by Shawn Heisey <ap...@elyograg.org> on 2016/08/10 19:57:53 UTC

How hard would a "wipe all deletes" operation be?

My question is in the context of Solr, but I think it would probably be
best implemented in Lucene, for the benefit of all Lucene-based
software.  I'm describing it here to decide whether I should raise an issue.

I'm after something that would simply rewrite any segment containing
deleted documents, without actually merging the segments.  It would be
*like* a merge, except that it would usually merge one segment to one
segment, instead of many to one.

If the deleted documents are evenly scattered across the whole index
(shard), simply doing forceMerge might be just as efficient, assuming
disk space is not a concern.  A use case with highly-bunched deletes and
a relatively large number of segments would only need to work on some of
the segments, and would complete faster.  I suspect that bunched deletes
are probably common in actual user indexes, at least for the ones where
most deletes are related to document updates.
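A rough way to see why bunched deletes favor a one-to-one rewrite is to compare how many bytes each approach has to read. The sketch below is only a model: SegmentStat and RewriteCostEstimate are hypothetical names, not Lucene classes, the segment sizes are made up, and it assumes a forceMerge down to one segment reads every segment while a one-to-one rewrite reads only the segments that contain deletes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical cost model; none of these types exist in Lucene. */
final class RewriteCostEstimate {

    /** Made-up per-segment statistics: size on disk and deleted doc count. */
    static final class SegmentStat {
        final String name;
        final long sizeBytes;
        final int deletedDocs;

        SegmentStat(String name, long sizeBytes, int deletedDocs) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.deletedDocs = deletedDocs;
        }
    }

    /** Assumption: a forceMerge to one segment reads every segment. */
    static long forceMergeBytes(List<SegmentStat> segments) {
        long total = 0;
        for (SegmentStat s : segments) {
            total += s.sizeBytes;
        }
        return total;
    }

    /** A one-to-one rewrite only reads segments that hold deleted docs. */
    static long selectiveRewriteBytes(List<SegmentStat> segments) {
        long total = 0;
        for (SegmentStat s : segments) {
            if (s.deletedDocs > 0) {
                total += s.sizeBytes;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        final long gb = 1L << 30;
        // Illustrative layout: three large clean segments and two small
        // recent segments where the deletes are bunched.
        List<SegmentStat> index = Arrays.asList(
            new SegmentStat("_a", 10 * gb, 0),
            new SegmentStat("_b", 10 * gb, 0),
            new SegmentStat("_c", 10 * gb, 0),
            new SegmentStat("_d", 2 * gb, 40000),
            new SegmentStat("_e", 1 * gb, 15000));

        System.out.println("forceMerge reads (GB): " + forceMergeBytes(index) / gb);
        System.out.println("selective rewrite reads (GB): " + selectiveRewriteBytes(index) / gb);
    }
}
```

If the deletes were instead spread across all five segments, both numbers would be the same, which is the evenly-scattered case where forceMerge is just as efficient.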

I don't know what this operation would be called.  I can start the
bikeshedding with something like wipeDeletes.  Using expungeDeletes
would be awesome, but this name is already used as a parameter for
another operation, at least in Solr.

I can imagine two methods, one which has no arguments and one that takes
two float percentage thresholds.

For the second method, the thresholds would control what happens when the
space used by segments with deletes is above or below each threshold.
The first threshold, which might be called "mergeThreshold", would merge
the segments with deletes into a single segment IF the space used by the
segments with deletes is less than or equal to that percentage of the
whole index.  The second threshold, which might be called
"forceMergeThreshold", would change the request into a forceMerge if the
amount of space used by the segments with deletes is greater than or
equal to that percentage of the whole index.

The no-arg method could go two ways:  Either it *only* rewrites segments
one to one (maybe calling the other method with Float.MIN_VALUE for both
arguments), or it assigns reasonable default values to the two
thresholds, perhaps 30 and 90 percent.
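The two-threshold decision described above can be sketched in plain Java. This is only an illustration of the proposal, not an existing Lucene or Solr API: WipeDeletesPlanner, Action, and plan are hypothetical names, the no-arg variant shown is the one that uses the suggested 30 and 90 percent defaults, and checking the forceMerge threshold before the merge threshold is an assumption, since the proposal does not say which one wins when both match.

```java
/** Hypothetical planner for the proposed operation; not a Lucene API. */
final class WipeDeletesPlanner {

    enum Action { NOTHING, REWRITE_ONE_TO_ONE, MERGE_DELETED_SEGMENTS, FORCE_MERGE }

    // Defaults taken from the "perhaps 30 and 90 percent" suggestion.
    static final float DEFAULT_MERGE_THRESHOLD = 30f;
    static final float DEFAULT_FORCE_MERGE_THRESHOLD = 90f;

    /**
     * Decide what to do, given the space used by segments that contain
     * deletes and the total index size. Thresholds are percentages (0-100).
     */
    static Action plan(long bytesInSegmentsWithDeletes, long totalIndexBytes,
                       float mergeThreshold, float forceMergeThreshold) {
        if (totalIndexBytes <= 0 || bytesInSegmentsWithDeletes <= 0) {
            return Action.NOTHING; // no deletes anywhere, nothing to rewrite
        }
        double pct = 100.0 * bytesInSegmentsWithDeletes / totalIndexBytes;

        // Assumption: the forceMerge check takes precedence.
        if (pct >= forceMergeThreshold) {
            return Action.FORCE_MERGE;
        }
        if (pct <= mergeThreshold) {
            return Action.MERGE_DELETED_SEGMENTS;
        }
        // Between the two thresholds: rewrite each segment on its own.
        return Action.REWRITE_ONE_TO_ONE;
    }

    /** Variant of the no-arg idea that applies the default thresholds. */
    static Action plan(long bytesInSegmentsWithDeletes, long totalIndexBytes) {
        return plan(bytesInSegmentsWithDeletes, totalIndexBytes,
                    DEFAULT_MERGE_THRESHOLD, DEFAULT_FORCE_MERGE_THRESHOLD);
    }

    public static void main(String[] args) {
        System.out.println(plan(10, 100)); // small share of the index has deletes
        System.out.println(plan(50, 100)); // between the two thresholds
        System.out.println(plan(95, 100)); // nearly every segment has deletes
    }
}
```

With this ordering, a forceMergeThreshold at or below the mergeThreshold makes every non-empty case a forceMerge, so a real implementation would probably want to validate the two values against each other.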

On my dev server, optimizing a 33GB index shard takes over 3500 seconds
-- close to an hour.  I only do the optimize (forceMerge in Lucene) to
clean out deletes so they don't accumulate.  Any performance increase
that I obtain is a nice bonus -- not the reason for the optimize.

I would expect the operation I am describing here to take a fraction of
that time, if it is run on an index that has never been optimized.  My
TMP settings are roughly equivalent to a mergeFactor of 35.  I have the
potential for many segments.

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">35</int>
  <int name="segmentsPerTier">35</int>
  <int name="maxMergeAtOnceExplicit">105</int>
</mergePolicy>

Most of my deletes are concentrated in the most recently added
documents.  Normal merging will eliminate some of them, and most of what
is left will be in the first tier of merged segments, which should be
pretty small.  Getting rid of deleted documents should be very efficient
on my indexes with this operation.

Thanks,
Shawn


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: How hard would a "wipe all deletes" operation be?

Posted by David Smiley <da...@gmail.com>.
Good points Jeff.

On Tue, Aug 16, 2016 at 12:37 PM Jeff Wartes <jw...@whitepages.com> wrote:

>
> Looks to me like if you’re using TieredMergePolicy,
> forceMergeDeletesPctAllowed is simply a way of including a segment as a
> merge candidate that would not have otherwise been a candidate (due to
> being a different tier). A segment becomes a candidate when its percentage
> of deleted documents exceeds that value, so setting it to zero means any
> segment with a delete is a candidate, and setting it to 100 means never
> including a segment as a merge candidate due to delete pct.
>
> Related, TieredMergePolicy.reclaimDeletesWeight alters the likelihood that
> two candidate segments will be selected for a merge based on delete ratio.
>
> Related, sending a commit with expungeDeletes=true simply causes
> TieredMergePolicy to use the forceMergeDeletesPctAllowed.
>
> Nothing here really answers Shawn’s use case though, since in all cases
> you’re still required to merge segments to remove deletes. Clearly, we
> could set a very low forceMergeDeletesPctAllowed and a very high
> reclaimDeletesWeight, but also clearly, if we really wanted to remove all
> deletes, an optimize with maxSegments=1 would do the job. It’s just
> expensive, and it’ll be that expensive again as soon as you get more
> deletes.
>
> In my mind, I’d want to say “if deletes are more than x% of a given
> segment’s size, just re-write the segment while filtering out the deletes.”
> The only reason I might care about the merge policy is that I might prefer
> to do this only on the larger/longer-lived segments that don’t get merged
> often.
>
> *From: *David Smiley <da...@gmail.com>
> *Reply-To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
> *Date: *Monday, August 15, 2016 at 8:56 PM
> *To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
> *Subject: *Re: How hard would a "wipe all deletes" operation be?
>
>
>
> Shawn:
>
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
> Search for "expungeDeletes".  So it's 10% by default.  I did some digging
> and I see this 10% figure is settable on the TieredMergePolicy.  So you
> could modify solrconfig.xml and set
> <double name="forceMergeDeletesPctAllowed">0.0</double> as a
> setting on the merge policy.
>
>
>
> On Thu, Aug 11, 2016 at 2:59 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
> On 8/11/2016 10:58 AM, David Smiley wrote:
> > Note there is a threshold to expungeDeletes such that if there aren't
> > enough deletes in a segment relative to the docs in that segment, it
> > won't do any expunging.
>
> What sort of request do I need to send to force expunging *all* deleted
> documents, even if there's only one in a segment?  Is that possible?
>
> That's the end goal of all this scheming.
>
> Thanks,
> Shawn
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: How hard would a "wipe all deletes" operation be?

Posted by Jeff Wartes <jw...@whitepages.com>.
Looks to me like if you’re using TieredMergePolicy, forceMergeDeletesPctAllowed is simply a way of including a segment as a merge candidate that would not have otherwise been a candidate (due to being a different tier). A segment becomes a candidate when its percentage of deleted documents exceeds that value, so setting it to zero means any segment with a delete is a candidate, and setting it to 100 means never including a segment as a merge candidate due to delete pct.

Related, TieredMergePolicy.reclaimDeletesWeight alters the likelihood that two candidate segments will be selected for a merge based on delete ratio.
Related, sending a commit with expungeDeletes=true simply causes TieredMergePolicy to use the forceMergeDeletesPctAllowed.

Nothing here really answers Shawn’s use case though, since in all cases you’re still required to merge segments to remove deletes. Clearly, we could set a very low forceMergeDeletesPctAllowed and a very high reclaimDeletesWeight, but also clearly, if we really wanted to remove all deletes, an optimize with maxSegments=1 would do the job. It’s just expensive, and it’ll be that expensive again as soon as you get more deletes.

In my mind, I’d want to say “if deletes are more than x% of a given segment’s size, just re-write the segment while filtering out the deletes.” The only reason I might care about the merge policy is that I might prefer to do this only on the larger/longer-lived segments that don’t get merged often.
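The per-segment rule described above could be sketched as a simple filter. This is a plain-Java model, not Lucene code: RewriteCandidates and SegmentModel are hypothetical names, the deleted-document ratio (deleted docs over maxDoc) stands in for "x% of a given segment's size", and minSizeBytes is an assumed knob for the "only the larger/longer-lived segments" preference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical selection of segments to rewrite one-to-one; not Lucene code. */
final class RewriteCandidates {

    /** Made-up stand-in for per-segment statistics. */
    static final class SegmentModel {
        final String name;
        final int maxDoc;      // all docs in the segment, including deleted ones
        final int delCount;    // deleted docs in the segment
        final long sizeBytes;  // size on disk

        SegmentModel(String name, int maxDoc, int delCount, long sizeBytes) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.delCount = delCount;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Percentage of the segment's documents that are deleted. */
    static double pctDeleted(SegmentModel s) {
        return s.maxDoc == 0 ? 0.0 : 100.0 * s.delCount / s.maxDoc;
    }

    /**
     * Names of segments whose deleted percentage is strictly greater than
     * minPctDeleted and whose size is at least minSizeBytes.
     */
    static List<String> select(List<SegmentModel> segments,
                               double minPctDeleted, long minSizeBytes) {
        List<String> selected = new ArrayList<>();
        for (SegmentModel s : segments) {
            if (pctDeleted(s) > minPctDeleted && s.sizeBytes >= minSizeBytes) {
                selected.add(s.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SegmentModel> index = Arrays.asList(
            new SegmentModel("_a", 1000, 5, 100L),     // 0.5% deleted
            new SegmentModel("_b", 1000, 200, 5000L),  // 20% deleted, large
            new SegmentModel("_c", 100, 50, 10L));     // 50% deleted, tiny

        // Only large segments with more than 10% deletes.
        System.out.println(select(index, 10.0, 1000L));
        // Every segment with at least one delete, regardless of size.
        System.out.println(select(index, 0.0, 0L));
    }
}
```

Each selected segment would then be rewritten on its own with the deleted documents filtered out, rather than being merged with other segments.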



From: David Smiley <da...@gmail.com>
Reply-To: "dev@lucene.apache.org" <de...@lucene.apache.org>
Date: Monday, August 15, 2016 at 8:56 PM
To: "dev@lucene.apache.org" <de...@lucene.apache.org>
Subject: Re: How hard would a "wipe all deletes" operation be?

Shawn:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers  Search for "expungeDeletes".  So it's 10% by default.  I did some digging and I see this 10% figure is settable on the TieredMergePolicy.  So you could modify solrconfig.xml and set <double name="forceMergeDeletesPctAllowed">0.0</double> as a setting on the merge policy.

On Thu, Aug 11, 2016 at 2:59 PM Shawn Heisey <ap...@elyograg.org> wrote:
On 8/11/2016 10:58 AM, David Smiley wrote:
> Note there is a threshold to expungeDeletes such that if there aren't
> enough deletes in a segment relative to the docs in that segment, it
> won't do any expunging.

What sort of request do I need to send to force expunging *all* deleted
documents, even if there's only one in a segment?  Is that possible?

That's the end goal of all this scheming.

Thanks,
Shawn



Re: How hard would a "wipe all deletes" operation be?

Posted by David Smiley <da...@gmail.com>.
Shawn:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
Search for "expungeDeletes".  So it's 10% by default.  I did some digging
and I see this 10% figure is settable on the TieredMergePolicy.  So you
could modify solrconfig.xml and set
<double name="forceMergeDeletesPctAllowed">0.0</double> as a
setting on the merge policy.

On Thu, Aug 11, 2016 at 2:59 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 8/11/2016 10:58 AM, David Smiley wrote:
> > Note there is a threshold to expungeDeletes such that if there aren't
> > enough deletes in a segment relative to the docs in that segment, it
> > won't do any expunging.
>
> What sort of request do I need to send to force expunging *all* deleted
> documents, even if there's only one in a segment?  Is that possible?
>
> That's the end goal of all this scheming.
>
> Thanks,
> Shawn
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: How hard would a "wipe all deletes" operation be?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/11/2016 10:58 AM, David Smiley wrote:
> Note there is a threshold to expungeDeletes such that if there aren't
> enough deletes in a segment relative to the docs in that segment, it
> won't do any expunging.

What sort of request do I need to send to force expunging *all* deleted
documents, even if there's only one in a segment?  Is that possible?

That's the end goal of all this scheming.

Thanks,
Shawn




Re: How hard would a "wipe all deletes" operation be?

Posted by David Smiley <da...@gmail.com>.
Note there is a threshold to expungeDeletes such that if there aren't
enough deletes in a segment relative to the docs in that segment, it won't
do any expunging.

On Thu, Aug 11, 2016 at 11:51 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 8/11/2016 6:27 AM, Michael McCandless wrote:
> > You could explore this idea using a custom MergePolicy?
> >
> > That would let you run perf. tests on the 35 GB index, comparing
> > forceMerge, expungeDeletes (TMP), and expungeDeletes (your new
> > MergePolicy).
>
> That's an interesting idea.  I can attempt that, but I must admit up
> front that I might not be able to figure out how to write it.
>
> Somebody asked me privately how the idea I'm proposing would be
> different than the existing expungeDeletes functionality.  I don't
> really know how it would be different, but I do know that when I send a
> commit with expungeDeletes set to true, the number of deletes in the
> index hasn't changed when it finishes, which happens just as quickly as
> a regular commit.  That's why I started thinking about this approach.
>
> I've seen occasional questions on the Solr list and in the IRC channel
> about how to get rid of deleted documents without waiting for a full
> optimize/forceMerge, so I think there's a demand for the functionality.
>
> Thanks,
> Shawn
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: How hard would a "wipe all deletes" operation be?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/11/2016 6:27 AM, Michael McCandless wrote:
> You could explore this idea using a custom MergePolicy?
>
> That would let you run perf. tests on the 35 GB index, comparing
> forceMerge, expungeDeletes (TMP), and expungeDeletes (your new
> MergePolicy).

That's an interesting idea.  I can attempt that, but I must admit up
front that I might not be able to figure out how to write it.

Somebody asked me privately how the idea I'm proposing would be
different than the existing expungeDeletes functionality.  I don't
really know how it would be different, but I do know that when I send a
commit with expungeDeletes set to true, the number of deletes in the
index hasn't changed when it finishes, which happens just as quickly as
a regular commit.  That's why I started thinking about this approach.

I've seen occasional questions on the Solr list and in the IRC channel
about how to get rid of deleted documents without waiting for a full
optimize/forceMerge, so I think there's a demand for the functionality.

Thanks,
Shawn




Re: How hard would a "wipe all deletes" operation be?

Posted by Michael McCandless <lu...@mikemccandless.com>.
You could explore this idea using a custom MergePolicy?

That would let you run perf. tests on the 35 GB index, comparing
forceMerge, expungeDeletes (TMP), and expungeDeletes (your new MergePolicy).

Mike McCandless

http://blog.mikemccandless.com