Posted to dev@lucene.apache.org by Chuck Williams <ch...@manawiz.com> on 2006/12/04 22:15:46 UTC

Efficiently expunging deletions of recently added documents

Hi All,

I'd like to open up the API to mergeSegments() in IndexWriter and am
wondering if there are potential problems with this.

I use ParallelReader and ParallelWriter (in JIRA) extensively as these
provide the basis for fast bulk updates of small metadata fields. 
ParallelReader requires that the subindexes be strictly synchronized by
matching doc ids.  The thorniest problem arises when writing a new
document (with ParallelWriter) generates an exception in some of the
subindexes but not others, as this leaves the subindexes out of sync.

I have recovery for this now that works by deleting the successfully
added subdocuments that are parallel to any unsuccessful subdocument and
then optimizing to expunge the unsuccessful doc-id from those segments
where it had been added.  Optimization is prohibitively expensive for
large indexes, and unnecessary for this recovery.
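
As a rough illustration of the recovery described above, here is a toy
in-memory model. Everything in it is hypothetical: the SubIndex class, the
addParallel() helper, and the use of a null subdocument to simulate a failed
add are stand-ins for illustration, not the ParallelWriter API from JIRA.
The expunge() step plays the role that optimizing plays in the real
recovery, where it costs a rewrite of the whole index.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the delete-then-expunge recovery; not the ParallelWriter API. */
final class ParallelRecoverySketch {

    /** Hypothetical in-memory subindex: a doc id is simply a list position. */
    static final class SubIndex {
        final List<String> docs = new ArrayList<>();
        final List<Boolean> deleted = new ArrayList<>();

        /** Appends a subdocument and returns its doc id; null simulates a failure. */
        int add(String subDoc) {
            if (subDoc == null) {
                throw new IllegalArgumentException("simulated add failure");
            }
            docs.add(subDoc);
            deleted.add(false);
            return docs.size() - 1;
        }

        /** Marks a doc as deleted; its doc id stays occupied until expunge(). */
        void delete(int docId) {
            deleted.set(docId, true);
        }

        /** Physically drops deleted docs so that later doc ids shift down. */
        void expunge() {
            for (int i = docs.size() - 1; i >= 0; i--) {
                if (deleted.get(i)) {
                    docs.remove(i);
                    deleted.remove(i);
                }
            }
        }
    }

    /**
     * Adds one subdocument to each subindex. If any add fails, the subdocuments
     * that were added successfully are deleted and expunged, so every subindex
     * ends up with the same doc ids again. Returns true if all adds succeeded.
     */
    static boolean addParallel(List<SubIndex> subIndexes, List<String> subDocs) {
        List<SubIndex> succeeded = new ArrayList<>();
        List<Integer> assignedIds = new ArrayList<>();
        boolean failed = false;
        for (int i = 0; i < subIndexes.size(); i++) {
            try {
                assignedIds.add(subIndexes.get(i).add(subDocs.get(i)));
                succeeded.add(subIndexes.get(i));
            } catch (RuntimeException e) {
                failed = true;
            }
        }
        if (!failed) {
            return true;
        }
        for (int i = 0; i < succeeded.size(); i++) {
            succeeded.get(i).delete(assignedIds.get(i));
            succeeded.get(i).expunge();
        }
        return false;
    }
}
```

Without the expunge() call, the subindex that accepted the subdocument would
keep an extra doc id, and every later document would land at different ids
in different subindexes.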

A much better solution is to have an API in IndexWriter to expunge a
given set of deleted doc ids.  This could merge only enough recent
segments to fully encompass the specified docs, which in this case is
not much since they will be recently added.  The result should be an
orders-of-magnitude performance improvement in the recovery.
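
One way to read "merge only enough recent segments" is sketched below. It
assumes that doc ids are assigned in segment order and that each segment's
doc count still includes its deleted docs; the class and method names are
made up for illustration and are not part of IndexWriter. The sketch only
picks which segments would be merged and does not consider any merge-policy
constraints.

```java
import java.util.Arrays;

/** Hypothetical helper: chooses the suffix of segments an expunge would merge. */
final class ExpungeRange {

    /**
     * Returns the index of the oldest segment that must take part in the merge
     * so that the merged suffix (that segment through the newest one) contains
     * every doc id in deletedDocIds. Returns segmentDocCounts.length when there
     * is nothing to expunge.
     *
     * @param segmentDocCounts doc counts per segment, oldest first, deleted docs included
     * @param deletedDocIds    index-wide doc ids to expunge
     */
    static int firstSegmentToMerge(int[] segmentDocCounts, int[] deletedDocIds) {
        if (deletedDocIds.length == 0) {
            return segmentDocCounts.length;
        }
        int minDeleted = Arrays.stream(deletedDocIds).min().getAsInt();
        if (minDeleted < 0) {
            throw new IllegalArgumentException("negative doc id " + minDeleted);
        }
        int docBase = 0; // first doc id of the current segment
        for (int i = 0; i < segmentDocCounts.length; i++) {
            int nextBase = docBase + segmentDocCounts[i];
            if (minDeleted < nextBase) {
                return i; // the smallest deleted id lives in segment i
            }
            docBase = nextBase;
        }
        throw new IllegalArgumentException(
                "doc id " + minDeleted + " is beyond the last segment");
    }
}
```

For segments of 1000, 100, 10 and 10 docs and deleted doc ids 1105 and 1112,
firstSegmentToMerge() returns 2, so only the two newest segments (20 docs)
would be rewritten instead of all 1120.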

I'm planning to make this change and submit a patch for it unless I've
missed something that somebody can point out.  At the same time, I'll
update the ParallelWriter submission as there are a number of bug fixes
plus a substantial general (non-recovery-case) performance improvement
I've just identified and am about to implement.

Thanks for any thoughts, suggestions, or problems you can point out.

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Efficiently expunging deletions of recently added documents

Posted by Chuck Williams <ch...@manawiz.com>.
Thanks Ning.  This is all very helpful.  I'll make sure to be consistent
with the new merge policy and its invariant conditions.

Chuck


Ning Li wrote on 12/05/2006 08:01 AM:
> An old issue (http://issues.apache.org/jira/browse/LUCENE-325 new
> method expungeDeleted() added to IndexWriter) requested a similar
> functionality as described in the latter half of your email.
>
> The patch for that issue breaks the invariants of the new merge
> policy. An algorithm similar to that of addIndexesNoOptimize()
> (http://issues.apache.org/jira/browse/LUCENE-528 Optimization for
> IndexWriter.addIndexes()) would solve the problem.
>
> Ning
>
> On 12/5/06, Ning Li <ni...@gmail.com> wrote:
>> > I'd like to open up the API to mergeSegments() in IndexWriter and am
>> > wondering if there are potential problems with this.
>>
>> I'm worried that opening up mergeSegments() could easily break the
>> invariants currently guaranteed by the new merge
>> policy(http://issues.apache.org/jira/browse/LUCENE-672).
>>
>> The two invariants say that if M does not change and segment doc count
>> is not reaching maxMergeDocs:
>> B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B)))
>> 1: If i (left*) and i+1 (right*) are two consecutive segments of doc
>> counts x and y, then f(x) >= f(y).
>> 2: The number of committed segments on the same level (f(n)) <= M.
>>
>> Ning
>>




Re: Efficiently expunging deletions of recently added documents

Posted by Ning Li <ni...@gmail.com>.
An old issue (http://issues.apache.org/jira/browse/LUCENE-325 new
method expungeDeleted() added to IndexWriter) requested functionality
similar to what the latter half of your email describes.

The patch for that issue breaks the invariants of the new merge
policy. An algorithm similar to that of addIndexesNoOptimize()
(http://issues.apache.org/jira/browse/LUCENE-528 Optimization for
IndexWriter.addIndexes()) would solve the problem.

Ning

On 12/5/06, Ning Li <ni...@gmail.com> wrote:
> > I'd like to open up the API to mergeSegments() in IndexWriter and am
> > wondering if there are potential problems with this.
>
> I'm worried that opening up mergeSegments() could easily break the
> invariants currently guaranteed by the new merge
> policy(http://issues.apache.org/jira/browse/LUCENE-672).
>
> The two invariants say that if M does not change and segment doc count
> is not reaching maxMergeDocs:
> B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B)))
> 1: If i (left*) and i+1 (right*) are two consecutive segments of doc
> counts x and y, then f(x) >= f(y).
> 2: The number of committed segments on the same level (f(n)) <= M.
>
> Ning
>



Re: Efficiently expunging deletions of recently added documents

Posted by Ning Li <ni...@gmail.com>.
> I'd like to open up the API to mergeSegments() in IndexWriter and am
> wondering if there are potential problems with this.

I'm worried that opening up mergeSegments() could easily break the
invariants currently guaranteed by the new merge
policy (http://issues.apache.org/jira/browse/LUCENE-672).

The two invariants hold as long as M does not change and no segment's doc
count reaches maxMergeDocs. With M for mergeFactor, B for maxBufferedDocs,
and f(n) defined as ceil(log_M(ceil(n/B))):
1: If i (left*) and i+1 (right*) are two consecutive segments of doc
counts x and y, then f(x) >= f(y).
2: The number of committed segments on the same level (f(n)) <= M.
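
The two invariants above can be checked mechanically. The following is a
minimal sketch, not code from Lucene or from the LUCENE-672 patch: the class
and method names are invented for illustration, it assumes mergeFactor >= 2
and doc counts >= 1, and it ignores the maxMergeDocs exception.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative checker for the two merge-policy invariants; not Lucene code. */
final class MergeInvariants {

    /** f(n) = ceil(log_M(ceil(n / B))), computed with integer arithmetic. */
    static int level(int docCount, int mergeFactor, int maxBufferedDocs) {
        // ceil(n / B): how many buffer-sized units the segment spans
        long units = (docCount + (long) maxBufferedDocs - 1) / maxBufferedDocs;
        int level = 0;
        long capacity = 1; // M^level
        while (capacity < units) {
            capacity *= mergeFactor;
            level++;
        }
        return level;
    }

    /**
     * Checks both invariants for segments listed oldest (left) to newest (right).
     * Invariant 1: levels never increase from left to right.
     * Invariant 2: no level holds more than mergeFactor segments.
     */
    static boolean holds(int[] segmentDocCounts, int mergeFactor, int maxBufferedDocs) {
        Map<Integer, Integer> segmentsPerLevel = new HashMap<>();
        int previousLevel = Integer.MAX_VALUE;
        for (int docCount : segmentDocCounts) {
            int current = level(docCount, mergeFactor, maxBufferedDocs);
            if (current > previousLevel) {
                return false; // invariant 1 violated
            }
            if (segmentsPerLevel.merge(current, 1, Integer::sum) > mergeFactor) {
                return false; // invariant 2 violated
            }
            previousLevel = current;
        }
        return true;
    }
}
```

With M = 10 and B = 10, segments of 1000, 100, 100, 10 and 10 docs sit at
levels 2, 1, 1, 0, 0 and satisfy both invariants. If deletions were expunged
from a middle segment so that the counts became 1000, 8, 100, the levels
would be 2, 0, 1 and invariant 1 would no longer hold, which is the kind of
breakage an unrestricted merge could cause.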

Ning
