Posted to java-user@lucene.apache.org by Renaud Delbru <re...@deri.org> on 2010/03/25 17:55:07 UTC

Flex API - Debugging Segment Merge

Hi,

I am currently benchmarking various compression algorithms using the Sep 
Codec, but I got an index corruption exception during the merge process, 
and I need your help to debug it.
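
Each algorithm plugs into the fixed int-block format by overriding a 
flushBlock() method that compresses one buffered block of integers. A 
simplified sketch (class and member names are approximate to the flex 
API, and encode() stands in for the actual compression routine):

  public class PForIntBlockIndexOutput extends FixedIntBlockIndexOutput {

    public PForIntBlockIndexOutput(IndexOutput out, int blockSize)
        throws IOException {
      super(out, blockSize);
    }

    @Override
    protected void flushBlock() throws IOException {
      // "buffer" holds the blockSize ints accumulated by the superclass
      byte[] compressed = encode(buffer);
      out.writeVInt(compressed.length);
      out.writeBytes(compressed, compressed.length);
    }
  }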

I have reimplemented various algorithms such as FOR, Simple9, VInt and 
PFor for the Sep IntBlock Codec, and I am now benchmarking them on the 
Wikipedia dataset. With some algorithms (FOR, Simple9, etc.) I don't 
encounter any problems, but with the PFor algorithm I get a 
CorruptIndexException during the merge process (in 
SepPostingsWriterImpl#startDoc) because documents are out of order:

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException:
org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 )
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:435)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 )
        at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:177)

However, this only happens when I index the Wikipedia dataset with the 
PFor algorithm. I have tried to recreate the error in a unit test, 
creating random documents and performing a merge, but in that case the 
error does not appear.

After some debugging, I have noticed that the last document id of a 
segment is the same as (or greater than) the first document id of the 
next segment to be merged. However, even with Codec.DEBUG=true 
activated, I am unable to tell which segments are faulty and which terms 
inside these segments. Could you point me to an easy way to get this 
information, so that I can inspect these segments and their encoded 
blocks in order to find and understand the problem?
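
Would CheckIndex help here? As far as I understand, it walks each segment 
and reports what it is processing when it hits an exception, e.g.

  java org.apache.lucene.index.CheckIndex /path/to/index -segment _0

but I don't know whether it reaches down into the codec's block decoding, 
so this is just a guess on my side.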

Thanks in advance,
-- 
Renaud Delbru


Re: Flex API - Debugging Segment Merge

Posted by Michael McCandless <lu...@mikemccandless.com>.
Phew, I'm glad to hear you got to the bottom of it!  Good sleuthing.

And I'm looking forward to your results and hopefully patches that
make these various encoding techniques available as flex codecs :)

Mike

On Fri, Mar 26, 2010 at 6:08 PM, Renaud Delbru <re...@deri.org> wrote:
> Hi Michael,
>
> On 25/03/10 19:15, Michael McCandless wrote:
>>>
>>> I am using a single thread for indexing: reading the list of Wikipedia
>>> articles sequentially, putting the content into a single field, and
>>> adding the document to the index. A commit is done every 10K documents.
>>>
>>
>> Are you using contrib/benchmark for this?  That makes it very easy to
>> run tests like this... hmm though we need to extend it so you can
>> specify which Codec to use...
>>
>
> No, I have implemented a simple benchmark platform for measuring indexing
> time and query time. But indeed, I saw that you have a Wikipedia extractor;
> that could have saved us some time.
>>
>> You can instrument the code (or catch the exc in a debugger) to see
>> all these details?
>>
>
> Yes, I did that today, and finally got all the information I needed to
> find the problem. It was indeed a bug in my PFor implementation that
> occurred only in very rare cases.
>
> I'll start the query benchmark this weekend. Let's hope I'll have
> something to share next week.
>
> Cheers
> --
> Renaud Delbru
>


Re: Flex API - Debugging Segment Merge

Posted by Renaud Delbru <re...@deri.org>.
Hi Michael,

On 25/03/10 19:15, Michael McCandless wrote:
>> I am using a single thread for indexing: reading the list of Wikipedia
>> articles sequentially, putting the content into a single field, and
>> adding the document to the index. A commit is done every 10K documents.
>>      
> Are you using contrib/benchmark for this?  That makes it very easy to
> run tests like this... hmm though we need to extend it so you can
> specify which Codec to use...
>    
No, I have implemented a simple benchmark platform for measuring 
indexing time and query time. But indeed, I saw that you have a 
Wikipedia extractor; that could have saved us some time.
> You can instrument the code (or catch the exc in a debugger) to see
> all these details?
>    
Yes, I did that today, and finally got all the information I needed to 
find the problem. It was indeed a bug in my PFor implementation that 
occurred only in very rare cases.
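
In case it is useful to others: a brute-force round-trip test along these 
lines should catch this kind of rare-case bug (PFor.encode/decode stand in 
for the implementation under test; the value distribution is arbitrary, 
just skewed so that the large "exception" values PFor must patch are rare):

  Random random = new Random(42);
  int[] block = new int[128];                // one fixed-size block
  for (int iter = 0; iter < 1000000; iter++) {
    int numBits = 1 + random.nextInt(28);    // frame bit width to exercise
    for (int i = 0; i < block.length; i++) {
      // mostly small values, plus a rare large value that must go
      // through PFor's exception mechanism
      block[i] = (random.nextInt(100) == 0)
          ? Integer.MAX_VALUE - random.nextInt(1000)
          : random.nextInt(1 << numBits);
    }
    int[] decoded = PFor.decode(PFor.encode(block));
    if (!Arrays.equals(block, decoded)) {
      throw new AssertionError("round-trip failed at iter=" + iter);
    }
  }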

I'll start the query benchmark this weekend. Let's hope I'll have 
something to share next week.

Cheers
-- 
Renaud Delbru


Re: Flex API - Debugging Segment Merge

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Mar 25, 2010 at 3:04 PM, Renaud Delbru <re...@deri.org> wrote:
> Hi Michael,
>
> On 25/03/10 18:45, Michael McCandless wrote:
>>
>> Hi Renaud,
>>
>> It's great that you're pushing flex forward so much :) You're making
>> some cool sounding codecs!  I'm really looking forward to seeing
>> indexing/searching performance results on Wikipedia...
>>
>
> I'll share them for sure whenever the results are ready ;o).

I'll be waiting eagerly :)

>> It sounds most likely there's a bug in the PFor impl? (Since you don't
>> hit this exception with the others...).
>>
>
> It seems so, but I also find it strange that I cannot reproduce it with
> synthetic data.

Hmmm.

>> During merge, each segment's docIDs are rebased according to how many
>> non-deleted docs there are in all prior segments.  One possibility
>> here is a given segment thought it had N deletions but in fact
>> encountered fewer than N while iterating its docs.  This would cause
>> the next segment to have too-low a base which can cause this exact
>> exception on crossing from one segment to the next.  (Ie the very
> first doc of the next segment will suddenly be <= prior doc(s)).
>>
>> But... if that's happening (ie, bug is in Lucene not in PFor impl),
>> you'd expect the other codecs to hit it too.
>>
>> Are you using multiple threads for indexing?  Are you also mixing in
>> deletions (or updateDocument calls)?
>>
>
> There are no deletions; I just create the index from scratch, and each
> document I am adding has a unique identifier.

Hmmm.

> I am using a single thread for indexing: reading the list of Wikipedia
> articles sequentially, putting the content into a single field, and
> adding the document to the index. A commit is done every 10K documents.

Are you using contrib/benchmark for this?  That makes it very easy to
run tests like this... hmm though we need to extend it so you can
specify which Codec to use...
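
If you do end up trying it, a minimal .alg file for Wikipedia looks 
roughly like this (property and task names are from contrib/benchmark's 
conf/ examples, so double-check them against your checkout; the codec 
hook is the part that's missing today):

  content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
  docs.file=/path/to/enwiki-pages-articles.xml

  CreateIndex
  { AddDoc } : 200000
  CloseIndex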

> I have tried different mergeFactors (2 or 20), but whenever the first
> merge occurs, I get this CorruptIndexException.

Is it that consistent?  Is it always that the docID is == to one prior?
Or is the next docID sometimes < the prior one?  And is it always on
the 1st docID of a new segment?

> I will continue to debug, but if I could at least get the faulty segment
> and the faulty term (or even better, the index of the faulty block), I
> would be able to display the content of the blocks and see if there is a
> problem in the PFor encoding.

You can instrument the code (or catch the exc in a debugger) to see
all these details?
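
E.g. one quick hack is to widen the existing check in
SepPostingsWriterImpl.startDoc so the exception carries the context you
need; something like this temporary debugging patch (fieldName, termText
and segment are illustrative; thread through whatever state your codec
has handy):

  @Override
  public void startDoc(int docID, int termDocFreq) throws IOException {
    if (docID <= lastDocID) {
      // include enough context to locate the faulty segment and term
      throw new CorruptIndexException("docs out of order (" + docID
          + " <= " + lastDocID + ") segment=" + segment
          + " field=" + fieldName + " term=" + termText);
    }
    lastDocID = docID;
    // ... existing method body ...
  }

Alternatively, put a breakpoint on CorruptIndexException's constructor
and inspect the writer's state when it fires.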

Or... if you can post a patch of where you are, I can dig, if I can
repro the issue...

Mike


Re: Flex API - Debugging Segment Merge

Posted by Renaud Delbru <re...@deri.org>.
Hi Michael,

On 25/03/10 18:45, Michael McCandless wrote:
> Hi Renaud,
>
> It's great that you're pushing flex forward so much :) You're making
> some cool sounding codecs!  I'm really looking forward to seeing
> indexing/searching performance results on Wikipedia...
>    
I'll share them for sure whenever the results are ready ;o).
> It sounds most likely there's a bug in the PFor impl? (Since you don't
> hit this exception with the others...).
>    
It seems so, but I also find it strange that I cannot reproduce it with 
synthetic data.
> During merge, each segment's docIDs are rebased according to how many
> non-deleted docs there are in all prior segments.  One possibility
> here is a given segment thought it had N deletions but in fact
> encountered fewer than N while iterating its docs.  This would cause
> the next segment to have too-low a base which can cause this exact
> exception on crossing from one segment to the next.  (Ie the very
> first doc of the next segment will suddenly be <= prior doc(s)).
>
> But... if that's happening (ie, bug is in Lucene not in PFor impl),
> you'd expect the other codecs to hit it too.
>
> Are you using multiple threads for indexing?  Are you also mixing in
> deletions (or updateDocument calls)?
>    
There are no deletions; I just create the index from scratch, and each 
document I am adding has a unique identifier.
I am using a single thread for indexing: reading the list of Wikipedia 
articles sequentially, putting the content into a single field, and 
adding the document to the index. A commit is done every 10K documents.
I have tried different mergeFactors (2 or 20), but whenever the first 
merge occurs, I get this CorruptIndexException.
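
The indexing loop is essentially this (simplified; the analyzer and field 
options are approximations of my actual setup):

  Directory dir = FSDirectory.open(new File("/path/to/index"));
  IndexWriter writer = new IndexWriter(dir,
      new StandardAnalyzer(Version.LUCENE_CURRENT), true,
      IndexWriter.MaxFieldLength.UNLIMITED);

  int count = 0;
  for (String article : articles) {      // sequential Wikipedia articles
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(count),
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", article,
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    if (++count % 10000 == 0) {
      writer.commit();                   // commit every 10K documents
    }
  }
  writer.close();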

I will continue to debug, but if I could at least get the faulty segment 
and the faulty term (or even better, the index of the faulty block), I 
would be able to display the content of the blocks and see if there is a 
problem in the PFor encoding.
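
(Dumping a block once located is the easy part, something like

  int[] decoded = PFor.decode(encodedBlock);  // stand-in for my decoder
  System.out.println("block " + blockIndex + ": "
      + Arrays.toString(decoded));

so the missing piece really is identifying the faulty segment, term and 
block index.)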

Cheers,
-- 
Renaud Delbru


Re: Flex API - Debugging Segment Merge

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hi Renaud,

It's great that you're pushing flex forward so much :) You're making
some cool sounding codecs!  I'm really looking forward to seeing
indexing/searching performance results on Wikipedia...

It sounds most likely there's a bug in the PFor impl? (Since you don't
hit this exception with the others...).

During merge, each segment's docIDs are rebased according to how many
non-deleted docs there are in all prior segments.  One possibility
here is a given segment thought it had N deletions but in fact
encountered fewer than N while iterating its docs.  This would cause
the next segment to have too-low a base which can cause this exact
exception on crossing from one segment to the next.  (Ie the very
first doc of the next segment will suddenly be <= prior doc(s)).
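
A sketch of the rebasing arithmetic (names are illustrative, not the
actual merger code):

  int docBase = 0;
  for (IndexReader reader : readersToMerge) {
    // every docID re-emitted from this segment is shifted by docBase:
    //   newDocID = docBase + oldDocID
    docBase += reader.numDocs();  // counts live (non-deleted) docs only
  }

If a segment's deletion count is over-stated, docBase for the next
segment comes out too low, and that segment's first rebased docID can
land <= the last docID already written, which matches the (153 <= 153)
you're seeing.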

But... if that's happening (ie, bug is in Lucene not in PFor impl),
you'd expect the other codecs to hit it too.

Are you using multiple threads for indexing?  Are you also mixing in
deletions (or updateDocument calls)?

Mike

On Thu, Mar 25, 2010 at 12:55 PM, Renaud Delbru <re...@deri.org> wrote:
> Hi,
>
> I am currently benchmarking various compression algorithms using the Sep
> Codec, but I got an index corruption exception during the merge process,
> and I need your help to debug it.
>
> I have reimplemented various algorithms such as FOR, Simple9, VInt and
> PFor for the Sep IntBlock Codec, and I am now benchmarking them on the
> Wikipedia dataset. With some algorithms (FOR, Simple9, etc.) I don't
> encounter any problems, but with the PFor algorithm I get a
> CorruptIndexException during the merge process (in
> SepPostingsWriterImpl#startDoc) because documents are out of order:
>
> Exception in thread "Lucene Merge Thread #0"
> org.apache.lucene.index.MergePolicy$MergeException:
> org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 )
>         at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:435)
> Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 )
>         at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:177)
>
> However, this only happens when I index the Wikipedia dataset with the
> PFor algorithm. I have tried to recreate the error in a unit test,
> creating random documents and performing a merge, but in that case the
> error does not appear.
>
> After some debugging, I have noticed that the last document id of a
> segment is the same as (or greater than) the first document id of the
> next segment to be merged. However, even with Codec.DEBUG=true activated,
> I am unable to tell which segments are faulty and which terms inside
> these segments. Could you point me to an easy way to get this
> information, so that I can inspect these segments and their encoded
> blocks in order to find and understand the problem?
>
> Thanks in advance,
> --
> Renaud Delbru
>