Posted to dev@lucene.apache.org by Mihai Soloi <mi...@gmail.com> on 2012/06/25 17:55:51 UTC

Checksum mismatch in segments file

Hello everybody,

I'm Mihai, a GSoC student, and I'm implementing an HBaseDirectory for
Lucene [1] in order to use it for James mailbox indexing. I've
implemented HIndexOutput/Input, and they persist the segments file
just fine in an HBase table. But when I try to get an IndexWriter from
my directory, it reads the segments_N file and the check in
SegmentInfos fails because the current checksum is different from the
persisted one. I've tried to find a solution but haven't reached one.
Do you guys have any idea why this happens? This is the stack trace:

org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments file (resource: ChecksumIndexInput(anonymous IndexInput))
     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:335)
     at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:182)
     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1168)
     at org.apache.james.mailbox.lucene.hbase.IndexingTest.getWriter(IndexingTest.java:82)
     at org.apache.james.mailbox.lucene.hbase.IndexingTest.testIndexWriter(IndexingTest.java:123)

[1] http://code.google.com/a/apache-extras.org/p/mailbox-lucene-index-hbase/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Checksum mismatch in segments file

Posted by Robert Muir <rc...@gmail.com>.
Just to add more information: if you are trying Lucene 4.x
(http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/), the
rewrite that Mike describes in SegmentInfos has actually been removed.

But you still need to use AppendingCodec there, because the term
dictionary uses this same trick.

On Tue, Jun 26, 2012 at 6:30 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Hmm, the checksum is there to ensure all bits were persisted properly.
>
> But one tricky part is that we first write 4 zero bytes, then seek back
> and write the checksum over those 4 bytes.  Could it be that the HBase
> IndexOutput impl can't handle seeking back and overwriting?
>
> If so, you should have a look at AppendingCodec, which fixes the
> places in Lucene's default codec that seek backwards on write ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com



-- 
lucidimagination.com



Re: Checksum mismatch in segments file

Posted by Mihai Soloi <mi...@gmail.com>.
Hello Mike and Robert,

I am using the stable version of Lucene (i.e. 3.6), and what is actually
going on is that the checksum (a long) is written as 8 bytes: the
first 4 are 0, then the mismatched checksum value (i.e. checksum-1) is
written in the next 4 (reference:
ChecksumIndexOutput.prepareCommit()). When finishCommit() happens, the
correct checksum is written to the buffer, and then on close it's flushed
to the directory.
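
The sequence described above can be sketched without Lucene. This is only an illustration under stated assumptions: the class TwoPhaseChecksumSketch and its in-memory ByteBuffer are hypothetical stand-ins for a seekable IndexOutput, the method names merely mirror prepareCommit()/finishCommit(), and a CRC32 checksum with a big-endian 8-byte trailer is assumed. It is not the code from ChecksumIndexOutput.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical in-memory stand-in for a seekable output; not Lucene code. */
final class TwoPhaseChecksumSketch {
    // ByteBuffer is big-endian by default, so putLong writes the high bytes first.
    private final ByteBuffer buf = ByteBuffer.allocate(1024);
    private final CRC32 crc = new CRC32();

    /** Appends payload bytes and folds them into the running checksum. */
    void writeBytes(byte[] data) {
        buf.put(data);
        crc.update(data, 0, data.length);
    }

    /** Phase 1: write a deliberately wrong value (checksum-1), then seek back over it. */
    void prepareCommit() {
        int pos = buf.position();
        buf.putLong(crc.getValue() - 1);
        buf.position(pos); // the "seek back"
    }

    /** Phase 2: overwrite the placeholder with the real checksum. */
    void finishCommit() {
        buf.putLong(crc.getValue());
    }

    /** Reads the 8-byte value stored at the given offset. */
    long trailerAt(int offset) {
        return buf.getLong(offset);
    }

    /** Checksum of the payload written so far (the trailer is not included). */
    long checksum() {
        return crc.getValue();
    }

    /** Current write position. */
    int position() {
        return buf.position();
    }
}
```

If a directory implementation persisted the bytes produced in phase 1 and never rewrote them after phase 2, a reader would find checksum-1 in the trailer. That is one way a mismatch of this kind could arise, though nothing here shows it is what happens in the HBase code.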

A comment states that this is done for better testing. I've followed the 
code with the debugger and printed out the bytes in the logger and I can 
say that seeking back and overwriting are done as they should be.
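
One way to double-check this independently of the debugger is to recompute the checksum over the bytes read back from the store and compare it with the stored trailer. A minimal sketch, assuming the file ends with an 8-byte big-endian CRC32 value that covers everything before it; the class TrailerCheck and its methods are hypothetical helpers, not part of Lucene or of the HBaseDirectory code.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical diagnostic helper; assumes an 8-byte big-endian CRC32 trailer. */
final class TrailerCheck {
    /** CRC32 over every byte except the last 8. */
    static long crcOfBody(byte[] file) {
        CRC32 crc = new CRC32();
        crc.update(file, 0, file.length - 8);
        return crc.getValue();
    }

    /** The long stored in the last 8 bytes of the file. */
    static long storedTrailer(byte[] file) {
        return ByteBuffer.wrap(file, file.length - 8, 8).getLong();
    }

    /** True if the trailer equals the checksum of the bytes before it. */
    static boolean trailerMatches(byte[] file) {
        if (file.length < 8) {
            return false; // too short to carry a trailer at all
        }
        return storedTrailer(file) == crcOfBody(file);
    }

    /** One-line summary suitable for a log statement. */
    static String describe(byte[] file) {
        if (file.length < 8) {
            return "file shorter than 8 bytes";
        }
        return String.format("stored=%016x computed=%016x",
                storedTrailer(file), crcOfBody(file));
    }
}
```

Calling describe() on the bytes fetched from the HBase table would give a single log line instead of a full byte dump. A stored value that is exactly one less than the computed value would point at the placeholder rather than at corrupted payload bytes.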

You can run the test as 'mvn test 
-Dtest=org.apache.james.mailbox.lucene.hbase.IndexingTest' but there 
will be a lot of byte printing.

I am now looking at the AppendingCodec in version 4 to see whether I can
make better use of that implementation.

Thank you,
Mihai




Re: Checksum mismatch in segments file

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm, the checksum is there to ensure all bits were persisted properly.

But one tricky part is that we first write 4 zero bytes, then seek back
and write the checksum over those 4 bytes.  Could it be that the HBase
IndexOutput impl can't handle seeking back and overwriting?

If so, you should have a look at AppendingCodec, which fixes the
places in Lucene's default codec that seek backwards on write ...
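
To make the concern concrete, here is a small sketch of what a reader would see if the underlying store silently ignored the seek, next to an append-friendly layout that writes the checksum only once. Everything in it is an assumption for illustration: the class AppendOnlySketch is hypothetical, the placeholder is a zero long, and the checksum is a CRC32 stored as a big-endian long. It is not how AppendingCodec is implemented.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical illustration of a store that can only append; not Lucene code. */
final class AppendOnlySketch {
    /** Big-endian encoding of a long. */
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    /** CRC32 of the whole array. */
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Seek is ignored, so the placeholder and the real value are both appended. */
    static byte[] writeWithIgnoredSeek(byte[] body) {
        ByteArrayOutputStream store = new ByteArrayOutputStream();
        store.write(body, 0, body.length);
        store.write(longBytes(0L), 0, 8);          // placeholder
        // a seek back would happen here, but this store cannot move backwards
        store.write(longBytes(crcOf(body)), 0, 8); // lands after the placeholder
        return store.toByteArray();
    }

    /** Append-friendly layout: the checksum is written exactly once, at the end. */
    static byte[] writeAppendOnly(byte[] body) {
        ByteArrayOutputStream store = new ByteArrayOutputStream();
        store.write(body, 0, body.length);
        store.write(longBytes(crcOf(body)), 0, 8);
        return store.toByteArray();
    }

    /** What a reader compares against: the long stored right after the body. */
    static long trailerAfterBody(byte[] file, int bodyLen) {
        return ByteBuffer.wrap(file, bodyLen, 8).getLong();
    }
}
```

In the first layout the reader finds the placeholder right after the body and reports a mismatch, even though every byte was persisted. The second layout never needs to move backwards, which is the general idea behind avoiding seeks on write.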

Mike McCandless

http://blog.mikemccandless.com
