Posted to dev@lucene.apache.org by Mihai Soloi <mi...@gmail.com> on 2012/06/25 17:55:51 UTC
Checksum mismatch in segments file
Hello everybody,
I'm Mihai, a GSoC student, and I'm implementing an HBaseDirectory for
Lucene [1] in order to use it for James mailbox indexing. I've
implemented HIndexOutput/Input, and they persist the segments file
just fine in an HBase table. But when I try to get an IndexWriter from
my directory, it reads the segments_N file and the check in
SegmentInfos fails because the computed checksum differs from the
persisted one. I've tried to find a solution but haven't found one. Do
you guys have any idea why this happens? This is the stack trace:
org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments file (resource: ChecksumIndexInput(anonymous IndexInput))
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:335)
    at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:182)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1168)
    at org.apache.james.mailbox.lucene.hbase.IndexingTest.getWriter(IndexingTest.java:82)
    at org.apache.james.mailbox.lucene.hbase.IndexingTest.testIndexWriter(IndexingTest.java:123)
[1] http://code.google.com/a/apache-extras.org/p/mailbox-lucene-index-hbase/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Checksum mismatch in segments file
Posted by Robert Muir <rc...@gmail.com>.
Just to add more information: if you are trying Lucene 4.x
(http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/), the
rewrite that Mike describes in SegmentInfos has actually been removed.
But you still need to use AppendingCodec there, because the term
dictionary uses the same trick.
On Tue, Jun 26, 2012 at 6:30 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Hmm, the checksum is there to ensure all bits were persisted properly.
>
> But one tricky part is that we first write four 0 bytes, then seek
> back and write the checksum over those 4 bytes. Could it be that the
> HBase IndexOutput impl can't handle seeking back and overwriting?
>
> If so, you should have a look at AppendingCodec, which fixes the
> places in Lucene's default codec that seek backwards on write ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
--
lucidimagination.com
Re: Checksum mismatch in segments file
Posted by Mihai Soloi <mi...@gmail.com>.
Hello Mike and Robert,
I am using the stable version of Lucene (i.e. 3.6), and what is actually
going on is that the checksum (a long) is written as 8 bytes: the
first 4 are 0, and the value itself goes in the next 4. In
ChecksumIndexOutput.prepareCommit() a deliberately mismatched value
(i.e. checksum-1) is written first. When finishCommit() happens, the
correct checksum is written to the buffer, and on close it is flushed
to the directory.
A comment in the code states that this is done for better testing. I've
followed the code with the debugger and printed out the bytes in the
logger, and I can say that seeking back and overwriting are done as they
should be.
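For readers following along, the two-step write described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene's code: the class TwoPhaseChecksumSketch and its methods are hypothetical, a ByteBuffer stands in for the IndexOutput, and java.util.zip.CRC32 is assumed as the checksum. Because a CRC32 value fits in 32 bits, the first 4 of the 8 trailing bytes come out as 0, which matches the byte layout observed in the debugger.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch of the prepare/finish checksum pattern; not Lucene code. */
class TwoPhaseChecksumSketch {

    /** Writes the payload, a deliberately wrong checksum, then overwrites it with the real one. */
    static byte[] writeWithTwoPhaseCommit(byte[] payload) {
        ByteBuffer out = ByteBuffer.allocate(payload.length + Long.BYTES);
        out.put(payload);

        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        long checksum = crc.getValue();

        // Phase 1 (prepare): remember the position, write checksum - 1, seek back.
        int pos = out.position();
        out.putLong(checksum - 1);
        out.position(pos);

        // Phase 2 (finish): overwrite the mismatched value with the real checksum.
        out.putLong(checksum);
        return out.array();
    }

    /** Recomputes the CRC over everything except the trailing 8 bytes and compares. */
    static boolean verify(byte[] file) {
        int payloadLength = file.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);
        long stored = ByteBuffer.wrap(file, payloadLength, Long.BYTES).getLong();
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = writeWithTwoPhaseCommit("segments".getBytes(StandardCharsets.UTF_8));
        System.out.println("checksum matches: " + verify(file));
    }
}
```

If the second write lands anywhere other than over the first one, verify() reads checksum-1 (or unrelated bytes) and reports a mismatch.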
You can run the test as 'mvn test
-Dtest=org.apache.james.mailbox.lucene.hbase.IndexingTest' but there
will be a lot of byte printing.
I am now looking at the AppendingCodec in version 4, to see whether I
can make better use of that implementation.
Thank you,
Mihai
Re: Checksum mismatch in segments file
Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm, the checksum is there to ensure all bits were persisted properly.
But one tricky part is that we first write four 0 bytes, then seek
back and write the checksum over those 4 bytes. Could it be that the
HBase IndexOutput impl can't handle seeking back and overwriting?
If so, you should have a look at AppendingCodec, which fixes the
places in Lucene's default codec that seek backwards on write ...
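One way to test the hypothesis above in isolation is a small round trip: write a placeholder, seek back, overwrite it, and compare what was stored. The sketch below is only an illustration. SeekableOutput is a hypothetical, stripped-down interface (not Lucene's IndexOutput), and the two implementations are made-up stand-ins for a store that honors seeks and one that silently appends instead.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical round-trip check for seek-back-and-overwrite support. */
class SeekOverwriteCheck {

    /** Minimal stand-in for an output abstraction; an assumption, not a Lucene type. */
    interface SeekableOutput {
        void writeByte(byte b);
        void seek(int pos);
        byte[] contents();
    }

    /** Honors seek: writes after a seek overwrite the bytes already stored. */
    static final class OverwritingOutput implements SeekableOutput {
        private byte[] buf = new byte[16];
        private int pos = 0;
        private int length = 0;

        public void writeByte(byte b) {
            if (pos == buf.length) buf = Arrays.copyOf(buf, buf.length * 2);
            buf[pos++] = b;
            length = Math.max(length, pos);
        }

        public void seek(int newPos) {
            if (newPos < 0 || newPos > length) throw new IllegalArgumentException("bad position");
            pos = newPos;
        }

        public byte[] contents() { return Arrays.copyOf(buf, length); }
    }

    /** Ignores seek: every write is appended, as an append-only store would behave. */
    static final class AppendOnlyOutput implements SeekableOutput {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        public void writeByte(byte b) { out.write(b); }
        public void seek(int newPos) { /* silently ignored */ }
        public byte[] contents() { return out.toByteArray(); }
    }

    /** Writes a 4-byte placeholder, seeks back, overwrites it, and checks the stored bytes. */
    static boolean supportsSeekAndOverwrite(SeekableOutput out) {
        byte[] header = {10, 20};
        for (byte b : header) out.writeByte(b);
        for (int i = 0; i < 4; i++) out.writeByte((byte) 0); // placeholder bytes
        out.seek(header.length);
        byte[] real = {1, 2, 3, 4};
        for (byte b : real) out.writeByte(b);
        return Arrays.equals(out.contents(), new byte[] {10, 20, 1, 2, 3, 4});
    }

    public static void main(String[] args) {
        System.out.println("overwriting: " + supportsSeekAndOverwrite(new OverwritingOutput()));
        System.out.println("append-only: " + supportsSeekAndOverwrite(new AppendOnlyOutput()));
    }
}
```

An output that fails this kind of check would leave the placeholder in the file and push the real value past it, which is the situation the append-only codec is meant to avoid.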
Mike McCandless
http://blog.mikemccandless.com