Posted to solr-user@lucene.apache.org by Peter Sturge <pe...@gmail.com> on 2010/12/01 00:24:43 UTC

Re: Preventing index segment corruption when windows crashes

After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
LockObtainFailedException errors: (excerpt)

   30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
   SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
obtain timed out:
NativeFSLock@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock


When I run CheckIndex, I get: (excerpt)

 30 of 30: name=_2fi docCount=857
   compound=false
   hasProx=true
   numFiles=8
   size (MB)=0.769
   diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev ${svnver
sion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, java.version=1.6.0_18,
java.vendor=Sun Microsystems Inc.}
   no deletions
   test: open reader.........FAILED
   WARNING: fixIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: did not read all bytes from file
"_2fi.fnm": read 1 vs size 512
       at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367)
       at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
       at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReade
r.java:119)
       at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583)
       at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561)
       at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467)
       at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878)

WARNING: 1 broken segments (containing 857 documents) detected
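
(For reference, I'm invoking CheckIndex from the command line along
these lines - the jar name and index path here are illustrative and
will vary by build/setup:

  java -cp lucene-core.jar org.apache.lucene.index.CheckIndex solr\data0\index
  java -cp lucene-core.jar org.apache.lucene.index.CheckIndex solr\data0\index -fix

The first form only reports; adding -fix rewrites the segments file to
drop references to any broken segments, losing whatever documents they
contained.)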


This seems to happen every time Windows 7 crashes, and it would seem
extraordinarily bad luck for this tiny test index to be in the middle
of a commit every time.
(It is set to commit every 40 secs, but for such a small index a
commit only takes millis to complete.)

Does this seem right? I don't remember seeing so many corruptions in
the index before - maybe it is just the world of dodgy Win7 drivers,
but it would be worth investigating whether there's something amiss in
Solr/Lucene when things go down unexpectedly...

Thanks,
Peter


On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge <pe...@gmail.com> wrote:
> The index itself isn't corrupt - just one of the segment files. This
> means you can read the index (less the offending segment(s)), but once
> this happens it's no longer possible to
> access the documents that were in that segment (they're gone forever),
> nor write/commit to the index (depending on the env/request, you get
> 'Error reading from index file..' and/or WriteLockError)
> (note that for my use case, documents are dynamically created so can't
> be re-indexed).
>
> Restarting Solr fixes the write lock errors (an indirect environmental
> symptom of the problem), and running CheckIndex -fix is the only way
> I've found to repair the index so it can be written to (rewrites the
> corrupted segment(s)).
>
> I guess I was wondering if there's a mechanism that would support
> something akin to a transactional rollback for segments.
>
> Thanks,
> Peter
>
>
>
> On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge <pe...@gmail.com> wrote:
>>> If a Solr index is running at the time of a system halt, this can
>>> often corrupt a segments file, requiring the index to be -fix'ed by
>>> rewriting the offending file.
>>
>> Really?  That shouldn't be possible (if you mean the index is truly
>> corrupt - i.e. you can't open it).
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>

Re: Preventing index segment corruption when windows crashes

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Dec 2, 2010 at 4:53 AM, Peter Sturge <pe...@gmail.com> wrote:
> Mike, are there any diagnostics/config etc. that I could try to help
> isolate the problem?

Actually it might be easiest to make a standalone Java test, maybe
using Lucene's FSDir, that opens files in sequence (0.bin, 1.bin,
2.bin...), writes verifiable data to them (eg random bytes from a
fixed seed) and then closes & syncs each one.  Then, crash the box
while this is running.  Finally, run a verify step that checks that
the data is "correct"?  Ie that our attempt to fsync "worked"?

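Something along these lines (a rough, untested sketch; it skips FSDir
and instead mimics our close-then-reopen sync pattern directly with
java.io):

  import java.io.File;
  import java.io.FileOutputStream;
  import java.io.RandomAccessFile;
  import java.util.Arrays;
  import java.util.Random;

  public class CrashSyncTest {
    static final int SIZE = 1 << 16;
    static final long SEED = 42;  // fixed seed so the verifier can regenerate

    static byte[] bytesFor(int i) {
      byte[] b = new byte[SIZE];
      new Random(SEED + i).nextBytes(b);  // deterministic per-file content
      return b;
    }

    public static void main(String[] args) throws Exception {
      if (args.length > 0 && args[0].equals("verify")) {
        // After the reboot: every file except possibly the last one
        // (which may have been mid-write at the crash) should verify.
        for (int i = 0; new File(i + ".bin").exists(); i++) {
          RandomAccessFile in = new RandomAccessFile(i + ".bin", "r");
          byte[] buf = new byte[(int) in.length()];
          in.readFully(buf);
          in.close();
          boolean ok = Arrays.equals(buf, bytesFor(i));
          System.out.println(i + ".bin: " + (ok ? "OK" : "CORRUPT"));
        }
      } else {
        // Crash the box (hard power-off) while this loop is running.
        for (int i = 0; ; i++) {
          FileOutputStream out = new FileOutputStream(i + ".bin");
          out.write(bytesFor(i));
          out.close();
          // Mimic FSDirectory.fsync: sync via a *new* descriptor,
          // opened after the writing descriptor was closed.
          RandomAccessFile raf = new RandomAccessFile(i + ".bin", "rw");
          raf.getFD().sync();
          raf.close();
        }
      }
    }
  }

If files that were fully written, closed and sync'd before the crash
come back corrupt, that would point straight at the reopen-and-sync
approach.
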
It could very well be that Windows 6.x is now "smarter" about fsync in
that it only syncs bytes actually written with the currently open file
descriptor, and not bytes written against the same file by past file
descriptors (ie via a global buffer cache, like Linux).

Mike

Re: Preventing index segment corruption when windows crashes

Posted by Peter Sturge <pe...@gmail.com>.
As I'm not familiar with the syncing in Lucene, I couldn't say whether
there's a specific problem with regard to Win7/Server 2008 etc.

Windows has long had the somewhat odd behaviour of deliberately
caching file handles after an explicit close(). This has been part of
NTFS since NT 4 days, but there may be some new behaviour introduced
in Windows 6.x (and there is a lot of new behaviour) that causes an
issue. I have also seen this problem in Windows Server 2008 (server
version of Win7 - same file system).

I'll try some further testing on previous Windows versions, but I've
not previously come across a single segment corruption on Win 2k3/XP
after hard failures. In fact, it was when I first encountered this
problem on Server 2008 that I even discovered CheckIndex existed!

I guess a good question for the community is: Has anyone else
seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or
Win7)?

Mike, are there any diagnostics/config etc. that I could try to help
isolate the problem?

Many thanks,
Peter




Re: Preventing index segment corruption when windows crashes

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge <pe...@gmail.com> wrote:
> I would imagine this same problem could/would occur on any OS if the
> plug was pulled from the machine.

Actually, Lucene should be robust to this -- losing power, OS crash,
hardware failure (as long as the failure doesn't flip bits), etc.
This is because we do not delete files associated with an old commit
point until all files referenced by the new commit point are
successfully fsync'd.

However it sounds like something is wrong, at least on Windows 7.

I suspect it may be how we do the fsync -- if you look in
FSDirectory.fsync, you'll see that we take a String fileName in.  We
then open a new read/write RandomAccessFile, and call its
.getFD().sync().
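
Ie, the shape of it is roughly this (paraphrased from memory, minus
the retry loop and error handling in the real method):

  // Simplified shape of FSDirectory.fsync(String name).  By the time
  // we get here, the bytes were written and the original descriptor
  // was closed long ago; 'directory' is the underlying index dir.
  File fullFile = new File(directory, name);
  RandomAccessFile file = new RandomAccessFile(fullFile, "rw");
  try {
    file.getFD().sync();  // sync via a brand-new descriptor
  } finally {
    file.close();
  }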

I think this is potentially risky, ie, it would be better if we called
.sync() on the original file we had opened for writing and written
lots of data to, before closing it, instead of closing it, opening a
new FileDescriptor, and calling sync on it.  We could conceivably take
this approach, entirely in the Directory impl, by keeping the pool of
file handles for write open even after .close() was called.  When a
file is deleted we'd remove it from that pool, and when it's finally
sync'd we'd then sync it and remove it from the pool.
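
Eg, a hypothetical sketch of such a pool (nothing like this exists in
the code today; all names are made up):

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.util.HashMap;
  import java.util.Map;

  // Keep the descriptor that did the writing alive past close(), so a
  // later sync hits that same descriptor rather than a fresh one.
  class WriteHandlePool {
    private final Map<String,RandomAccessFile> pending =
      new HashMap<String,RandomAccessFile>();

    // The Directory calls this when an output is close()'d: defer the
    // real close until the file is either sync'd or deleted.
    synchronized void onClose(String name, RandomAccessFile raf) {
      pending.put(name, raf);
    }

    // File deleted before ever being sync'd: just release the handle.
    synchronized void onDelete(String name) throws IOException {
      RandomAccessFile raf = pending.remove(name);
      if (raf != null) raf.close();
    }

    // Sync on the original writing descriptor, then really close it.
    synchronized void onSync(String name) throws IOException {
      RandomAccessFile raf = pending.remove(name);
      if (raf != null) {
        raf.getFD().sync();
        raf.close();
      }
    }
  }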

Could it be that on Windows 7 the way we fsync (opening a new
FileDescriptor long after the first one was closed) doesn't in fact
work?

Mike

Re: Preventing index segment corruption when windows crashes

Posted by Peter Sturge <pe...@gmail.com>.
The Win7 crashes aren't from disk drivers - they come from, in this
case, a Broadcom wireless adapter driver.
The corruption comes as a result of the 'hard stop' of Windows.

I would imagine this same problem could/would occur on any OS if the
plug was pulled from the machine.

Thanks,
Peter


On Thu, Dec 2, 2010 at 4:07 AM, Lance Norskog <go...@gmail.com> wrote:
> Is there any way that Windows 7 and disk drivers are not honoring the
> fsync() calls? That would cause files and/or blocks to get saved out
> of order.

Re: Preventing index segment corruption when windows crashes

Posted by Lance Norskog <go...@gmail.com>.
Is there any way that Windows 7 and disk drivers are not honoring the
fsync() calls? That would cause files and/or blocks to get saved out
of order.


-- 
Lance Norskog
goksron@gmail.com