Posted to dev@lucene.apache.org by Thomas Kappler <Th...@microsoft.com> on 2016/11/08 04:04:46 UTC

Index corruption on NTFS

Hi all,



We're occasionally observing corrupted indexes in production, on Windows Server. We tracked it down to the way NTFS behaves in case of partial writes.



When the disk or the machine fails during a flush, NTFS may have already extended the file to its new length even though the newly written content is not yet durable. For security reasons, NTFS returns all zeroes when reading past the last successfully persisted point after the system restarts.



Lucene's commit code relies on writing an updated .gen file as the last step of an index flush/update. In this scenario the file exists but contains only zeroes, which Lucene cannot parse, so a failure at this point leaves the index unreadable.



We think the safest approach, which is also robust to reordered writes, is to treat a gen file containing all zeroes the same as a non-existent gen file. This assumes that by the time the gen file is fsync'ed, all other files have already been flushed to disk explicitly; if that's not the case, there is still exposure to reordered writes.
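For illustration, the check could look something like this (a hypothetical helper I'm sketching here, not actual Lucene code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class GenFileCheck {
    /**
     * Returns true if the gen file should be treated as absent:
     * it does not exist, or every byte is zero (the NTFS zero-fill
     * signature of a file extended but never successfully written).
     */
    public static boolean treatAsMissing(Path genFile) throws IOException {
        if (!Files.exists(genFile)) {
            return true;
        }
        byte[] content = Files.readAllBytes(genFile);
        for (byte b : content) {
            if (b != 0) {
                return false; // real content present, use the file
            }
        }
        return true; // empty or all zeroes: ignore it
    }
}
```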



I don't have a repro at this point. Before digging deeper into this I wanted to see what the Lucene devs think. Does the proposed fix make sense? Any ideas on how to set up a reproducible test for this issue?



We verified this on Elasticsearch 1.7.1, which uses Lucene 4.10.4. Have there been significant changes to this area in newer Lucene versions?



// Thomas


RE: Index corruption on NTFS

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

In addition, Lucene 5+ uses atomic renames (a feature only available through Java's NIO.2 APIs) to publish the segments file. It also fsyncs the directory now, so the commit only becomes visible once it is *completely* done.
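The pattern is roughly the following (my own sketch of the idea, not the actual Lucene code, which does this inside FSDirectory):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicPublish {
    /** Write content to a temp file, fsync it, atomically rename it into
     *  place, then fsync the directory so the rename survives a crash. */
    public static void publish(Path dir, String name, byte[] content) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(content));
            ch.force(true); // flush file data and metadata to stable storage
        }
        // Atomic rename: readers see either no file or the complete new one.
        Files.move(tmp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        // fsync the directory entry; supported on Linux/macOS, not on Windows.
        try (FileChannel dirCh = FileChannel.open(dir, StandardOpenOption.READ)) {
            dirCh.force(true);
        } catch (IOException ignored) {
            // Platform does not support fsyncing a directory.
        }
    }
}
```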

So from my perspective, the bug is fixed in Lucene 5+. Lucene 4.10 still uses legacy Java file I/O (dating back to Java 1.0), so it is not immune to such corner cases.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Index corruption on NTFS

Posted by Michael McCandless <lu...@mikemccandless.com>.
So the segments.gen was written, but then the machine crashed before
Lucene could fsync it?

This is a "normal" case, and it should be fine for the filesystem to
return zeroes to Lucene: Lucene is supposed to be robust to this
situation and fall back to a directory listing to find the largest
segments_N file to try.  Do you not see that logic kicking in?
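That fallback amounts to roughly the following (my sketch; the real logic lives in SegmentInfos.FindSegmentsFile, and this assumes the generation suffix is base-36 as in Lucene's file names):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LargestSegmentsFile {
    /** Scan an index directory for segments_N files and return the
     *  largest generation found, or -1 if there is none. */
    public static long largestGen(Path indexDir) throws IOException {
        long max = -1;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "segments_*")) {
            for (Path p : files) {
                String name = p.getFileName().toString();
                // Lucene encodes the generation in base 36 after the underscore.
                long gen = Long.parseLong(name.substring("segments_".length()), 36);
                max = Math.max(max, gen);
            }
        }
        return max;
    }
}
```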

But, Lucene moved away from the .gen file in 5.0:
https://issues.apache.org/jira/browse/LUCENE-5925

Can you reproduce the corruption with newer ES/Lucene versions?

Mike McCandless

http://blog.mikemccandless.com




Re: Index corruption on NTFS

Posted by Dawid Weiss <da...@gmail.com>.
Crazy. It would be helpful if you could provide a repro of this as a
small Java program one could run on a (small) NTFS partition (perhaps
beasting it over and over to simulate this effect?).

Dawid
