Posted to dev@lucene.apache.org by Chuck Williams <ch...@manawiz.com> on 2006/09/11 02:24:04 UTC

After kill -9 index was corrupt

Hi All,

An application of ours under development had a memory leak that caused
it to slow to a crawl.  On Linux, the application did not respond to
kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
it.  After this the segments file contained a reference to a segment
whose index files were not present.  I.e., the index was corrupt and
Lucene could not open it.

A thread dump at the time of the kill -9 shows that Lucene was merging
segments inside IndexWriter.close().  Since segment merging only commits
(updates the segments file) after the newly merged segment(s) are
complete, I expect this is not the actual problem.

Could a kill -9 prevent data from reaching disk for files that were
previously closed?  If so, then Lucene's index can become corrupt after
kill -9.  In this case, it is possible that a prior merge created new
segment index files, updated the segments file, closed everything, the
segments file made it to disk, but the index data files and/or their
directory entries did not.

If this is the case, it seems to me that flush() and
FileDescriptor.sync() are required on each index file prior to close()
to guarantee no corruption.  Additionally a FileDescriptor.sync() is
also probably required on the index directory to ensure the directory
entries have been persisted.
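
For concreteness, the protocol I have in mind looks roughly like this
(a minimal sketch in plain Java, not existing Lucene code; note the
stream must still be open when sync() is called):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    static void writeDurably(String path, byte[] data) throws IOException {
        FileOutputStream fos = new FileOutputStream(path);
        BufferedOutputStream out = new BufferedOutputStream(fos);
        try {
            out.write(data);
            out.flush();         // drain user-level buffers into the OS
            fos.getFD().sync();  // fsync: ask the OS to persist the file
        } finally {
            out.close();         // close only after the data is durable
        }
    }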

A power failure or other operating system crash could cause this, not
just kill -9.

Does this seem like a possible explanation and fix for what happened? 
Could the same kind of problem happen on Windows?

If this is the issue, then how would people feel about having Lucene do
sync()'s a) always? or b) as an index configuration option?

I need to fix whatever happened and so would submit a patch to resolve it.

Thanks for any advice and suggestions,

Chuck




Re: After kill -9 index was corrupt

Posted by Michael McCandless <lu...@mikemccandless.com>.
Chuck Williams wrote:
> Hi All,
> 
> I found this issue.  There is no problem in Lucene, and I'd like to
> leave this thread with that assertion to avoid confusing future archive
> searchers/readers.
> 
> The index was actually not corrupt at all.  I use ParallelReader and
> ParallelWriter.  A kill -9 can leave the subindexes out of sync.  My
> recovery code repairs this on restart by noticing the indexes are
> out-of-sync, deleting the document(s) that were added to some
> subindex(es) but not the other(s), then optimizing to resync the doc-ids.
> 
> The issue is that my bulk updater does not at present support the
> compound file format and the recovery code forgot to turn that off
> prior to the
> optimize!  Thus a .cfs file was created, which confused the bulk updater
> -- it did not see a segment that was inside the cfs.
> 
> Sorry for the false alarm and thanks to all who helped with the original
> question/concern,

Phew -- glad to hear this!  Thanks for bringing closure to this issue.

Mike



Re: After kill -9 index was corrupt

Posted by Chuck Williams <ch...@manawiz.com>.
Hi All,

I found this issue.  There is no problem in Lucene, and I'd like to
leave this thread with that assertion to avoid confusing future archive
searchers/readers.

The index was actually not corrupt at all.  I use ParallelReader and
ParallelWriter.  A kill -9 can leave the subindexes out of sync.  My
recovery code repairs this on restart by noticing the indexes are
out-of-sync, deleting the document(s) that were added to some
subindex(es) but not the other(s), then optimizing to resync the doc-ids.

The issue is that my bulk updater does not at present support the
compound file format and the recovery code forgot to turn that off
prior to the
optimize!  Thus a .cfs file was created, which confused the bulk updater
-- it did not see a segment that was inside the cfs.
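
For the archives, the fix was essentially a one-liner; in sketch form
(against the Lucene 2.0-era API, not my exact recovery code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    static void resyncSubindex(Directory dir) throws java.io.IOException {
        // analyzer choice is immaterial here; no docs are analyzed
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        writer.setUseCompoundFile(false); // don't produce a .cfs
        writer.optimize();                // resync doc-ids across subindexes
        writer.close();
    }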

Sorry for the false alarm and thanks to all who helped with the original
question/concern,

Chuck


Chuck Williams wrote on 09/11/2006 12:10 PM:
> I do have one module that does custom index operations.  This is my bulk
> updater.  It creates new index files for the segments it modifies and a
> new segments file, then uses the same commit mechanism as merging. 
> I.e., it copies its new segments file into "segments" with the commit
> lock only after all the new index files are closed.  In the problem
> scenario, I don't have any indication that the bulk updater was
> complicit but am of course fully exploring that possibility as well.
>
> The index was only reopened by the process after the kill -9 of the old
> process was completed, so there were not any threads still working on
> the old process.
>
> This remains a mystery.  Thanks for your analysis and suggestions.  If
> you have more ideas, please keep them coming!
>
> Chuck
>
>
> robert engels wrote on 09/11/2006 10:06 AM:
>   
>> I am not stating that you did not uncover a problem. I am only stating
>> that it is not due to OS level caching.
>>
>> Maybe your sequence of events triggered a reread of the index, while
>> some thread was still writing. The reread sees the 'unused segments'
>> and deletes them, and then the other thread writes the updated
>> 'segments' file.
>>
>> From what you state, it seems that you are using some custom code for
>> index writing? (Maybe the NewIndexModified stuff)? Possibly there is
>> an issue there. Do you maybe have your own cleanup code that attempts
>> to remove unused segments from the directory? If so, that appears to
>> be the likely culprit to me.
>>
>> On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote:
>>
>>     
>>> robert engels wrote on 09/11/2006 07:34 AM:
>>>       
>>>> A kill -9 should not affect the OS's writing of dirty buffers
>>>> (including directory modifications). If this were the case, massive
>>>> system corruption would almost always occur every time a kill -9 was
>>>> used with any program.
>>>>
>>>> The only thing a kill -9 affects is user level buffering. The OS
>>>> always maintains a consistent view of directory modifications and/or
>>>> file modifications that were requested by programs.
>>>>
>>>> This entire discussion is pointless.
>>>>
>>>>         
>>> Thanks everyone for your analysis.  It appears I do not have any
>>> explanation.  In my case, the process was in gc-limbo due to the memory
>>> leak and having butted up against its -Xmx.  The process was kill -9'd
>>> and then restarted.  The OS never crashed.  The server this is on is
>>> healthy; it has been used continually since this happened without being
>>> rebooted and no file system or any other issues.  When the process was
>>> killed, one thread was merging segments as part of flushing the ram
>>> buffer while closing the index, due to the prior kill -15.  When Lucene
>>> restarted, the segments file contained a segment name for which there
>>> were no corresponding index data files.
>>>
>>> Chuck


Re: After kill -9 index was corrupt

Posted by Chuck Williams <ch...@manawiz.com>.
I do have one module that does custom index operations.  This is my bulk
updater.  It creates new index files for the segments it modifies and a
new segments file, then uses the same commit mechanism as merging. 
I.e., it copies its new segments file into "segments" with the commit
lock only after all the new index files are closed.  In the problem
scenario, I don't have any indication that the bulk updater was
complicit but am of course fully exploring that possibility as well.
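
In outline the commit step looks like this (a hypothetical method
paraphrasing the 2.0-era commit-lock mechanism, not my actual code):

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.Lock;

    static void commitNewSegments(Directory dir) throws IOException {
        // All new index files and "segments.new" have been written
        // and closed before this point.
        Lock lock = dir.makeLock(IndexWriter.COMMIT_LOCK_NAME);
        lock.obtain(IndexWriter.COMMIT_LOCK_TIMEOUT); // throws on timeout
        try {
            dir.renameFile("segments.new", "segments"); // the commit point
        } finally {
            lock.release();
        }
    }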

The index was only reopened by the process after the kill -9 of the old
process was completed, so there were not any threads still working on
the old process.

This remains a mystery.  Thanks for your analysis and suggestions.  If
you have more ideas, please keep them coming!

Chuck


robert engels wrote on 09/11/2006 10:06 AM:
> I am not stating that you did not uncover a problem. I am only stating
> that it is not due to OS level caching.
>
> Maybe your sequence of events triggered a reread of the index, while
> some thread was still writing. The reread sees the 'unused segments'
> and deletes them, and then the other thread writes the updated
> 'segments' file.
>
> From what you state, it seems that you are using some custom code for
> index writing? (Maybe the NewIndexModified stuff)? Possibly there is
> an issue there. Do you maybe have your own cleanup code that attempts
> to remove unused segments from the directory? If so, that appears to
> be the likely culprit to me.
>
> On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote:
>
>> robert engels wrote on 09/11/2006 07:34 AM:
>>> A kill -9 should not affect the OS's writing of dirty buffers
>>> (including directory modifications). If this were the case, massive
>>> system corruption would almost always occur every time a kill -9 was
>>> used with any program.
>>>
>>> The only thing a kill -9 affects is user level buffering. The OS
>>> always maintains a consistent view of directory modifications and/or
>>> file modifications that were requested by programs.
>>>
>>> This entire discussion is pointless.
>>>
>> Thanks everyone for your analysis.  It appears I do not have any
>> explanation.  In my case, the process was in gc-limbo due to the memory
>> leak and having butted up against its -Xmx.  The process was kill -9'd
>> and then restarted.  The OS never crashed.  The server this is on is
>> healthy; it has been used continually since this happened without being
>> rebooted and no file system or any other issues.  When the process was
>> killed, one thread was merging segments as part of flushing the ram
>> buffer while closing the index, due to the prior kill -15.  When Lucene
>> restarted, the segments file contained a segment name for which there
>> were no corresponding index data files.
>>
>> Chuck


Re: After kill -9 index was corrupt

Posted by robert engels <re...@ix.netcom.com>.
I am not stating that you did not uncover a problem. I am only  
stating that it is not due to OS level caching.

Maybe your sequence of events triggered a reread of the index, while  
some thread was still writing. The reread sees the 'unused segments'  
and deletes them, and then the other thread writes the updated  
'segments' file.

From what you state, it seems that you are using some custom code
for index writing? (Maybe the NewIndexModified stuff)? Possibly there  
is an issue there. Do you maybe have your own cleanup code that  
attempts to remove unused segments from the directory? If so, that  
appears to be the likely culprit to me.

On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote:

> robert engels wrote on 09/11/2006 07:34 AM:
>> A kill -9 should not affect the OS's writing of dirty buffers
>> (including directory modifications). If this were the case, massive
>> system corruption would almost always occur every time a kill -9 was
>> used with any program.
>>
>> The only thing a kill -9 affects is user level buffering. The OS
>> always maintains a consistent view of directory modifications and/or
>> file modifications that were requested by programs.
>>
>> This entire discussion is pointless.
>>
> Thanks everyone for your analysis.  It appears I do not have any
> explanation.  In my case, the process was in gc-limbo due to the memory
> leak and having butted up against its -Xmx.  The process was kill -9'd
> and then restarted.  The OS never crashed.  The server this is on is
> healthy; it has been used continually since this happened without being
> rebooted and no file system or any other issues.  When the process was
> killed, one thread was merging segments as part of flushing the ram
> buffer while closing the index, due to the prior kill -15.  When Lucene
> restarted, the segments file contained a segment name for which there
> were no corresponding index data files.
>
> Chuck


Re: After kill -9 index was corrupt

Posted by Chuck Williams <ch...@manawiz.com>.
robert engels wrote on 09/11/2006 07:34 AM:
> A kill -9 should not affect the OS's writing of dirty buffers
> (including directory modifications). If this were the case, massive
> system corruption would almost always occur every time a kill -9 was
> used with any program.
>
> The only thing a kill -9 affects is user level buffering. The OS
> always maintains a consistent view of directory modifications and/or
> file modifications that were requested by programs.
>
> This entire discussion is pointless.
>
Thanks everyone for your analysis.  It appears I do not have any
explanation.  In my case, the process was in gc-limbo due to the memory
leak and having butted up against its -Xmx.  The process was kill -9'd
and then restarted.  The OS never crashed.  The server this is on is
healthy; it has been used continually since this happened without being
rebooted and no file system or any other issues.  When the process was
killed, one thread was merging segments as part of flushing the ram
buffer while closing the index, due to the prior kill -15.  When Lucene
restarted, the segments file contained a segment name for which there
were no corresponding index data files.

Chuck




Re: After kill -9 index was corrupt

Posted by robert engels <re...@ix.netcom.com>.
A kill -9 should not affect the OS's writing of dirty buffers  
(including directory modifications). If this were the case, massive  
system corruption would almost always occur every time a kill -9 was  
used with any program.

The only thing a kill -9 affects is user level buffering. The OS  
always maintains a consistent view of directory modifications and/or
file modifications that were requested by programs.

This entire discussion is pointless.

If the hardware is performing lazy writes, then corruption might be  
caused during a hard-reset (e.g. power failure, hardware failure (cpu  
lockup), or device driver failure in some VERY RARE cases). Even a  
kernel panic should allow the physical devices to flush their dirty  
buffers (as the controller can in many cases detect this).

The only way to prevent this is if the OS exposes a system call to
synchronously force the hardware to flush any buffers. The Java fsync()
could use the OS call to perform this operation, but most likely would
not, since the performance penalty would be significant, and those
users requiring this level of reliability would be better served by
using UPSs and fault-tolerant systems (at every level - CPU, disk,
etc.).


On Sep 11, 2006, at 12:13 PM, Paul Elschot wrote:

> On Monday 11 September 2006 15:36, Yonik Seeley wrote:
>> On 9/10/06, Chuck Williams <ch...@manawiz.com> wrote:
>>> Could a kill -9 prevent data from reaching disk for files that were
>>> previously closed?
>>
>> No.  After a close() the OS should have all the data... the process
>> may be killed but the OS will eventually flush all the buffers, etc.
>> File creation is pretty much always synchronous so I have no idea how
>> your problem could have happened (missing segment files).  IO  
>> error or
>> something else temporarily filling up the disk?
>
> "Pretty much always" is logically equivalent to "not always" :) ,  
> see also
> below.
>
>> If you have a power loss or crash, then that *can* cause data loss.
>> There may be mount options to make more file operations synchronous,
>> or you could maybe write your own Directory implementation to make
>> things more synchronous.
>
> New segments will have to be flushed/fsynced before the segments
> file. This could be hidden inside a Directory implementation, provided
> the Directory knows which file is Lucene's segments file.
>
> Regards,
> Paul Elschot
>


Re: After kill -9 index was corrupt

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 11 September 2006 15:36, Yonik Seeley wrote:
> On 9/10/06, Chuck Williams <ch...@manawiz.com> wrote:
> > Could a kill -9 prevent data from reaching disk for files that were
> > previously closed?
> 
> No.  After a close() the OS should have all the data... the process
> may be killed but the OS will eventually flush all the buffers, etc.
> File creation is pretty much always synchronous so I have no idea how
> your problem could have happened (missing segment files).  IO error or
> something else temporarily filling up the disk?

"Pretty much always" is logically equivalent to "not always" :) , see also
below.

> If you have a power loss or crash, then that *can* cause data loss.
> There may be mount options to make more file operations synchronous,
> or you could maybe write your own Directory implementation to make
> things more synchronous.

New segments will have to be flushed/fsynced before the segments
file. This could be hidden inside a Directory implementation, provided
the Directory knows which file is Lucene's segments file.
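
A sketch of the ordering I mean, with hypothetical helper names (plain
java.io, not an existing Lucene API -- Directory has no sync today):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // fsync an already-written file via a descriptor opened on it
    static void fsync(String path) throws IOException {
        RandomAccessFile f = new RandomAccessFile(path, "rw");
        try {
            f.getFD().sync();
        } finally {
            f.close();
        }
    }

    static void commitDurably(String[] newSegmentFiles) throws IOException {
        for (int i = 0; i < newSegmentFiles.length; i++)
            fsync(newSegmentFiles[i]); // 1. segment data files first
        writeNewSegmentsFile();        // 2. then write "segments" (hypothetical)
        fsync("segments");             // 3. make the commit point durable
    }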

Regards,
Paul Elschot
 



Re: After kill -9 index was corrupt

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On 9/11/06, Michael McCandless <lu...@mikemccandless.com> wrote:
>> However, I do think it would be a good idea to [optionally] add a
>> sync() call on committing the segments file to still be robust to OS /
>> machine crashing... it would slow down performance of indexing but
>> hopefully not by too much since the segments file is small.
> 
> To increase crash resilience, one would want to sync all the segment
> files *before* writing the new segments file.  Of course it's not
> always easy to ensure data is actually on stable storage:
> http://brad.livejournal.com/2116715.html

Ahh yes you're right.

And indeed, even with synced writing of all the segment files, the hard
drives themselves often cache writes by default these days.

Mike



Re: After kill -9 index was corrupt

Posted by Yonik Seeley <yo...@apache.org>.
On 9/11/06, Michael McCandless <lu...@mikemccandless.com> wrote:
> However, I do think it would be a good idea to [optionally] add a
> sync() call on committing the segments file to still be robust to OS /
> machine crashing... it would slow down performance of indexing but
> hopefully not by too much since the segments file is small.

To increase crash resilience, one would want to sync all the segment
files *before* writing the new segments file.  Of course it's not
always easy to ensure data is actually on stable storage:
http://brad.livejournal.com/2116715.html

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: After kill -9 index was corrupt

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On 9/10/06, Chuck Williams <ch...@manawiz.com> wrote:
>> Could a kill -9 prevent data from reaching disk for files that were
>> previously closed?
> 
> No.  After a close() the OS should have all the data... the process
> may be killed but the OS will eventually flush all the buffers, etc.
> File creation is pretty much always synchronous so I have no idea how
> your problem could have happened (missing segment files).  IO error or
> something else temporarily filling up the disk?
> 
> If you have a power loss or crash, then that *can* cause data loss.
> There may be mount options to make more file operations synchronous,
> or you could maybe write your own Directory implementation to make
> things more synchronous.

Agreed ... it's hard to explain how this could have occurred without
the OS / machine actually going down abruptly.

Is this all on a single machine, local filesystem?

Are you really sure your underlying IO system is "healthy", no silent
file system corruption going on or anything?

However, I do think it would be a good idea to [optionally] add a
sync() call on committing the segments file to still be robust to OS /
machine crashing... it would slow down performance of indexing but
hopefully not by too much since the segments file is small.

Mike



Re: After kill -9 index was corrupt

Posted by Yonik Seeley <yo...@apache.org>.
On 9/10/06, Chuck Williams <ch...@manawiz.com> wrote:
> Could a kill -9 prevent data from reaching disk for files that were
> previously closed?

No.  After a close() the OS should have all the data... the process
may be killed but the OS will eventually flush all the buffers, etc.
File creation is pretty much always synchronous so I have no idea how
your problem could have happened (missing segment files).  IO error or
something else temporarily filling up the disk?

If you have a power loss or crash, then that *can* cause data loss.
There may be mount options to make more file operations synchronous,
or you could maybe write your own Directory implementation to make
things more synchronous.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: After kill -9 index was corrupt

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 11 September 2006 09:50, Chuck Williams wrote:
> 
> Paul Elschot wrote on 09/10/2006 09:15 PM:
> > On Monday 11 September 2006 02:24, Chuck Williams wrote:
> >   
> >> Hi All,
> >>
> >> An application of ours under development had a memory leak that caused
> >> it to slow to a crawl.  On Linux, the application did not respond to
> >> kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
> >> it.  After this the segments file contained a reference to a segment
> >> whose index files were not present.  I.e., the index was corrupt and
> >> Lucene could not open it.
> >>
> >> A thread dump at the time of the kill -9 shows that Lucene was merging
> >> segments inside IndexWriter.close().  Since segment merging only commits
> >> (updates the segments file) after the newly merged segment(s) are
> >> complete, I expect this is not the actual problem.
> >>
> >> Could a kill -9 prevent data from reaching disk for files that were
> >> previously closed?  If so, then Lucene's index can become corrupt after
> >> kill -9.  In this case, it is possible that a prior merge created new
> >> segment index files, updated the segments file, closed everything, the
> >> segments file made it to disk, but the index data files and/or their
> >> directory entries did not.
> >>
> >> If this is the case, it seems to me that flush() and
> >> FileDescriptor.sync() are required on each index file prior to close()
> >> to guarantee no corruption.  Additionally a FileDescriptor.sync() is
> >> also probably required on the index directory to ensure the directory
> >> entries have been persisted.
> >>     
> >
> > Shouldn't the sync be done after closing the files? I'm using sync in a
> > (un*x) shell script after merges, before backups. I'd prefer to have some
> > more of this syncing built into Lucene because the shell sync syncs all
> > disks which might be more than needed. So far I've had no problems,
> > so there was no need to investigate further.
> >   
> I believe FileDescriptor.sync() uses fsync and not sync on Linux.  A
> FileDescriptor is no longer valid after the stream is closed, so sync()
> could not be done on a closed stream.  I think the correct protocol is
> flush() the stream, sync() its FD, then close() it.

From Sun's javadocs: flush(), fsync(), close() is indeed the right order
for a single file.
 
> Paul, do you know if kill -9 can create the situation where bytes from a
> closed file never make it to disk on Linux?  I think Lucene needs sync()

What do you mean by "never"? The problem with not using flush() is that
the JVM simply does _not_ guarantee that data will ever end up on disk,
which is why I added the sync in the shell script after the document merging.

With flush() and sync the guarantee is only given as far as the OS can use
the disk driver; if the disk actually does not write, there is nothing to be
done about that -- see the link in the other post.

> in any event to be robust with respect to OS crashes, but am wondering
> if this explains my kill -9 problem as well.  It seems bogus to me that

This could explain your problem: data will eventually be written to the
disk by the OS, but when?

> a closed file's bytes would fail to be persisted unless the OS crashed,
> but I can't find any other explanation and I can't find any definitive
> information to affirm or refute this possible side effect of kill -9.
> 
> The issue I've got is that my index can never lose documents.  So I've
> implemented journaling on top of Lucene where only the last
> maxBufferedDocs documents are journaled and the whole journal is reset
> after close().  My application has no way to know when the bytes make it
> to disk, and so cannot manage its journal properly unless Lucene ensures
> index integrity with sync()'s.

Do you also flush/sync the journal to disk? If you need to recover from the
journal, it has to be written to disk before doing "transactions" (adding
docs) in Lucene.

Regards,
Paul Elschot



Re: After kill -9 index was corrupt

Posted by Chuck Williams <ch...@manawiz.com>.

Paul Elschot wrote on 09/10/2006 09:15 PM:
> On Monday 11 September 2006 02:24, Chuck Williams wrote:
>   
>> Hi All,
>>
>> An application of ours under development had a memory leak that caused
>> it to slow to a crawl.  On Linux, the application did not respond to
>> kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
>> it.  After this the segments file contained a reference to a segment
>> whose index files were not present.  I.e., the index was corrupt and
>> Lucene could not open it.
>>
>> A thread dump at the time of the kill -9 shows that Lucene was merging
>> segments inside IndexWriter.close().  Since segment merging only commits
>> (updates the segments file) after the newly merged segment(s) are
>> complete, I expect this is not the actual problem.
>>
>> Could a kill -9 prevent data from reaching disk for files that were
>> previously closed?  If so, then Lucene's index can become corrupt after
>> kill -9.  In this case, it is possible that a prior merge created new
>> segment index files, updated the segments file, closed everything, the
>> segments file made it to disk, but the index data files and/or their
>> directory entries did not.
>>
>> If this is the case, it seems to me that flush() and
>> FileDescriptor.sync() are required on each index file prior to close()
>> to guarantee no corruption.  Additionally a FileDescriptor.sync() is
>> also probably required on the index directory to ensure the directory
>> entries have been persisted.
>>     
>
> Shouldn't the sync be done after closing the files? I'm using sync in a
> (un*x) shell script after merges, before backups. I'd prefer to have some
> more of this syncing built into Lucene because the shell sync syncs all
> disks which might be more than needed. So far I've had no problems,
> so there was no need to investigate further.
>   
I believe FileDescriptor.sync() uses fsync and not sync on Linux.  A
FileDescriptor is no longer valid after the stream is closed, so sync()
could not be done on a closed stream.  I think the correct protocol is
flush() the stream, sync() its FD, then close() it.

Paul, do you know if kill -9 can create the situation where bytes from a
closed file never make it to disk on Linux?  I think Lucene needs sync()
in any event to be robust with respect to OS crashes, but am wondering
if this explains my kill -9 problem as well.  It seems bogus to me that
a closed file's bytes would fail to be persisted unless the OS crashed,
but I can't find any other explanation and I can't find any definitive
information to affirm or refute this possible side effect of kill -9.

The issue I've got is that my index can never lose documents.  So I've
implemented journaling on top of Lucene where only the last
maxBufferedDocs documents are journaled and the whole journal is reset
after close().  My application has no way to know when the bytes make it
to disk, and so cannot manage its journal properly unless Lucene ensures
index integrity with sync()'s.
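
Roughly, the pattern I need to be able to rely on is this (all names
here are hypothetical, not my actual journaling code):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    static void addDurably(FileOutputStream journal, IndexWriter writer,
                           byte[] journalRecord, Document doc)
            throws IOException {
        journal.write(journalRecord); // write-ahead record for the doc
        journal.flush();
        journal.getFD().sync();       // the journal must hit disk first
        writer.addDocument(doc);      // a crash after this is recoverable
        // The journal can only be reset once Lucene's own files are
        // known to be on disk -- which is why the index needs sync()'s.
    }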

Chuck




Re: After kill -9 index was corrupt

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 11 September 2006 02:24, Chuck Williams wrote:
> Hi All,
> 
> An application of ours under development had a memory leak that caused
> it to slow to a crawl.  On Linux, the application did not respond to
> kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
> it.  After this the segments file contained a reference to a segment
> whose index files were not present.  I.e., the index was corrupt and
> Lucene could not open it.
> 
> A thread dump at the time of the kill -9 shows that Lucene was merging
> segments inside IndexWriter.close().  Since segment merging only commits
> (updates the segments file) after the newly merged segment(s) are
> complete, I expect this is not the actual problem.
> 
> Could a kill -9 prevent data from reaching disk for files that were
> previously closed?  If so, then Lucene's index can become corrupt after
> kill -9.  In this case, it is possible that a prior merge created new
> segment index files, updated the segments file, closed everything, the
> segments file made it to disk, but the index data files and/or their
> directory entries did not.
> 
> If this is the case, it seems to me that flush() and
> FileDescriptor.sync() are required on each index file prior to close()
> to guarantee no corruption.  Additionally a FileDescriptor.sync() is
> also probably required on the index directory to ensure the directory
> entries have been persisted.

Shouldn't the sync be done after closing the files? I'm using sync in a
(un*x) shell script after merges, before backups. I'd prefer to have some
more of this syncing built into Lucene because the shell sync syncs all
disks which might be more than needed. So far I've had no problems,
so there was no need to investigate further.

Regards,
Paul Elschot
