Posted to oak-issues@jackrabbit.apache.org by "Ian Boston (JIRA)" <ji...@apache.org> on 2015/11/05 10:02:27 UTC

[jira] [Comment Edited] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

    [ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991349#comment-14991349 ] 

Ian Boston edited comment on OAK-3547 at 11/5/15 9:01 AM:
----------------------------------------------------------

[~mreutegg] If an earlier version of the index is used by the writer, there will be holes in the index and items will be missing. There are several options: a) flag the issue to alert admins that the index is not healthy, but continue indexing using a version of the index that will open; b) fail the index write and stop indexing completely; c) fail the index write and start re-indexing automatically. Of those, I think option a will deliver the best continuity. Option b risks wide-scale application-level issues; option c risks both application-level issues and potential unavailability caused by the load of rebuilding an index from scratch. There is no easy answer.
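
As an illustration only, option a could be sketched roughly as below. The class and method names (GenerationFallback, Choice, choose) and the "opens" predicate are hypothetical and are not taken from the OakDirectory code; the sketch simply walks the known generations newest first, returns the first one that opens, and marks the result as degraded if a newer generation had to be skipped, so that admins can be alerted.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of option a: keep indexing on the newest generation
// that opens, and record that the index is not healthy if we had to fall back.
final class GenerationFallback {

    /** Which generation was chosen, and whether newer generations were skipped. */
    static final class Choice {
        final long generation;
        final boolean degraded;

        Choice(long generation, boolean degraded) {
            this.generation = generation;
            this.degraded = degraded;
        }
    }

    /**
     * @param newestFirst known generations, newest first
     * @param opens       assumption: returns true if that generation opens cleanly
     * @return the first generation that opens, or empty if none does
     */
    static Optional<Choice> choose(List<Long> newestFirst, Predicate<Long> opens) {
        boolean skipped = false;
        for (long generation : newestFirst) {
            if (opens.test(generation)) {
                return Optional.of(new Choice(generation, skipped));
            }
            skipped = true; // a newer generation failed, so the index has holes
        }
        return Optional.empty();
    }
}
```

A caller would log or otherwise flag the index when degraded is true, and only fall through to option b or c when the result is empty.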

Now that there are checksums in place, I have been seeing more frequent race conditions between the writer and the readers, which occasionally open older versions. I think this is because the OakDirectory checks all the files when it is opened, by computing a checksum of everything referenced. I think Lucene delays checking a file, or the internals of a file, until it is needed, hence any errors are more visible than before.
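
To make the eager check concrete, here is a minimal sketch of verifying every file referenced by a listing at open time. It is not the actual OakDirectory code: ListingCheck, Entry, describe and damaged are hypothetical names, an in-memory map stands in for the stored files, and CRC32 is an assumed choice of checksum since the algorithm is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch: check every file referenced by a listing when the
// directory is opened, instead of waiting until the file is first read.
final class ListingCheck {

    /** What the listing records about one file. */
    static final class Entry {
        final String name;
        final long length;
        final long checksum;

        Entry(String name, long length, long checksum) {
            this.name = name;
            this.length = length;
            this.checksum = checksum;
        }
    }

    // Assumption: CRC32 is used only for illustration.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Builds the listing entry for a file as it is written. */
    static Entry describe(String name, byte[] data) {
        return new Entry(name, data.length, crc32(data));
    }

    /** Returns the names of listed files that are missing or do not match. */
    static List<String> damaged(List<Entry> listing, Map<String, byte[]> files) {
        List<String> bad = new ArrayList<>();
        for (Entry entry : listing) {
            byte[] data = files.get(entry.name);
            if (data == null
                    || data.length != entry.length
                    || crc32(data) != entry.checksum) {
                bad.add(entry.name);
            }
        }
        return bad;
    }
}
```

Because every listed file is checked up front, a reader that races with the writer sees a mismatch immediately rather than when Lucene eventually touches the file.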

----

Lucene already has a concept of committing the index by syncing the segments_xx and segments.gen files. I am writing the listing node on sync of either of these files, or on close of the index, which has reduced the number of generations. The result appears to be very stable. I have also introduced the concept of mutability, as some of the file types are mutable. .del is mutable, so its length and checksum are not checked. If a .del from a later generation is used, it will only delete the Lucene docs that were deleted in that later generation. No damage. segments.gen is also mutable, which is more of a problem. It is supposed to be a fallback file, with segments_xx used in preference; however, if segments.gen is used it will be from the wrong generation and will define the wrong set of segment files for the index. I need to check whether segments.gen is ever read. If it is, then I think the OakDirectory needs to map segments.gen to a generational version of the same (i.e. segments.gen_<epoch>) so that only .del files are mutable. That should make the OakDirectory recoverable.
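
The naming rules described above could be sketched as follows. This is only an illustration under assumptions: FileNaming and its methods are hypothetical names, not the OakDirectory API, and the sketch assumes the proposed mapping of segments.gen to segments.gen_<epoch> has been adopted so that .del is the only mutable file type.

```java
// Hypothetical sketch of the file-name rules discussed in this comment.
final class FileNaming {

    /** Syncing one of these files is treated as a commit, so the listing is written. */
    static boolean isCommitFile(String name) {
        return name.equals("segments.gen") || name.startsWith("segments_");
    }

    /** Mutable files may change after being listed, so length and checksum are skipped. */
    static boolean isMutable(String name) {
        return name.endsWith(".del");
    }

    /**
     * Name under which a file is stored. Assumption: segments.gen is mapped to a
     * per-generation name so that a reader never sees one from another generation.
     */
    static String storedName(String name, long epoch) {
        return name.equals("segments.gen") ? name + "_" + epoch : name;
    }
}
```

With this mapping, every file other than .del is immutable once listed, so a listing from an earlier generation still describes a consistent set of files.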







> Improve ability of the OakDirectory to recover from unexpected file errors
> --------------------------------------------------------------------------
>
>                 Key: OAK-3547
>                 URL: https://issues.apache.org/jira/browse/OAK-3547
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>    Affects Versions: 1.4
>            Reporter: Ian Boston
>
> Currently, if the OakDirectory finds that a file is missing or in some way damaged, an exception is thrown which impacts all queries using that index, at times making the index unavailable. This improvement aims to make the OakDirectory recover to a previously OK state by storing which files were involved in previous states, and giving the code some way of checking whether they are valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)