Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2017/03/27 11:37:41 UTC

[jira] [Updated] (OAK-4114) Cached lucene index gets corrupted in case of unclean shutdown and journal rollback in SegmentNodeStore

     [ https://issues.apache.org/jira/browse/OAK-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-4114:
---------------------------------
    Labels: candidate_oak_1_2 resilience  (was: candidate_oak_1_2 candidate_oak_1_4 resilience)

> Cached lucene index gets corrupted in case of unclean shutdown and journal rollback in SegmentNodeStore
> -------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-4114
>                 URL: https://issues.apache.org/jira/browse/OAK-4114
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Critical
>              Labels: candidate_oak_1_2, resilience
>             Fix For: 1.4.15, 1.5.14, 1.6.0, 1.0.38
>
>
> Currently the Oak Lucene support copies index files to the local file system as part of the CopyOnRead feature. In one setup it was observed that the index logic was failing with the following error:
> {noformat}
> 04.02.2016 17:47:52.391 *WARN* [oak-lucene-3] org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier [/oak:index/lucene] Found local copy for _2ala.cfs in MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 but size of local 9320 differs from remote 3714150. Content would be read from remote file only
> 04.02.2016 17:47:52.399 *WARN* [oak-lucene-3] org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier [/oak:index/lucene] Found local copy for segments_28je in MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 but size of local 1214 differs from remote 1175. Content would be read from remote file only
> 04.02.2016 17:47:52.491 *ERROR* [oak-lucene-3] org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open Lucene index at /oak:index/lucene
> org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1953790076 vs expected header=1071082519 (resource: SlicedIndexInput(SlicedIndexInput(_2ala.fnm in _2ala.cfs) in _2ala.cfs slice=8810:9320))
> 	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
> 	at org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
> 	at org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215)
> {noformat}
> Here the size of __2ala.cfs_ differed from the remote copy, and possibly other index files had the same size but different content. Comparing the modified times of the files on the file system with those in Oak shows that one of the file system copies was older than the one in Oak:
> {noformat}
> _2alr.cfs={name=_2alr.cfs, size=1152402, sizeStr=1.2 MB, modified=Thu Feb 04 17:52:31 GMT 2016, osModified=Feb 4 17:52, osSize=1152402, mismatch=false}
> _2ala.cfe={name=_2ala.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=224, mismatch=false}
> _2ala.si={name=_2ala.si, size=252, sizeStr=252 B, modified=Thu Feb 04 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=252, mismatch=false}
> _2ala.cfs={name=_2ala.cfs, size=3714150, sizeStr=3.7 MB, modified=Thu Feb 04 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=9320, mismatch=true}
> _14u3_29.del={name=_14u3_29.del, size=1244036, sizeStr=1.2 MB, modified=Thu Feb 04 16:37:35 GMT 2016, osModified=Feb 4 16:37, osSize=1244036, mismatch=false}
> _2akw.si={name=_2akw.si, size=252, sizeStr=252 B, modified=Thu Feb 04 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=252, mismatch=false}
> _2akw.cfe={name=_2akw.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=224, mismatch=false}
> _2akw.cfs={name=_2akw.cfs, size=4952761, sizeStr=5.0 MB, modified=Thu Feb 04 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=4952761, mismatch=false}
> {noformat}
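The warning above comes from a length comparison between the local copy and the file stored in Oak. A minimal sketch of that decision, with illustrative names rather than the actual Oak `IndexCopier` API:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class CopyOnReadSketch {
    // Hypothetical helper: serve the local copy only when it exists and its
    // length matches the remote length recorded in Oak. On a mismatch the
    // caller falls back to reading the remote file, as the warning describes.
    static boolean useLocalCopy(File local, long remoteLength) {
        return local.exists() && local.length() == remoteLength;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("_2ala", ".cfs");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[9320]); // stale local copy: 9320 bytes
        }
        // The remote copy in Oak is 3714150 bytes, so the local copy is skipped.
        System.out.println(useLocalCopy(f, 3714150L)); // prints false
        System.out.println(useLocalCopy(f, 9320L));    // prints true
    }
}
```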
> And on the same setup the system did see a rollback in the segment node store:
> {noformat}
> -rw-rw-r--. 1 crx crx  25961984 Feb  4 16:47 data01357a.tar
> -rw-rw-r--. 1 crx crx  24385536 Feb  4 16:41 data01357a.tar.bak
> -rw-rw-r--. 1 crx crx    359936 Feb  4 17:18 data01358a.tar
> -rw-rw-r--. 1 crx crx    345088 Feb  4 17:17 data01358a.tar.bak
> -rw-rw-r--. 1 crx crx  70582272 Feb  4 18:35 data01359a.tar
> -rw-rw-r--. 1 crx crx  66359296 Feb  4 18:33 data01359a.tar.bak
> -rw-rw-r--. 1 crx crx    282112 Feb  4 18:46 data01360a.tar
> -rw-rw-r--. 1 crx crx    236544 Feb  4 18:45 data01360a.tar.bak
> -rw-rw-r--. 1 crx crx    138240 Feb  4 18:56 data01361a.tar
> {noformat}
> So one possible sequence of events is:
> # Some time before 17:17 the Lucene index got updated and __2ala.cfs_ got created.
> # After the update the head revision in the segment store was updated, but the revision had not yet made it to the journal log.
> # The Lucene CopyOnRead logic got an event for the change and copied the file.
> # The system crashed, so the journal did not get updated.
> # The system restarted and, per the last entry in the journal, suffered some "data loss", so the index checkpoint also moved back.
> # As the checkpoint got reverted, indexing restarted from the earlier state and created a file with the same name __2ala.cfs_.
> # CopyOnRead detected the file length change and logged a warning, routing calls to the remote copy.
> # However, other files like _2ala.si and _2ala.cfe, which were created in the same commit, had the same size but likely different content, which later caused Lucene queries to start failing.
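The last step above is the crux: a length-only comparison cannot detect that a file was rewritten with different bytes of the same size. A small illustration of the blind spot (plain Java, not Oak code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameSizeBlindSpot {
    // The length-only freshness check: passes for any two files of equal size.
    static boolean lengthMatches(byte[] local, byte[] remote) {
        return local.length == remote.length;
    }

    public static void main(String[] args) {
        // _2ala.si before and after the checkpoint rollback: an identical
        // 252-byte length is plausible, but the bytes differ.
        byte[] beforeCrash = "segment info written pre-crash ".getBytes(StandardCharsets.UTF_8);
        byte[] afterRevert = "segment info written post-crash".getBytes(StandardCharsets.UTF_8);
        System.out.println(lengthMatches(beforeCrash, afterRevert)); // prints true: stale copy accepted
        System.out.println(Arrays.equals(beforeCrash, afterRevert)); // prints false: content differs
    }
}
```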
> In such a case a restart after cleaning the existing local index content would have brought the system back to a normal state.
> So as a fix we need to come up with some sanity check at system startup.
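One possible shape for such a startup sanity check, assuming the expected checksums of the index files can be obtained from Oak: verify each cached file against the expected digest and, on any mismatch, purge the local directory so the files are re-copied cleanly. This is an illustrative sketch under that assumption, not the fix that was actually implemented:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Map;

public class StartupIndexCheck {
    // Hex-encoded MD5 of a file's contents.
    static String md5(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // True when every cached file matches its expected checksum. A real
    // check would delete the local directory on failure, forcing a fresh
    // copy from Oak instead of serving a stale same-size file.
    static boolean localIndexLooksSane(Path dir, Map<String, String> expectedMd5) throws Exception {
        for (Map.Entry<String, String> e : expectedMd5.entrySet()) {
            Path f = dir.resolve(e.getKey());
            if (!Files.exists(f) || !md5(f).equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("index");
        Path f = dir.resolve("file.si");
        Files.write(f, "v1".getBytes());
        Map<String, String> expected = Map.of("file.si", md5(f));
        System.out.println(localIndexLooksSane(dir, expected)); // prints true

        Files.write(f, "v2".getBytes()); // same size, different content
        System.out.println(localIndexLooksSane(dir, expected)); // prints false: purge and re-copy
    }
}
```

Unlike the length check, a checksum comparison catches the same-size-different-content case from the failure scenario above, at the cost of reading each file once at startup.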



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)