You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by simon <mt...@gmail.com> on 2017/06/08 17:12:08 UTC
Random Index Corruption exceptions during bulk indexing

I'm seeing a randomly occuring Index Corruption exception during a Solr
 data ingest. This can occur anywhere during the 7-8 hours our ingests
take. I've submitted a Solr bug issue to JIRA as this is the environment
I'm using, but it does look as though the error is occurring in Lucene
code, so I thought it might get some extra scrutiny here - I did not get
any useful feedback on the solr-user list.

Some background:

    AWS EC2 server running CentOS 7
    java.runtime.version: 1.8.0_131-b11  (also occurred with 1.8.0_45).

    Solr 6.3.0 (have also seen it with Solr 6.5.1). It did not happen with
Solr 5.4 (which i can't go back to). Oddly enough, I ran Solr 6.3.0
unvenetfully for several weeks before this problem first occurred.
LuceneMatchVersion in solrconfig.xml is set to 6.3.0

    Standalone  (non cloud) environment.

    Our indexing subsystem is a complex Python script which creates
multiple indexing subprocesses in order to make use of multiple cores. Each
subprocess reads records from a MySQL database, does  some significant
preprocessing and sends a batch of documents (defaults to 500) to the Solr
update handler (using the Python 'scorched' module). Each content source
(there are 5-6) requires a separate instantiation of the script, and these
wrapped in a Bash script to run serially.

    When the exception occurs, we always see something like the following
in the solr.log

    ERROR - 2017-06-06 14:37:34.639; [   x:stresstest1]
org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
Exception writing document id med-27840-00384802 to the index; possible
analysis error.
    at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178
    ...
    Caused by: org.apache.lucene.store.AlreadyClosedException: this
IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
    at
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
    at
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
    at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
    at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
    ... 42 more
Caused by: java.io.EOFException: read past EOF:
MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
    at
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
    at
org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
        at
org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
    at
org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
    at
org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
    at
org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
    at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
    at
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
    at
org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
    at
org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
    at
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
    at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
    at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
    at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
    at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
    Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
status indeterminate: remaining=0, please run checkindex for more details
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
        at
org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
        at
org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
        ... 12 more

    This is usually followed in very short order by similar exceptions  as
other UpdateHandler threads hit the same IOException.

    I've also seen what I assume is a related error during autocommits-

INFO  - 2017-05-30 18:01:35.264; [   x:build0530]
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
ERROR - 2017-05-30 18:01:36.884; [   x:build0530]
org.apache.solr.common.SolrException; auto commit
error...:org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
    at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
    at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
    at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer
mismatch (file truncated?): actual footer=-2060254071 vs expected
footer=-1071082520
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
    at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
    at
org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
    at
org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
    at
org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
    at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
    at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
    at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
    at
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
    at
org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
    at
org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
    at
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
    at
org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
    at
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
    at
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
    at
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
    at
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
    ... 10 more


    Other Observations:

        It's not assocciated with a specific Lucene index file type (the
'read past EOF' has been reported on .fnm, .nvm, .tip. .si, .dvm files)
        I've configured merges to use the LogByteSizeMergePolicyFactory and
TieredMergePolicyFactory, and I see the failure with either.
        The file system (~600gb) is never more than 50% full so disk space
is not an issue
        I've seen this occur with indexes on both ext4 and xfs file systems
(which have been fsck'ed /repaired, and we're not seeing any hardware
problems reported in the system logs). These file systems are all SSDs.
        Solr is started with a 5Gb heap and I haven't seen heap usage >
3gb; also, there is no concurrent query activity during the indexing
process.
        I can recover from this error by unloading the core, running
fixindex (whixch reports no errors), reloading the core, and continuing
indexing from a checkpoint in the indexing script.

    I've created a test rig (in Python) which can be run independently of
our environment and workflow, and have managed to get this to throw the
exception (the first stack trace above is from a run with that).

    My semi-informed guess is that this is due to a race condition between
segment merges and index updates...

-Simon