Posted to dev@lucene.apache.org by "Simon Rosenthal (JIRA)" <ji...@apache.org> on 2017/11/09 15:04:00 UTC

[jira] [Resolved] (SOLR-10840) Random Index Corruption during bulk indexing

     [ https://issues.apache.org/jira/browse/SOLR-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Rosenthal resolved SOLR-10840.
------------------------------------
    Resolution: Cannot Reproduce

After moving our production Solr server to a new AWS instance, the problem disappeared. Heaven knows why.

> Random Index Corruption during bulk indexing
> --------------------------------------------
>
>                 Key: SOLR-10840
>                 URL: https://issues.apache.org/jira/browse/SOLR-10840
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update
>    Affects Versions: 6.3, 6.5.1
>         Environment: AWS EC2 instance running Centos 7
>            Reporter: Simon Rosenthal
>
> I'm seeing a randomly occurring index-corruption exception during a Solr data ingest. It can occur anywhere during the 7-8 hours our ingests take. I'm initially submitting this as a Solr bug since that is the environment I'm using, but it does look as though the error is occurring in Lucene code.
> Some background:
>     AWS EC2 server running CentOS 7
>     java.runtime.version: 1.8.0_131-b11 (also occurred with 1.8.0_45).
>     Solr 6.3.0 (I have also seen it with Solr 6.5.1). It did not happen with Solr 5.4 (which I can't go back to). Oddly enough, I ran Solr 6.3.0 uneventfully for several weeks before this problem first occurred.
>     Standalone (non-cloud) environment.
>     Our indexing subsystem is a complex Python script which creates multiple indexing subprocesses in order to make use of multiple cores. Each subprocess reads records from a MySQL database, does some significant preprocessing, and sends batches of documents (500 by default) to the Solr update handler (using the Python 'scorched' module). Each content source (there are 5-6) requires a separate instantiation of the script, and these are wrapped in a Bash script so they run serially. A minimal sketch of one such subprocess appears below.
>     
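>     A stripped-down sketch of one such subprocess follows. This is illustrative only: the real script uses the scorched client and far heavier preprocessing, and the core name, URL, table fields and batching helper here are invented; the plain JSON POST to the update handler stands in for the scorched calls.
> 
>         import json
>         import pymysql
>         import requests   # stand-in for the 'scorched' client the real script uses
> 
>         SOLR_UPDATE_URL = "http://localhost:8983/solr/stresstest1/update"   # hypothetical core/URL
>         BATCH_SIZE = 500
> 
>         def index_source(db_params, query):
>             """Read rows from MySQL and post them to Solr in batches of 500 documents."""
>             conn = pymysql.connect(**db_params)
>             batch = []
>             with conn.cursor(pymysql.cursors.DictCursor) as cur:
>                 cur.execute(query)
>                 for row in cur:
>                     # the real preprocessing is much more involved than this
>                     batch.append({"id": row["id"], "title": row["title"]})
>                     if len(batch) >= BATCH_SIZE:
>                         post_batch(batch)
>                         batch = []
>             if batch:
>                 post_batch(batch)
> 
>         def post_batch(docs):
>             # Plain JSON POST to the update handler; commits are left to autoCommit.
>             r = requests.post(SOLR_UPDATE_URL,
>                               data=json.dumps(docs),
>                               headers={"Content-Type": "application/json"})
>             r.raise_for_status()
> 
>     In the real system several of these subprocesses run concurrently to use all the CPU cores, which is presumably why more than one update thread ends up hitting the failing IndexWriter at roughly the same time.
> 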
>     When the exception occurs, we always see something like the following in the solr.log
>     
>     ERROR - 2017-06-06 14:37:34.639; [   x:stresstest1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id med-27840-00384802 to the index; possible analysis error.
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178)
>     ...
>     Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
>     at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
>     ... 42 more
> Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
>     at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
>     at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>     at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
>         at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
>     at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
>     at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
>     at org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>     at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>     at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>     at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
>     at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
>     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>     Suppressed: org.apache.lucene.index.CorruptIndexException: checksum status indeterminate: remaining=0, please run checkindex for more details (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
>         at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
>         at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
>         ... 12 more
>          
>     This is usually followed in very short order by similar exceptions as other UpdateHandler threads hit the same IOException.
>     
>     I've also seen what I assume is a related error during autocommits:
>     
> INFO  - 2017-05-30 18:01:35.264; [   x:build0530] org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> ERROR - 2017-05-30 18:01:36.884; [   x:build0530] org.apache.solr.common.SolrException; auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
>     at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-2060254071 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
>     at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
>     at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
>     at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>     at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>     at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>     at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
>     at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
>     at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
>     ... 10 more
>     
>     Other Observations:
>     
>         It's not associated with a specific Lucene index file type (the 'read past EOF' error has been reported on .fnm, .nvm, .tip, .si, and .dvm files).
>         I've configured merges to use the LogByteSizeMergePolicyFactory and the TieredMergePolicyFactory, and I see the failure with either.
>         The file system (~600GB) is never more than 50% full, so disk space is not an issue.
>         I've seen this occur with indexes on both ext4 and xfs file systems (which have been fsck'ed/repaired, and we're not seeing any hardware problems reported in the system logs). These file systems are all on SSDs.
>         Solr is started with a 5GB heap and I haven't seen heap usage exceed 3GB; also, there is no concurrent query activity during the indexing process.
>         I can recover from this error by unloading the core, running fixindex (which reports no errors), reloading the core, and continuing indexing from a checkpoint in the indexing script; this procedure is roughly sketched below.
>         
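>     The recovery procedure, roughly as a script. Assumptions worth flagging: CoreAdmin is reachable over HTTP on localhost, Lucene's CheckIndex run with -exorcise stands in for the 'fixindex' step above, re-registering the core with CREATE stands in for "reloading" (a RELOAD is not possible once the core is unloaded), and the jar location, instanceDir and core name are invented for the example (the index path comes from the stack trace above).
> 
>         import subprocess
>         import requests
> 
>         SOLR = "http://localhost:8983/solr"                   # hypothetical host
>         CORE = "stresstest1"                                  # hypothetical core name
>         INSTANCE_DIR = "/indexes/solrhome/stresstest1"        # hypothetical instanceDir
>         INDEX_DIR = "/indexes/solrindexes/stresstest1/index"
>         LUCENE_JAR = "/opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.3.0.jar"  # assumed location
> 
>         # 1. Unload the core; the index files stay on disk.
>         requests.get(SOLR + "/admin/cores",
>                      params={"action": "UNLOAD", "core": CORE}).raise_for_status()
> 
>         # 2. Check (and if necessary repair) the index; -exorcise drops unrecoverable segments.
>         subprocess.check_call(["java", "-cp", LUCENE_JAR,
>                                "org.apache.lucene.index.CheckIndex", INDEX_DIR, "-exorcise"])
> 
>         # 3. Register the core again so Solr picks the checked index back up.
>         requests.get(SOLR + "/admin/cores",
>                      params={"action": "CREATE", "name": CORE, "instanceDir": INSTANCE_DIR}).raise_for_status()
> 
>         # 4. Restart the indexing script from its last checkpoint (site-specific, elided).
> 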
>     I've created a test rig (in Python) which can be run independently of our environment and workflow, and I have managed to get it to throw the exception (the first stack trace above is from one of those runs).
>     
>     My semi-informed guess is that this is due to a race condition between segment merges and index updates...
>     



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org