Posted to dev@lucene.apache.org by "Hrishikesh Gadre (JIRA)" <ji...@apache.org> on 2015/05/07 22:09:59 UTC

[jira] [Created] (SOLR-7511) Unable to open searcher when chaosmonkey is actively restarting solr and data nodes

Hrishikesh Gadre created SOLR-7511:
--------------------------------------

             Summary: Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
                 Key: SOLR-7511
                 URL: https://issues.apache.org/jira/browse/SOLR-7511
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.10.3
            Reporter: Hrishikesh Gadre


I have a working chaos-monkey setup that periodically kills (and restarts) Solr and data nodes in a round-robin fashion. I also wrote a simple Solr client that periodically indexes and queries a bunch of documents. After running the test for some time, Solr returns an incorrect number of documents, and in the background I see the following errors:

org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
        ... 8 more
Caused by: java.io.EOFException: read past EOF
        at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
        at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
        at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
        at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
        at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
        at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
        at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
        at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)

The issue here is that the index state for one of the replicas is corrupt (verified using the Lucene CheckIndex tool), so Solr is not able to load the core on that particular instance.
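For reference, the verification was roughly equivalent to the following CheckIndex sketch (the index path argument is a placeholder; since this deployment reads the index through the block cache, the directory may first need to be made reachable as a plain local Lucene directory, e.g. by copying it off HDFS):

import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VerifyReplicaIndex {
  public static void main(String[] args) throws Exception {
    // args[0]: path to the replica's index directory (placeholder; must be
    // reachable as a local directory, e.g. a local copy of the HDFS index)
    Directory dir = FSDirectory.open(new File(args[0]));
    try {
      CheckIndex checker = new CheckIndex(dir);
      checker.setInfoStream(System.out);            // print per-segment details
      CheckIndex.Status status = checker.checkIndex();
      System.out.println(status.clean ? "index is clean" : "index is CORRUPT");
    } finally {
      dir.close();
    }
  }
}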

Interestingly, when the other healthy replica comes online, it tries to peer-sync with this failing replica, gets an error, and then also moves into the recovering state. As a result, this particular shard is completely unavailable for read/write requests. Here are sample log entries from the healthy replica:

Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
        at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)


2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms

I am able to reproduce this problem consistently.
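A minimal sketch of the kind of indexing/query client described above (the ZooKeeper ensemble, collection name, and field names here are placeholders, not the actual test values) would look roughly like this, using the SolrJ API from the 4.10.x line:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexAndQueryLoop {
  public static void main(String[] args) throws Exception {
    // ZooKeeper connect string and collection name are placeholders.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
    server.setDefaultCollection("customers");
    try {
      long expected = 0;
      for (int batch = 0; batch < 1000; batch++) {
        // Index a small batch of uniquely keyed documents.
        for (int i = 0; i < 100; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + batch + "-" + i);
          doc.addField("name_s", "customer-" + i);
          server.add(doc);
        }
        server.commit();
        expected += 100;

        // Query everything back and compare against the expected count.
        long found = server.query(new SolrQuery("*:*")).getResults().getNumFound();
        if (found != expected) {
          System.err.println("Mismatch: expected " + expected + " docs, found " + found);
        }
        Thread.sleep(5000);   // leave time for nodes to be killed/restarted
      }
    } finally {
      server.shutdown();
    }
  }
}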



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
