Posted to dev@lucene.apache.org by "Frank J Kelly (JIRA)" <ji...@apache.org> on 2016/02/25 19:59:18 UTC

[jira] [Commented] (SOLR-4900) Leader election deadlock after restarting leader in 4.2.1

    [ https://issues.apache.org/jira/browse/SOLR-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167649#comment-15167649 ] 

Frank J Kelly commented on SOLR-4900:
-------------------------------------

Hit this also. Now we just stop the service first, wait for a new leader to be elected, and only then restart (a rough sketch of the wait is below).

Surely there is some way to resolve this "easily"?
If Solr says it is not the leader, shouldn't ZooKeeper just elect a new leader?
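
For reference, here is roughly how one can wait for a leader before restarting. This is only a sketch, assuming a reasonably recent SolrJ API (8.x-style CloudSolrClient; in 4.x the equivalent class was CloudSolrServer, so names will differ). The ZooKeeper address "zkhost:2181" and the 5-minute deadline are placeholders; the collection "browse" and "shard2" are taken from the report below.

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;

    public class WaitForLeader {
        public static void main(String[] args) throws Exception {
            // "zkhost:2181" is a placeholder for the real ZooKeeper ensemble.
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("zkhost:2181"), Optional.empty()).build()) {
                client.connect();
                long deadline = System.currentTimeMillis() + 5 * 60 * 1000L; // give up after 5 minutes
                while (System.currentTimeMillis() < deadline) {
                    ClusterState state = client.getZkStateReader().getClusterState();
                    DocCollection coll = state.getCollectionOrNull("browse");
                    Slice slice = (coll == null) ? null : coll.getSlice("shard2");
                    Replica leader = (slice == null) ? null : slice.getLeader();
                    // Only trust the leader entry if its node is actually live.
                    if (leader != null && state.getLiveNodes().contains(leader.getNodeName())) {
                        System.out.println("Leader for shard2: " + leader.getCoreUrl());
                        return; // safe to restart the stopped node now
                    }
                    Thread.sleep(1000); // no live leader yet, poll again
                }
                System.err.println("Timed out waiting for a leader of shard2");
            }
        }
    }

Checking the shard's leader by hand in the Cloud tab of the admin UI amounts to the same thing.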

> Leader election deadlock after restarting leader in 4.2.1
> ---------------------------------------------------------
>
>                 Key: SOLR-4900
>                 URL: https://issues.apache.org/jira/browse/SOLR-4900
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.2
>         Environment: Linux 64 bit, Tomcat 6.0.35, Java 6u27 64 bit
>            Reporter: John Guerrero
>            Assignee: Mark Miller
>
> Copying post from http://lucene.472066.n3.nabble.com/Leader-election-deadlock-after-restarting-leader-in-4-2-1-td4067988.html
> SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), java 6u27 64 bit 
> 6 nodes, 2 shards, 3 replicas each.  Names changed to r1s2 (replica1 - shard 2), r2s2, and r3s2 for each replica in shard 2. 
> *What we see*:
> * Under production load, we restart a leader (r1s2), and observe in the cloud admin 
> that the old leader is in state "Down" and no new leader is ever elected. 
> * The system will stay like this until we stop the old leader (or cause a ZK timeout...see below). 
> *Please note*: the leader is killed, then kill -9'd 5 seconds later, before restarting.  We have since changed this. 
> *Digging into the logs on the old leader (r1s2 = replica1-shard 2)*:
> * The old leader restarted at 5:23:29 PM, but appears to be stuck in SolrDispatchFilter.init() -- (See recovery at bottom). 
> * It doesn't want to become leader, possibly due to the unclean shutdown. 
> May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions 
> INFO: PeerSync: core=browse url=http://r1s2:8080/solr  Our versions are too old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992 
> * It then tries to recover, but cannot, because there is no leader. 
> May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log 
> SEVERE: Error while trying to recover. core=browse:org.apache.solr.common.SolrException: No registered leader was found, collection:browse slice:shard2 
> * Meanwhile, it appears that blocking in init() prevents the http-8080 handler from starting (see recovery at bottom). 
> *Digging into the other replicas (r2s2)*:
> * For some reason, the old leader (r1s2) remains in the list of replicas that r2s2 attempts to sync to. 
> May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync 
> INFO: PeerSync: core=browse url=http://r2s2:8080/solr START replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/] nUpdates=100 
> * This apparently fails (30-second timeout), possibly because the http-8080 handler has not yet started on r1s2. 
> May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse 
> WARNING: PeerSync: core=browse url=http://r2s2:8080/solr  exception talking to http://r1s2:8080/solr/browse/, failed 
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://r1s2:8080/solr/browse
> *At this point, the cluster will remain indefinitely without a leader, if nothing else changes.*
> But in this particular instance, we took some stack and heap dumps from r1s2, which paused java 
> long enough to cause a *zookeeper timeout on the old leader (r1s2)*: 
> May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run 
> INFO: Client session timed out, have not heard from server in 38226ms for sessionid 0x23d28e0f584005d, closing socket connection and attempting reconnect 
> Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and succeeded in becoming leader: 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr START replicas=[http://r2s2:8080/solr/browse/] nUpdates=100 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Received 100 versions from r2s2:8080/solr/browse/ 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Our versions are newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded 
> Now that we have a leader, r1s2 can succeed in recovery and finish SolrDispatchFilter.init(), 
> apparently allowing the http-8080 handler to start (r1s2). 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay 
> INFO: No replay needed. core=browse 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery 
> INFO: Replication Recovery was successful - registering as Active. core=browse 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: publishing core=browse state=active 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: numShards not found on descriptor - reading it from system property 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery 
> INFO: Finished recovery process. core=browse 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run 
> INFO: Starting recovery process.  core=browse recoveringAfterStartup=false 
> May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState 
> INFO: Updating cloud state from ZooKeeper... 
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init 
> INFO: user.dir=/ 
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init 
> *INFO: SolrDispatchFilter.init() done*
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: publishing core=browse state=recovering 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: numShards not found on descriptor - reading it from system property 
> May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient 
> INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig deployDirectory 
> INFO: Deploying web application directory ROOT 
> May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start 
> *INFO: Starting Coyote HTTP/1.1 on http-8080  #<---------------------- http-8080 started!*
> May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start 
> INFO: Starting Coyote AJP/1.3 on ajp-8009 
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start 
> *INFO: Server startup in 668324 ms #<------------------------------ 668 sec = 11 minutes to start Catalina.*
> *Our Workaround*:
> * We changed our script to allow 15 seconds for a clean shutdown before falling back to kill -9 (the same grace-period idea is sketched below). 
> * Also, we no longer do a restart; we just stop the leader and wait for a new leader.  This 
> still results in a few "No registered leader was found" exceptions, but at least the duration is short. 
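> *Illustration only*: the same grace-period idea expressed in Java 9+ terms (the real workaround is a shell script; the PID argument and the 15-second timeout here are placeholders): 
>
>     import java.util.concurrent.TimeUnit;
>
>     public class StopWithGracePeriod {
>         public static void main(String[] args) throws Exception {
>             long pid = Long.parseLong(args[0]); // PID of the Tomcat/Solr process
>             ProcessHandle handle = ProcessHandle.of(pid)
>                     .orElseThrow(() -> new IllegalArgumentException("No such process: " + pid));
>
>             handle.destroy(); // polite shutdown first (the plain kill / SIGTERM equivalent)
>
>             // Give Solr time to close its ZooKeeper session and unregister cleanly.
>             boolean exited = handle.onExit()
>                     .thenApply(h -> true)
>                     .completeOnTimeout(false, 15, TimeUnit.SECONDS)
>                     .get();
>
>             if (!exited) {
>                 handle.destroyForcibly(); // last resort, the kill -9 equivalent
>             }
>         }
>     }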



