Posted to solr-user@lucene.apache.org by Gopal Patwa <go...@gmail.com> on 2014/02/12 00:01:47 UTC

Replica node down but zookeeper clusterstate not updated

Solr 4.6.1 (SolrCloud admin console view attached)
ZooKeeper 3.4.5, 3-node ensemble

In my test setup, I have a 3-node SolrCloud cluster with 2 shards. Today we had
a power failure and all nodes went down.

I started the 3-node ZooKeeper ensemble first, then the 3 SolrCloud nodes. One
replica's IP address had changed due to dynamic IP allocation, but the
ZooKeeper clusterstate was not updated with the new IP address; it is still
holding the old IP address for that bad node.
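
My understanding is that each node publishes whatever host/IP the host settings
in solr.xml resolve to at startup. For reference, the relevant settings
(new-style solr.xml; the values below are illustrative defaults, not copied
from my actual config) look roughly like:

  <solrcloud>
    <!-- an empty host means Solr auto-detects the machine's address at startup -->
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8080}</int>
    <str name="hostContext">${hostContext:solr}</str>
  </solrcloud>

I assume that in production one would pin host to a stable hostname instead of
relying on auto-detection, but I would like to confirm that.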

Do I need to manually update the clusterstate in ZooKeeper? What are my options
if this happens in production?
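
For context, I assume a "manual" fix would mean editing /clusterstate.json
directly through the ZooKeeper CLI, something like the following (<zk-host> is
a placeholder for one of my ensemble nodes; I have not actually tried the write):

  # dump the current cluster state; the stale base_url for 10.249.132.35 shows up here
  ./zkCli.sh -server <zk-host>:2181 get /clusterstate.json

  # presumably the fix would then be writing back an edited copy, e.g.
  # ./zkCli.sh -server <zk-host>:2181 set /clusterstate.json '{ ...same JSON with 10.249.133.10 substituted... }'

That feels error-prone, which is why I am asking whether there is a supported
way to handle it.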

Bad node:
old IP: 10.249.132.35 (still exists in ZooKeeper)
new IP: 10.249.133.10
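
To illustrate, the stale replica entry in /clusterstate.json still looks
roughly like this (the coreNodeName "core_node2" and the state are illustrative,
but the base_url really is the old address):

  "core_node2":{
      "state":"active",
      "core":"venue_shard2_replica2",
      "node_name":"10.249.132.35:8080_solr",
      "base_url":"http://10.249.132.35:8080/solr"}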

Log from Node1:

11:26:25,242 INFO  [STDOUT] 49170786 [Thread-2-EventThread] INFO
 org.apache.solr.common.cloud.ZkStateReader  - A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 3)
11:26:41,072 INFO  [STDOUT] 49186615 [RecoveryThread] INFO
 org.apache.solr.cloud.ZkController  - publishing
core=genre_shard1_replica1 state=recovering
11:26:41,079 INFO  [STDOUT] 49186622 [RecoveryThread] ERROR
org.apache.solr.cloud.RecoveryStrategy  - Error while trying to recover.
core=genre_shard1_replica1:org.apache.solr.client.solrj.SolrServerException:
Server refused connection at: http://10.249.132.35:8080/solr
11:26:41,079 INFO  [STDOUT]     at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:496)
11:26:41,079 INFO  [STDOUT]     at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
11:26:41,079 INFO  [STDOUT]     at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:221)
11:26:41,079 INFO  [STDOUT]     at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:367)
11:26:41,079 INFO  [STDOUT]     at
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:244)
11:26:41,079 INFO  [STDOUT] Caused by:
org.apache.http.conn.HttpHostConnectException: Connection to
http://10.249.132.35:8080 refused


11:27:14,036 INFO  [STDOUT] 49219580 [RecoveryThread] ERROR
org.apache.solr.cloud.RecoveryStrategy  - Recovery failed - trying again...
(9) core=geo_shard1_replica1
11:27:14,037 INFO  [STDOUT] 49219581 [RecoveryThread] INFO
 org.apache.solr.cloud.RecoveryStrategy  - Wait 600.0 seconds before trying
to recover again (10)
11:27:14,958 INFO  [STDOUT] 49220498 [Thread-40] INFO
 org.apache.solr.common.cloud.ZkStateReader  - Updating cloud state from
ZooKeeper...



Log from bad node with new ip address:

11:06:29,551 INFO  [STDOUT] 6234 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - Enough replicas found
to continue.
11:06:29,552 INFO  [STDOUT] 6236 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - I may be the new
leader - try and sync
11:06:29,554 INFO  [STDOUT] 6237 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.SyncStrategy  - Sync replicas to
http://10.249.132.35:8080/solr/venue_shard2_replica2/
11:06:29,555 INFO  [STDOUT] 6239 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=venue_shard2_replica2
url=http://10.249.132.35:8080/solr START replicas=[
http://10.249.132.56:8080/solr/venue_shard2_replica1/] nUpdates=100
11:06:29,556 INFO  [STDOUT] 6240 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=venue_shard2_replica2
url=http://10.249.132.35:8080/solr DONE.  We have no versions.  sync failed.
11:06:29,556 INFO  [STDOUT] 6241 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.SyncStrategy  - Leader's attempt to sync with shard
failed, moving to the next candidate
11:06:29,558 INFO  [STDOUT] 6241 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - We failed sync, but we
have no versions - we can't sync in that case - we were active before, so
become leader anyway
11:06:29,559 INFO  [STDOUT] 6243 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - I am the new leader:
http://10.249.132.35:8080/solr/venue_shard2_replica2/ shard2
11:06:29,561 INFO  [STDOUT] 6245 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.common.cloud.SolrZkClient  - makePath:
/collections/venue/leaders/shard2
11:06:29,577 INFO  [STDOUT] 6261 [Thread-2-EventThread] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=event_shard2_replica2
url=http://10.249.132.35:8080/solr  Received 18 versions from
10.249.132.56:8080/solr/event_shard2_replica1/
11:06:29,578 INFO  [STDOUT] 6263 [Thread-2-EventThread] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=event_shard2_replica2
url=http://10.249.132.35:8080/solr Requesting updates from
10.249.132.56:8080/solr/event_shard2_replica1/ n=10 versions=[1457764666067386368,
1456709993140060160, 1456709989863260160,
1456709986075803648, 1456709971758546944, 1456709179685208064,
1456709137524064256, 1456709130040377344, 1456707444339113984,
1456707435417829376]
11:06:29,775 INFO  [STDOUT] 6459 [Thread-2-EventThread] INFO
 org.apache.solr.update.processor.LogUpdateProcessor  -
[event_shard2_replica2] {add=[4319389 (1456707435417829376), 43185