Posted to solr-user@lucene.apache.org by dmarini <da...@gmail.com> on 2013/10/08 20:16:15 UTC

Solr 4.4.0 Shard Update Errors (503) but cloud graph shows all green?

Hi,

We are running Solr 4.4.0 on a 3-node Linux cluster and have two collections
storing product data with no problems. Yesterday I attempted to create another
one of these collections using the Collections API, but I had forgotten to
upload the config to ZooKeeper prior to making the call, and it failed
spectacularly, as expected :). The API command I ran was to create a 3-shard
collection with a replicationFactor of 2 and maxShardsPerNode set to 2, since
the default understandably causes issues on 3-node clusters.
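
For reference, the CREATE call I issued was roughly equivalent to the SolrJ
sketch below (the host, collection name, and config-set name are placeholders,
and in my actual attempt the config set named by collection.configName had of
course not been uploaded to ZooKeeper yet):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateCollectionSketch {
    public static void main(String[] args) throws Exception {
        // Any node in the cluster can take Collections API calls; host is a placeholder.
        HttpSolrServer server = new HttpSolrServer("http://10.0.1.29:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "MyNewCollection");
        params.set("numShards", "3");
        params.set("replicationFactor", "2");
        params.set("maxShardsPerNode", "2");
        // This config set must already exist in ZooKeeper under /configs
        // before the call is made (the step I forgot).
        params.set("collection.configName", "MyNewCollectionConf");

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        System.out.println(server.request(request)); // NamedList response from the API

        server.shutdown();
    }
}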
Since I ran that command, however, I see the following messages in the red
'SolrCore Initialization Failures' banner when I load up the admin UI on 2 of
the 3 nodes (the following is from one of the boxes):

  MyNewCollection_shard1_replica2:
  org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
  Could not find configName for collection MyNewCollection
  found:[MyFirstCollection, MySecondCollection]

  MyNewCollection_shard3_replica1:
  org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
  Could not find configName for collection MyNewCollection
  found:[MyFirstCollection, MySecondCollection]

My first question is: how do I get this to go away, since the cores never
actually got created? I looked in the solr directory and I do not see folders
with the core names (which I'm under the impression the implicit core
discovery walk uses to determine which cores to attempt to load).
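
For what it's worth, listing /configs in ZooKeeper shows only the two existing
config sets, which matches the found:[...] part of the exceptions above. A
minimal sketch of that check (the ZK host is a placeholder for whatever
ensemble the cluster uses):

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ListConfigs {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ZooKeeper ensemble Solr is pointed at (host is a placeholder).
        ZooKeeper zk = new ZooKeeper("10.0.1.29:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        try {
            // Config sets live under /configs; these are the names that
            // collection.configName must refer to at CREATE time.
            List<String> configs = zk.getChildren("/configs", false);
            System.out.println("Config sets in ZooKeeper: " + configs);
        } finally {
            zk.close();
        }
    }
}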
Second, and a bit stranger: since I messed up that command, I now appear to be
seeing errors in the admin log (every 2 seconds) when attempting to update
documents in the other 2 collections, which were working fine prior to the
command being run. Specifically, these messages repeat over and over, near
constantly:

  14:07:11 ERROR SolrCmdDistributor  shard update error StdNode:
  http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
  Server at http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2
  returned non ok status:503, message:Service Unavailable

  14:07:11 ERROR SolrCore  Request says it is coming from leader, but we are the leader:
  distrib.from=http://10.0.1.30:8983/solr/MyFirstCollection_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2

  14:07:11 ERROR SolrCore  org.apache.solr.common.SolrException:
  Request says it is coming from leader, but we are the leader

  14:07:11 WARN  RecoveryStrategy  Stopping recovery for
  zkNodeName=core_node1 core=MyFirstCollection_shard1_replica2

  14:07:11 WARN  RecoveryStrategy  We have not yet recovered - but we are now the leader!
  core=MyFirstCollection_shard1_replica2
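
For context, the updates that trigger these errors are ordinary document adds
and commits. Our real indexing path is more involved, but it boils down to
something like this sketch (the client type, ZK host, collection name, and
fields here are simplified placeholders, not our actual code):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexSketch {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer routes updates using the cluster state in ZooKeeper
        // (ZK host and collection name are placeholders).
        CloudSolrServer server = new CloudSolrServer("10.0.1.29:2181");
        server.setDefaultCollection("MyFirstCollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-product-1");
        doc.addField("name", "Example product");

        server.add(doc);   // adds like this are what coincide with the shard update errors
        server.commit();
        server.shutdown();
    }
}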
The first error worries me the most, as I think I'm losing data, but I can
query that shard directly from that machine with no issues, and the cloud view
from ALL of the machines shows totally green.
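
To be concrete, the direct check I'm doing is roughly along these lines (the
query itself is just a placeholder; distrib=false keeps the search on this one
core):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DirectShardCheck {
    public static void main(String[] args) throws Exception {
        // Point straight at the replica core that is returning 503s for updates.
        HttpSolrServer core = new HttpSolrServer(
                "http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2");

        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", "false"); // query only this core, no distributed search

        QueryResponse rsp = core.query(q);
        System.out.println("numFound on this core: " + rsp.getResults().getNumFound());
        core.shutdown();
    }
}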
I'm not sure how the failed command got the system into this state, and I'm
kicking myself for making that mistake in the first place, but I'm completely
at a loss for how to recover, since these are live collections that I can't
take down without incurring significant downtime.

Any ideas? Will reloading the cores that are throwing these messages help? Can
ZooKeeper and Solr end up with different ideas as to who the leader is for
that shard, and if so, how do I re-introduce consistency there?
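
To explain what I mean by the leader question: as I understand it, in Solr 4.x
ZooKeeper's view of which replica leads each shard is recorded in
/clusterstate.json, so something like the sketch below (ZK host again a
placeholder) is what I was planning to use to compare ZooKeeper's view against
what the cores themselves claim:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ShowClusterState {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // ZK host is a placeholder; use the ensemble the cluster is pointed at.
        ZooKeeper zk = new ZooKeeper("10.0.1.29:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        try {
            // Shards, replicas, and the replica marked "leader":"true" for each
            // shard are all stored in this single znode in Solr 4.x.
            byte[] data = zk.getData("/clusterstate.json", false, null);
            System.out.println(new String(data, "UTF-8"));
        } finally {
            zk.close();
        }
    }
}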
Appreciate any help that can be offered.

Thanks,
--Dave


