Posted to solr-user@lucene.apache.org by gumatias <gu...@matias.com> on 2012/12/21 19:41:53 UTC

Re: Solrcloud not reachable and after restart just a "no servers hosting shard"

I'm getting the same error. I followed the SolrCloud examples and it didn't
work. Here's basically what I've done:

EXPERIMENT 1: start shards and index documents, search for documents in all
replicas

# Starting Shards
- Shard1 Leader (with zookeeper)
	java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 -jar start.jar
- Shard1 Replica (with zookeeper)
	java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
- Shard2 Leader (with zookeeper)
	java -Djetty.port=8900 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
- Shard2 Replica
	java -Djetty.port=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar

clusterstate.json: http://dl.dropbox.com/u/7570330/clusterstate.txt
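
For reference, the -DzkHost value above follows from the Jetty ports: a node started with -DzkRun runs its embedded ZooKeeper on the Solr port plus 1000. A minimal sketch of that arithmetic (the helper name zk_host_string is my own, not part of Solr):

```python
# Sketch: derive the embedded ZooKeeper addresses from the Jetty ports.
# Assumption: each -DzkRun node listens for ZooKeeper on its Solr port + 1000.
ZK_PORT_OFFSET = 1000


def zk_host_string(jetty_ports, host="localhost"):
    """Build a -DzkHost value for nodes that run embedded ZooKeeper."""
    return ",".join(f"{host}:{port + ZK_PORT_OFFSET}" for port in jetty_ports)


# The three nodes started with -DzkRun: 8983 (Jetty default), 7574 and 8900.
zk_host = zk_host_string([8983, 7574, 8900])
print(zk_host)
```

The node on port 7500 is started without -DzkRun, so it is left out of the list.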

# Indexing sample document
	java -jar post.jar hd.xml
	
# search in all Shards: number of results found: 2
Note: all shards have the same result
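
Every node returns the same count here probably because SolrCloud distributes the query across all shards by default. To see what each core holds on its own, the query can be sent with distrib=false. A rough sketch that only builds the URLs, assuming the four ports above and the default collection1 core (the helper name is made up for illustration):

```python
from urllib.parse import urlencode


def per_core_query_urls(ports, host="localhost", core="collection1"):
    """Return one non-distributed match-all query URL per node."""
    # rows=0 asks only for the hit count; distrib=false keeps the query local.
    params = urlencode({"q": "*:*", "rows": 0, "distrib": "false"})
    return [f"http://{host}:{port}/solr/{core}/select?{params}" for port in ports]


urls = per_core_query_urls([8983, 7574, 8900, 7500])
for url in urls:
    print(url)
```

A core whose local count is 0 holds no documents, even if the distributed search reports 2.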

EXPERIMENT 2: Kill current Shard1 Leader, expect Shard1 Replica to become
leader, search should still work and results return (is that right?)

# Killing Shard2 Leader

Shard2 Replica logs:
...
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Processed session termination for sessionid: 0x3bbe3403c00001
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x3bbe3403c00000 type:delete cxid:0x4dea zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a Error
Path:/collections/collection1/leaders/shard1 Error:KeeperErrorCode = NoNode
for /collections/collection1/leaders/shard1
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x3bbe3403c00000 type:create cxid:0x4dec zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode =
NodeExists for /overseer
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
shouldIBeLeader
INFO: Checking if I should try and be the leader.
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: I may be the new leader - try and sync
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.SyncStrategy sync
INFO: Sync replicas to
http://Gustavos-MacBook-Pro.local:8900/solr/collection1/
Dec 21, 2012 11:57:39 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=collection1
url=http://Gustavos-MacBook-Pro.local:8900/solr START
replicas=[http://Gustavos-MacBook-Pro.local:8983/solr/collection1/]
nUpdates=100
Dec 21, 2012 11:57:39 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=collection1
url=http://Gustavos-MacBook-Pro.local:8900/solr DONE.  We have no versions. 
sync failed.
Dec 21, 2012 11:57:39 AM org.apache.solr.common.SolrException log
SEVERE: Sync Failed
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
rejoinLeaderElection
INFO: There is a better leader candidate than us - going back into recovery
Dec 21, 2012 11:57:39 AM org.apache.solr.update.DefaultSolrCoreState
doRecovery
INFO: Running recovery - first canceling any ongoing recovery
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process.  core=collection1
recoveringAfterStartup=false
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Attempting to PeerSync from
http://Gustavos-MacBook-Pro.local:8983/solr/collection1/ core=collection1 -
recoveringAfterStartup=false
Dec 21, 2012 11:57:39 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=collection1
url=http://Gustavos-MacBook-Pro.local:8900/solr START
replicas=[http://Gustavos-MacBook-Pro.local:8983/solr/collection1/]
nUpdates=100
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x3bbe3403c00000 type:delete cxid:0x4df3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a Error
Path:/collections/collection1/leaders/shard1 Error:KeeperErrorCode = NoNode
for /collections/collection1/leaders/shard1
Dec 21, 2012 11:57:39 AM org.apache.solr.update.PeerSync sync
WARNING: no frame of reference to tell of we've missed updates
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: PeerSync Recovery was not successful - trying replication.
core=collection1
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Starting Replication Recovery. core=collection1
Dec 21, 2012 11:57:39 AM org.apache.solr.client.solrj.impl.HttpClientUtil
createClient
INFO: Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x3bbe3403c00000 type:create cxid:0x4df4 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode =
NodeExists for /overseer
Dec 21, 2012 11:57:39 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=180000
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.quorum.LearnerHandler
run
SEVERE: Unexpected exception causing shutdown while sock still open
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:375)
	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:416)
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.ClientCnxn$SendThread run
INFO: Unable to read additional data from server sessionid 0x3bbe3403c00000,
likely server has closed socket, closing socket connection and attempting
reconnect
Dec 21, 2012 11:57:39 AM
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker run
WARNING: Connection broken for id 0, my id = 2, error = java.io.IOException:
Channel eof
Dec 21, 2012 11:57:39 AM org.apache.zookeeper.server.quorum.LearnerHandler
run
WARNING: ******* GOODBYE /127.0.0.1:58549 ********
Dec 21, 2012 11:57:39 AM
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker run
WARNING: Interrupting SendWorker
Dec 21, 2012 11:57:39 AM
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker run
WARNING: Interrupted while waiting for message on queue
java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:347)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:622)
Dec 21, 2012 11:57:39 AM
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker run
WARNING: Send worker leaving thread
Dec 21, 2012 11:57:40 AM org.apache.zookeeper.ClientCnxn$SendThread
startConnect
INFO: Opening socket connection to server localhost/127.0.0.1:9900
Dec 21, 2012 11:57:40 AM org.apache.zookeeper.ClientCnxn$SendThread
primeConnection
INFO: Socket connection established to localhost/127.0.0.1:9900, initiating
session
Dec 21, 2012 11:57:40 AM org.apache.zookeeper.server.NIOServerCnxn$Factory
run
INFO: Accepted socket connection from /127.0.0.1:58930
Dec 21, 2012 11:57:40 AM org.apache.zookeeper.server.NIOServerCnxn
readConnectRequest
INFO: Client attempting to renew session 0x3bbe3403c00000 at
/127.0.0.1:58930
Dec 21, 2012 11:57:40 AM org.apache.zookeeper.server.NIOServerCnxn
finishSessionInit
INFO: Established session 0x3bbe3403c00000 with negotiated timeout 15000 for
client /127.0.0.1:58930
Dec 21, 2012 11:57:40 AM org.apache.zookeeper.ClientCnxn$SendThread
readConnectResult
INFO: Session establishment complete on server localhost/127.0.0.1:9900,
sessionid = 0x3bbe3403c00000, negotiated timeout = 15000
Dec 21, 2012 11:57:40 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=179497
Dec 21, 2012 11:57:40 AM org.apache.solr.common.cloud.ZkStateReader
updateClusterState
INFO: Updating cloud state from ZooKeeper... 
Dec 21, 2012 11:57:40 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=178996
Dec 21, 2012 11:57:41 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=178494
Dec 21, 2012 11:57:41 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=177992
Dec 21, 2012 11:57:42 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=177491
Dec 21, 2012 11:57:42 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=176989
Dec 21, 2012 11:57:43 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=176488
Dec 21, 2012 11:57:43 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=175986
Dec 21, 2012 11:57:44 AM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=175484
...
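
The repeated waitForReplicasToComeUp lines can be summarized with a small script. This is only a sketch: it assumes timeoutin is the remaining wait in milliseconds (it starts at 180000 and drops by roughly 500 per line, which matches the timestamps), and the sample text is just the first and last such lines from the log above.

```python
import re

# total/found/timeoutin may be split across lines by mail wrapping, so \s+ is used.
WAIT_RE = re.compile(r"total=(\d+)\s+found=(\d+)\s+timeoutin=(\d+)")


def parse_wait_lines(log_text):
    """Return (total, found, timeoutin) tuples for each waiting message."""
    return [tuple(int(g) for g in m.groups()) for m in WAIT_RE.finditer(log_text)]


sample = """INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=180000
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=175484
"""

waits = parse_wait_lines(sample)
elapsed_ms = waits[0][2] - waits[-1][2]
print(waits, elapsed_ms)
```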

Notes:

- Shard1 replica doesn't become the leader and keeps logging "Waiting until
we see more replicas up"

- Search results for all Shards:
	<lst name="error">
		<str name="msg">no servers hosting shard:</str>
		<int name="code">503</int>
	</lst>
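
For anyone scripting against this, the error block can be picked apart with the standard library. A minimal sketch, assuming the response uses the lst/str/int layout shown above (the function name is hypothetical):

```python
import xml.etree.ElementTree as ET


def parse_solr_error(xml_text):
    """Return (code, msg) from a Solr XML error block, or None if absent."""
    root = ET.fromstring(xml_text)
    # Accept either the bare <lst name="error"> fragment or a full response.
    if root.get("name") == "error":
        err = root
    else:
        err = root.find(".//lst[@name='error']")
    if err is None:
        return None
    msg = err.findtext("str[@name='msg']")
    code = err.findtext("int[@name='code']")
    return (int(code) if code is not None else None, msg)


fragment = """<lst name="error">
  <str name="msg">no servers hosting shard:</str>
  <int name="code">503</int>
</lst>"""

error = parse_solr_error(fragment)
print(error)
```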
	
Dump: http://dl.dropbox.com/u/7570330/dump.txt

What am I doing wrong?



--
View this message in context: http://lucene.472066.n3.nabble.com/Solrcloud-not-reachable-and-after-restart-just-a-no-servers-hosting-shard-tp4009786p4028623.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solrcloud not reachable and after restart just a "no servers hosting shard"

Posted by Mark Miller <ma...@gmail.com>.

At least it looks like you're hitting that, based on it mentioning no frame of reference to use to sync with. More importantly though, you're also hitting another issue - see my email to the user list:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3CD0994A2D-04B0-4A80-AF07-9ADD49B8559A@gmail.com%3E

- Mark

On Dec 21, 2012, at 2:10 PM, Mark Miller <ma...@gmail.com> wrote:

> You're hitting https://issues.apache.org/jira/browse/SOLR-3939
> 
> The luck of hashing must have left the guy trying to become the leader without any docs. Due to SOLR-3939, a node with an empty index cannot become the leader.
> 
> - Mark

Re: Solrcloud not reachable and after restart just a "no servers hosting shard"

Posted by Mark Miller <ma...@gmail.com>.

You're hitting https://issues.apache.org/jira/browse/SOLR-3939

The luck of hashing must have left the guy trying to become the leader without any docs. Due to SOLR-3939, a node with an empty index cannot become the leader.
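
One way to check for this is to query each core with distrib=false and wt=json and look at numFound. A rough sketch of the check itself; the node names and counts below are hypothetical placeholders, not taken from this cluster:

```python
import json


def empty_cores(responses):
    """Given {core name: JSON body of a distrib=false query}, list empty cores."""
    return [
        name
        for name, body in responses.items()
        if json.loads(body)["response"]["numFound"] == 0
    ]


# Hypothetical responses: one core got both docs, the other got none.
sample = {
    "node-a": '{"response": {"numFound": 2, "start": 0, "docs": []}}',
    "node-b": '{"response": {"numFound": 0, "start": 0, "docs": []}}',
}

candidates = empty_cores(sample)
print(candidates)
```

Any core listed there has an empty index and would be affected by SOLR-3939 if it tried to take over as leader.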

- Mark
