Posted to solr-user@lucene.apache.org by Manuel Le Normand <ma...@gmail.com> on 2013/08/14 10:00:44 UTC

Wrong leader election leads to shard removal

Hello,
My Solr cluster runs on RH Linux with the Tomcat 7 servlet container.
numShards=40, replicationFactor=2, 40 servers, each hosting 2 replicas. Solr
4.3.

For experimental reasons I split my cluster into 2 sub-clusters, each
containing a single replica of every shard.
When I reconnected these sub-clusters the sync failed (more than 100 docs
had been indexed per shard), so a replication process started on sub-cluster
#2. Because of the transient storage the replication process needs, I removed
all the index data from sub-cluster #2 before reconnecting it, then brought
sub-cluster #2's servers back in batches of 3-4 to avoid high disk load.
The first batches replicated fine, but after a while an internal script
pkilled all the Solr instances, some of them mid-replication. After
restarting the servlet I discovered the disaster: on some of the replicas
that were mid-replication there was a wrong ZooKeeper leader election -
good-state replicas (sub-cluster 1) replicated from empty replicas
(sub-cluster 2), ending up removing all documents in those shards!
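
For reference, a quick way to spot this kind of damage is to count documents
on each core directly, bypassing distributed search. A minimal SolrJ sketch
(core URLs taken from the logs below, SolrJ 4.x API assumed):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ReplicaDocCount {
    public static void main(String[] args) throws Exception {
        // Both cores of shard1; host names/ports as they appear in the logs below.
        String[] cores = {
            "http://solr-prod02:8080/solr/raw_shard1_replica2",
            "http://solr-prod32:8080/solr/raw_shard1_replica1"
        };
        for (String coreUrl : cores) {
            HttpSolrServer core = new HttpSolrServer(coreUrl);
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);               // only numFound is needed
            q.set("distrib", "false");  // query this core only, not the whole collection
            long numFound = core.query(q).getResults().getNumFound();
            System.out.println(coreUrl + " -> " + numFound + " docs");
            core.shutdown();
        }
    }
}

Running something like this against each batch before re-adding it would have
shown the empty cores immediately.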

These are the logs from solr-prod32 (sub-cluster #2 - bad state). Here
shard1_replica1 is elected leader although it was not the leader before the
replication process (and should not have had the higher version number):

2013-08-13 13:39:15.838 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext Enough replicas found to
continue.
2013-08-13 13:39:15.838 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext I may be the new leader -
try and sync
2013-08-13 13:39:15.839 [INFO ] org.apache.solr.cloud.SyncStrategy Sync
replicas to http://solr-prod32:8080/solr/raw_shard1_replica1/
 2013-08-13 13:39:15.841 [INFO ]
org.apache.solr.client.solrj.impl.HttpClientUtil Creating new http client,
config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
2013-08-13 13:39:15.844 [INFO ] org.apache.solr.update.PeerSync PeerSync:
core=raw_shard1_replica1 url=http://solr-prod32:8080/solr START replicas=[
http://solr-prod02:8080/solr/raw_shard1_replica2/] nUpdates=100
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.update.PeerSync PeerSync:
core=raw_shard1_replica1 url=http://solr-prod32:8080/solr DONE. We have
no versions. sync failed.
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.SyncStrategy Leader's
attempt to sync with shard failed, moving to the next candidate
2013-08-13 13:39:15.847 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we
have no versions - we can't sync in that case - we were active before, so
become leader anyway
2013-08-13 13:39:15.847 [INFO ]
org.apache.solr.cloud.ShardLeaderElectionContext I am the new leader:
http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.common.cloud.SolrZkClient
makePath: /collections/raw/leaders/shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader
A cluster state change: WatchedEvent state:SyncConnected
type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 40)

While on solr-prod02 (sub-cluster #1 - good state) I get:
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController
publishing core=raw_shard1_replica2 state=down
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController
numShards not found on descriptor - reading it from system property
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.core.CoreContainer
registering core: raw_shard1_replica2
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.cloud.ZkController Register
replica - core:raw_shard1_replica2 address:
http://solr-prod02:8080/solr collection:raw shard:shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader
A cluster state change: WatchedEvent state:SyncConnected
type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 40)
2013-08-13 13:39:17.480 [INFO ] org.apache.solr.cloud.ZkController We are
http://solr-prod02:8080/solr/raw_shard1_replica2/ and leader is
http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController No
LogReplay needed for core=raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController Core
needs to recover:raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.update.DefaultSolrCoreState
Running recovery - first canceling any ongoing recovery
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.common.cloud.ZkStateReader
Updating cloud state from ZooKeeper...
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.cloud.RecoveryStrategy
Starting recovery process. core=raw_shard1_replica2

Why was the wrong replica elected leader?

Thanks

Re: Wrong leader election leads to shard removal

Posted by Erick Erickson <er...@gmail.com>.
bq: why does it replicate the whole index instead of copying just the
newly formed segments

because there's no guarantee that the segments are identical on the
nodes that make up a shard. The simplest way to conceptualize this
is to consider the autocommit settings on the servers. Let's say
the hard commits (which close the current segment and open a new
one) are all set to 1 minute. The fact that the servers are starting
at different times means that the segments on one node will close at
different times than on another node.

And that doesn't even consider the complicated cases of possibly
having different segments merged depending on the start/stop
pattern on one of the nodes....
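
For illustration, a quick way to see that two replicas of the same shard do
not share a commit point (so there is no common set of "newer segments" to
copy) is to ask each core's ReplicationHandler for its index version. A
minimal SolrJ sketch, with core URLs taken from the logs earlier in the
thread and the 4.x SolrJ API assumed:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class IndexVersionCheck {
    static void print(String coreUrl) throws Exception {
        HttpSolrServer core = new HttpSolrServer(coreUrl);
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/replication");  // route the request to the ReplicationHandler
        q.set("command", "indexversion");
        QueryResponse rsp = core.query(q);
        System.out.println(coreUrl
                + " indexversion=" + rsp.getResponse().get("indexversion")
                + " generation=" + rsp.getResponse().get("generation"));
        core.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // The two cores of shard1, as named in the logs earlier in the thread.
        print("http://solr-prod02:8080/solr/raw_shard1_replica2");
        print("http://solr-prod32:8080/solr/raw_shard1_replica1");
    }
}

Even replicas holding the same documents will normally report different
generations, which is essentially the point above: there is no shared set of
segments to copy, so recovery falls back to a full replication.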

Best,
Erick



Re: Wrong leader election leads to shard removal

Posted by Ido Kissos <id...@gmail.com>.
Yes, I have erased the tlog in replica 2, and it appears that the first
replica's tlog was corrupted because of an ungraceful servlet shutdown.
Unfortunately there was no log for it, nor did the ZooKeeper log record
anything about this. Is there a place I could check in ZooKeeper to see what
exactly happened during this election?
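
For reference, the per-shard election state lives in ZooKeeper and can be
inspected directly. A minimal sketch using the plain ZooKeeper Java client;
the connect string is a placeholder, and the znode layout shown is the
standard SolrCloud 4.x one (the /collections/raw/leaders/shard1 path also
appears in the logs above), so adjust it to your setup:

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class DumpShardElection {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string - point it at the same ensemble Solr uses.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op */ }
        });
        String collection = "raw", shard = "shard1";

        // The currently registered leader and when its znode was created.
        Stat stat = new Stat();
        byte[] data = zk.getData("/collections/" + collection + "/leaders/" + shard, false, stat);
        System.out.println("leader props: " + new String(data, "UTF-8")
                + " (czxid=" + stat.getCzxid() + ", ctime=" + stat.getCtime() + ")");

        // The ephemeral-sequential election candidates; the lowest sequence number is next in line.
        List<String> candidates = zk.getChildren(
                "/collections/" + collection + "/leader_elect/" + shard + "/election", false);
        for (String node : candidates) {
            System.out.println("election node: " + node);
        }
        zk.close();
    }
}

The same getData call on /clusterstate.json (the path in the watch events
above) shows what state each replica was published in at the time.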

Partly related - about the transient disk space that needs to be free for
replication after a sync failure - why does it replicate the whole index
instead of copying just the newly formed segments? That would require much
less space than a full copy, wouldn't it?
And why not make the 100-doc limit for tlog sync configurable?

Re: Wrong leader election leads to shard removal

Posted by Mark Miller <ma...@gmail.com>.
On Aug 14, 2013, at 9:01 AM, Manuel Le Normand <ma...@gmail.com> wrote:

> Does this sound like the scenario that happened:
> By removing the index dir from replica 2 I also removed the tlog

Did you also remove the tlog dir?

It's normally:

data/index
data/tlog 

> from which
> ZooKeeper extracts the version of each replica and decides which one should
> be elected leader. As replica 2 had no tlog, ZooKeeper had no way to compare
> the two registered replicas, so it just picked one of them arbitrarily to
> lead, resulting in empty replicas being elected.

If one had no tlog, it should have recovered from the one that still had a tlog.

> 
> How does ZooKeeper compare the two tlogs to know which one is more
> recent? Does it not rely on the version number shown in the admin UI?

It looks at the recent IDs in the tlogs of both and compares them.
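
For illustration, the recent-version lists that peer sync compares can be
fetched by hand from each replica's realtime-get handler - this is the
nUpdates=100 window visible in the PeerSync log line above. A minimal SolrJ
sketch (core URLs from the logs, 4.x SolrJ API assumed); an empty list
corresponds to the "We have no versions" message in the log:

import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class RecentVersions {
    @SuppressWarnings("unchecked")
    static List<Long> recentVersions(String coreUrl) throws Exception {
        HttpSolrServer core = new HttpSolrServer(coreUrl);
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/get");   // realtime-get handler, which also serves version lists
        q.set("getVersions", "100");   // the same 100-update window seen as nUpdates=100
        List<Long> versions = (List<Long>) core.query(q).getResponse().get("versions");
        core.shutdown();
        return versions == null ? Collections.<Long>emptyList() : versions;
    }

    public static void main(String[] args) throws Exception {
        List<Long> good = recentVersions("http://solr-prod02:8080/solr/raw_shard1_replica2");
        List<Long> bad  = recentVersions("http://solr-prod32:8080/solr/raw_shard1_replica1");
        System.out.println("replica2 recent versions: " + good.size());
        System.out.println("replica1 recent versions: " + bad.size());  // 0 => "We have no versions"
    }
}

Peer sync gives up exactly when its own list comes back empty, which is the
"We have no versions. sync failed." line from the solr-prod32 log.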

- Mark



Re: Wrong leader election leads to shard removal

Posted by Manuel Le Normand <ma...@gmail.com>.
Does this sound like the scenario that happened:
By removing the index dir from replica 2 I also removed the tlog from which
ZooKeeper extracts the version of each replica and decides which one should
be elected leader. As replica 2 had no tlog, ZooKeeper had no way to compare
the two registered replicas, so it just picked one of them arbitrarily to
lead, resulting in empty replicas being elected.

How does ZooKeeper compare the two tlogs to know which one is more recent?
Does it not rely on the version number shown in the admin UI?

