You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Gregory Chanan (JIRA)" <ji...@apache.org> on 2015/11/17 01:58:11 UTC

[jira] [Commented] (SOLR-7198) Deleting a collection during leader election results in left over znodes in ZK

    [ https://issues.apache.org/jira/browse/SOLR-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007731#comment-15007731 ] 

Gregory Chanan commented on SOLR-7198:
--------------------------------------

I was able to reproduce this by adding some timeouts in various places.  It looks like in 4.10.3 this prevents a collection with that name from being created; on trunk you can still create a collection, although this isn't great because it leaves extra unnecessary data in ZK.

> Deleting a collection during leader election results in left over znodes in ZK 
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-7198
>                 URL: https://issues.apache.org/jira/browse/SOLR-7198
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.3
>            Reporter: Vamsee Yarlagadda
>            Assignee: Gregory Chanan
>
> I am seeing this issue while trying to run a collection delete operation in the middle of leader election process.
> Contents of ZK (after issuing collection delete and waiting for some time)
> {code}
>   /
>     /aliases.json
>     /clusterstate.json
>     /collections
>       SolrUpgrade_collection
>         leaders
>            shard2_1
>     /configs
>     /live_nodes
>     /overseer
>     /overseer_elect
>     /solr
>     /solr.xml
> Contents of znode shard2_1:
> version0
> aversion0
> children_count0
> ctimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
> cversion0
> czxid22815
> ephemeralOwner93427899815755800
> mtimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
> mzxid22815
> pzxid22815
> dataLength194
> {
>   "core":"SolrUpgrade_collection_shard2_1_replica2",
>   "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>   "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"
> }
> {code}
> Clusterstate.json before running a delete collection
> {code}
> {
>   "shards":{
>     "shard1":{
>       "range":"80000000-ffffffff",
>       "state":"active",
>       "replicas":{
>         "core_node1":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard1_replica2",
>           "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
>           "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr",
>           "leader":"true"},
>         "core_node2":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard1_replica1",
>           "node_name":"search-testing-c5-ha-2.vpc.cloudera.com:8983_solr",
>           "base_url":"http://search-testing-c5-ha-2.vpc.cloudera.com:8983/solr"}}},
>     "shard2_0":{
>       "range":"0-3fffffff",
>       "state":"active",
>       "replicas":{
>         "core_node5":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_0_replica1",
>           "node_name":"search-testing-c5-ha-3.vpc.cloudera.com:8983_solr",
>           "base_url":"http://search-testing-c5-ha-3.vpc.cloudera.com:8983/solr",
>           "leader":"true"},
>         "core_node7":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_0_replica2",
>           "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
>           "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr"}}},
>     "shard2_1":{
>       "range":"40000000-7fffffff",
>       "state":"active",
>       "replicas":{
>         "core_node8":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_1_replica2",
>           "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>           "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"},
>         "core_node9":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_1_replica3",
>           "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>           "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"}}}},
>   "maxShardsPerNode":"10",
>   "router":"compositeId",
>   "replicationFactor":"2",
>   "autoAddReplicas":"false",
>   "routerSpec":{"name":"compositeId"}}}
> {code}
> As we can notice, shard (*shard2_1*) doesn't have any leader and i can see from the logs that the replicas of the shard just started the leader election process. And here are the Solr logs from one of the above replicas which eventually becomes the leader and registers in ZK even though the collection was deleted.
> {code}
> 2015-03-05 22:05:25,383 INFO org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process for shard shard2_1
> 2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: Checking if I (core=SolrUpgrade_collection_shard2_1_replica2,coreNodeName=core_node8) should try and be the leader.
> 2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: My last published State was Active, it's okay to be the leader.
> 2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: I may be the new leader - try and sync
> 2015-03-05 22:05:25,506 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:25,620 INFO org.apache.solr.core.SolrCore.Request: [SolrUpgrade_collection_shard2_1_replica2] webapp=/solr path=/select params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=1 
> 2015-03-05 22:05:25,623 INFO org.apache.solr.core.SolrCore.Request: [SolrUpgrade_collection_shard2_1_replica3] webapp=/solr path=/select params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=0 
> 2015-03-05 22:05:27,392 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node8-n_0000000001
> 2015-03-05 22:05:27,393 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica3]  CLOSING SolrCore org.apache.solr.core.SolrCore@559b52d3
> 2015-03-05 22:05:27,393 INFO org.apache.solr.update.UpdateHandler: closing DirectUpdateHandler2{commits=0,autocommit maxTime=60000ms,autocommits=0,soft autocommit maxTime=1000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.SolrCoreState: Closing SolrCoreState
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: SolrCoreState ref count has reached 0 - closing IndexWriter
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: closing IndexWriter with IndexWriterCloser
> 2015-03-05 22:05:27,397 INFO org.apache.solr.servlet.SolrDispatchFilter: [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica2&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2} status=0 QTime=5 
> 2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process for shard shard2_1
> 2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,400 WARN org.apache.solr.cloud.LeaderElector: 
> org.apache.solr.common.SolrException: SolrCore not found:SolrUpgrade_collection_shard2_1_replica3 in []
> 	at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:211)
> 	at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
> 	at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
> 	at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> 	at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
> 	at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-03-05 22:05:27,415 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica3] Closing main searcher on request.
> 2015-03-05 22:05:27,438 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:27,456 INFO org.apache.solr.core.CachingDirectoryFactory: Closing HdfsDirectoryFactory - 2 directories currently being tracked
> 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index;done=false>>]
> 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data;done=false>>]
> 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,471 INFO org.apache.solr.core.CachingDirectoryFactory: Removing directory after core close: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,490 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,491 WARN org.apache.solr.cloud.ElectionContext: cancelElection did not find election node to remove /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,493 INFO org.apache.solr.servlet.SolrDispatchFilter: [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica3&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2} status=0 QTime=100 
> 2015-03-05 22:05:27,741 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Sync replicas to http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
> 2015-03-05 22:05:27,888 ERROR org.apache.solr.cloud.SyncStrategy: Sync Failed:org.apache.solr.common.cloud.ZooKeeperException: Could not find collection in zk: SolrUpgrade_collection []
> 	at org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:600)
> 	at org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:588)
> 	at org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:583)
> 	at org.apache.solr.cloud.SyncStrategy.syncWithReplicas(SyncStrategy.java:152)
> 	at org.apache.solr.cloud.SyncStrategy.syncReplicas(SyncStrategy.java:122)
> 	at org.apache.solr.cloud.SyncStrategy.sync(SyncStrategy.java:101)
> 	at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:267)
> 	at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
> 	at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
> 	at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> 	at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
> 	at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.ShardLeaderElectionContext: We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.ShardLeaderElectionContext: I am the new leader: http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/ shard2_1
> 2015-03-05 22:05:27,888 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica2]  CLOSING SolrCore org.apache.solr.core.SolrCore@595167
> 2015-03-05 22:05:27,889 INFO org.apache.solr.update.UpdateHandler: closing DirectUpdateHandler2{commits=1,autocommit maxTime=60000ms,autocommits=0,soft autocommit maxTime=1000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.SolrCoreState: Closing SolrCoreState
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: SolrCoreState ref count has reached 0 - closing IndexWriter
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: closing IndexWriter with IndexWriterCloser
> 2015-03-05 22:05:27,905 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica2] Closing main searcher on request.
> 2015-03-05 22:05:27,911 INFO org.apache.solr.core.CachingDirectoryFactory: Closing HdfsDirectoryFactory - 2 directories currently being tracked
> 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index;done=false>>]
> 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> 2015-03-05 22:05:27,924 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data;done=false>>]
> 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:27,925 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:28,026 INFO org.apache.solr.core.CachingDirectoryFactory: Removing directory after core close: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:28,155 INFO org.apache.solr.common.cloud.SolrZkClient: makePath: /collections/SolrUpgrade_collection/leaders/shard2_1
> 2015-03-05 22:05:28,287 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> {code}
> This might be because of a missing check at line
>  https://github.com/apache/lucene-solr/blob/372e8448021d00d3466b45da8a6e57736354eee8/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java#L138
> As the makePath creates all the missing directories recursively.
> May be we can check whether the collection znode exists before we create the leader znode? Thoughts? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org