Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2013/05/11 09:03:16 UTC

[jira] [Commented] (SOLR-4744) Version conflict error during shard split test

    [ https://issues.apache.org/jira/browse/SOLR-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655181#comment-13655181 ] 

Shalin Shekhar Mangar commented on SOLR-4744:
---------------------------------------------

Consider the following scenario:
# The overseer collection processor asks the overseer to update the state of the parent shard to INACTIVE and the sub shards to ACTIVE
# The parent shard leader receives an update request
# The parent shard leader thinks that it is still the leader of an ACTIVE shard and therefore tries to send the request to the sub shard leaders (a FROMLEADER update containing the "distrib.from.parent" param). This is done asynchronously, so the client has already been given a success status.
# The sub shard leader receives such a request, but its cluster state is already up to date, so it rejects the update, saying that it is already a leader and no longer in construction state.
# The parent shard leader asks the sub shard leader to recover, which is basically a no-op for sub shard leaders
# The sub shard therefore never receives that document update
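
The scenario above can be sketched as a small single-threaded toy model. This is only an illustration under stated assumptions: StaleLeaderRace, ParentLeader, SubShardLeader, Outcome and simulate are hypothetical names, not Solr classes, and the asynchronous forward is modelled simply by ignoring its result.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race described above; none of these types are Solr classes. */
class StaleLeaderRace {
    enum ShardState { CONSTRUCTION, ACTIVE, INACTIVE }

    /** What the client was told and what the sub shard ended up holding. */
    static final class Outcome {
        final boolean clientSawSuccess;
        final boolean subShardHasDoc;
        Outcome(boolean clientSawSuccess, boolean subShardHasDoc) {
            this.clientSawSuccess = clientSawSuccess;
            this.subShardHasDoc = subShardHasDoc;
        }
    }

    /** Sub shard leader with its own cached view of its shard state. */
    static final class SubShardLeader {
        final ShardState cachedOwnState;
        final List<String> docs = new ArrayList<>();
        SubShardLeader(ShardState cachedOwnState) { this.cachedOwnState = cachedOwnState; }

        /** Accepts a forwarded update only while it believes it is under construction. */
        boolean receiveFromParent(String docId) {
            if (cachedOwnState != ShardState.CONSTRUCTION) {
                return false; // models the "not in construction state" rejection
            }
            docs.add(docId);
            return true;
        }
    }

    /** Parent leader whose cached state may lag behind the overseer's update. */
    static final class ParentLeader {
        final ShardState cachedOwnState;
        final List<String> docs = new ArrayList<>();
        ParentLeader(ShardState cachedOwnState) { this.cachedOwnState = cachedOwnState; }

        /** Indexes locally, forwards to the sub shard, and always reports success. */
        boolean update(String docId, SubShardLeader sub) {
            docs.add(docId);
            if (cachedOwnState == ShardState.ACTIVE) {
                // Models the asynchronous forward: its result never reaches the client.
                sub.receiveFromParent(docId);
            }
            return true;
        }
    }

    /** The parent always has the stale view; the flag controls the sub shard's view. */
    static Outcome simulate(boolean subShardSeesNewState) {
        ParentLeader parent = new ParentLeader(ShardState.ACTIVE);
        SubShardLeader sub = new SubShardLeader(
                subShardSeesNewState ? ShardState.ACTIVE : ShardState.CONSTRUCTION);
        boolean ok = parent.update("296", sub);
        return new Outcome(ok, sub.docs.contains("296"));
    }

    public static void main(String[] args) {
        Outcome raced = simulate(true);
        System.out.println("client success=" + raced.clientSawSuccess
                + ", sub shard has doc=" + raced.subShardHasDoc);
    }
}
```

With simulate(true) the client is told the update succeeded while the sub shard never stores the document. With simulate(false) both nodes still hold the old view and the forward is accepted.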

SOLR-4795 exposed the underlying problem clearly. The exceptions in the Jenkins log are now:
{code}
[junit4:junit4]   1> INFO  - 2013-05-10 17:12:00.128; org.apache.solr.update.processor.LogUpdateProcessor; [collection1_shard1_1_replica1] webapp=/sx path=/update params={distrib.from=http://127.0.0.1:47193/sx/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {} 0 1
[junit4:junit4]   1> INFO  - 2013-05-10 17:12:00.128; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/sx path=/update params={distrib.from=http://127.0.0.1:47193/sx/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {add=[296 (1434667890899943424)]} 0 1
[junit4:junit4]   1> ERROR - 2013-05-10 17:12:00.129; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Request says it is coming from parent shard leader but we are not in construction state
[junit4:junit4]   1> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:327)
[junit4:junit4]   1> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:232)
[junit4:junit4]   1> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:394)
[junit4:junit4]   1> 	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
[junit4:junit4]   1> 	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
[junit4:junit4]   1> 	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
[junit4:junit4]   1> 	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
[junit4:junit4]   1> 	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
[junit4:junit4]   1> 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
[junit4:junit4]   1> 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1832)
[junit4:junit4]   1> 	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
[junit4:junit4]   1> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
[junit4:junit4]   1> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
[junit4:junit4]   1> 	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
{code}

These errors occur in the window between the state switch and the moment both the parent leader and the sub shard leader have the latest cluster state.

The possible fixes are:
# Create a new recovery strategy for sub shard replication
# Replicate to sub shard leader synchronously (before local update)
# Switch the parent shard to INACTIVE first, wait for it to receive the cluster state, and then switch the sub shards to ACTIVE. Clients would receive failures on updates for a short time, but clients should already handle such failures (because of host failures), so we should be okay. Sub shard failures must be handled so that we always end up with the shard range being available somewhere.
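
The ordering in the third option could look roughly like the sketch below. It is a minimal model under assumptions, not a proposed patch: OrderedShardSwitch, ParentLeader, onClusterStateChange, acceptsUpdates and switchStates are hypothetical names, and a CountDownLatch stands in for whatever mechanism would confirm that the parent leader has seen the new cluster state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of "parent INACTIVE first, then sub shards ACTIVE"; not Solr code. */
class OrderedShardSwitch {
    enum ShardState { CONSTRUCTION, ACTIVE, INACTIVE }

    /** Parent leader's cached view; the latch opens once it has observed INACTIVE. */
    static final class ParentLeader {
        private volatile ShardState cachedOwnState = ShardState.ACTIVE;
        final CountDownLatch sawInactive = new CountDownLatch(1);

        void onClusterStateChange(ShardState newState) {
            cachedOwnState = newState;
            if (newState == ShardState.INACTIVE) {
                sawInactive.countDown();
            }
        }

        /** Once INACTIVE is seen, updates fail visibly instead of being forwarded. */
        boolean acceptsUpdates() {
            return cachedOwnState == ShardState.ACTIVE;
        }
    }

    /** Returns the ordered list of steps that were actually performed. */
    static List<String> switchStates(ParentLeader parent, long timeoutMs) {
        List<String> steps = new ArrayList<>();
        steps.add("parent -> INACTIVE");
        // Models the cluster state notification arriving on another thread.
        new Thread(() -> parent.onClusterStateChange(ShardState.INACTIVE)).start();
        try {
            if (!parent.sawInactive.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                // A real implementation would have to handle this case so that
                // the shard range stays available somewhere.
                steps.add("timed out; sub shards not activated");
                return steps;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        steps.add("parent leader saw INACTIVE");
        steps.add("sub shards -> ACTIVE");
        return steps;
    }

    public static void main(String[] args) {
        ParentLeader parent = new ParentLeader();
        System.out.println(switchStates(parent, 5000));
        System.out.println("parent accepts updates: " + parent.acceptsUpdates());
    }
}
```

The point of the ordering is that the sub shards are only marked ACTIVE after the parent leader has stopped accepting updates, so an update in the window fails visibly to the client rather than being acknowledged and then dropped.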

Thoughts? [~yseeley@gmail.com], [~markrmiller@gmail.com], [~anshumg]
                
> Version conflict error during shard split test
> ----------------------------------------------
>
>                 Key: SOLR-4744
>                 URL: https://issues.apache.org/jira/browse/SOLR-4744
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: 4.4
>
>
> ShardSplitTest fails sometimes with the following error:
> {code}
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state invoked for collection: collection1
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1 to inactive
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_0 to active
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_1 to active
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.873; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= path=/update params={wt=javabin&version=2} {add=[169 (1432319507166134272)]} 0 2
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.884; org.apache.solr.update.processor.LogUpdateProcessor; [collection1_shard1_1_replica1] webapp= path=/update params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {} 0 1
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.885; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= path=/update params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {add=[169 (1432319507173474304)]} 0 2
> [junit4:junit4]   1> ERROR - 2013-04-14 19:05:26.885; org.apache.solr.common.SolrException; shard update error StdNode: http://127.0.0.1:41028/collection1_shard1_1_replica1/:org.apache.solr.common.SolrException: version conflict for 169 expected=1432319507173474304 actual=-1
> [junit4:junit4]   1> 	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:404)
> [junit4:junit4]   1> 	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> [junit4:junit4]   1> 	at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> [junit4:junit4]   1> 	at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> [junit4:junit4]   1> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> [junit4:junit4]   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> [junit4:junit4]   1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> [junit4:junit4]   1> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> [junit4:junit4]   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> [junit4:junit4]   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> [junit4:junit4]   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [junit4:junit4]   1> 	at java.lang.Thread.run(Thread.java:679)
> [junit4:junit4]   1> 
> [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.886; org.apache.solr.update.processor.DistributedUpdateProcessor; try and ask http://127.0.0.1:41028 to recover
> {code}
> The failure is hard to reproduce and very timing sensitive. These kinds of failures have always been seen right after the "updateshardstate" action.
