Posted to issues@solr.apache.org by "dsmiley (via GitHub)" <gi...@apache.org> on 2023/03/23 03:38:45 UTC

[GitHub] [solr] dsmiley opened a new pull request, #1484: SOLR-16693: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

dsmiley opened a new pull request, #1484:
URL: https://github.com/apache/solr/pull/1484

   https://issues.apache.org/jira/browse/SOLR-16693
   
   RE infamous error: "ClusterState says we are the leader ... but locally we don't think so." from DistributedZkUpdateProcessor.
   
   As it happens, I have a rather simple shard split test in a fork of Solr that hits this failure about half the time (notwithstanding the inherent complexities of shard splits themselves). After debugging it, I came to a similar conclusion: this error should be caught and retried by the caller. It turns out this is as easy as changing the HTTP status code from SERVICE_UNAVAILABLE to INVALID_STATE.
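
A minimal sketch of the retry idea above, not Solr's actual code. It assumes a hypothetical `StatusException` carrying an HTTP-style code and a hypothetical `callWithRetry` helper; the only facts taken from this thread are that SERVICE_UNAVAILABLE maps to 503, INVALID_STATE maps to 510, and a retrying caller treats 510 as worth another attempt.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an exception that carries an HTTP-style status code. */
class StatusException extends RuntimeException {
  final int code;

  StatusException(int code, String message) {
    super(message);
    this.code = code;
  }
}

class RetryOnInvalidState {
  static final int SERVICE_UNAVAILABLE = 503;
  static final int INVALID_STATE = 510;

  /** Runs the action, retrying only when it fails with INVALID_STATE. */
  static <T> T callWithRetry(Supplier<T> action, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    StatusException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (StatusException e) {
        if (e.code != INVALID_STATE) {
          throw e; // e.g. 503: surfaced immediately in this sketch
        }
        last = e; // 510: the server's view of cluster state may settle, so try again
      }
    }
    throw last; // every attempt failed with INVALID_STATE
  }

  public static void main(String[] args) {
    int[] calls = {0};
    String result = callWithRetry(() -> {
      if (calls[0]++ == 0) {
        throw new StatusException(INVALID_STATE, "leader mismatch (simulated)");
      }
      return "indexed";
    }, 3);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

In this model the same failure reported as 503 would reach the caller on the first attempt, which is why the choice of status code alone changes the outcome.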
   
   I see another problem based on my test. A shard being split (a so-called parent shard), or one whose split recently completed (and thus may be in state INACTIVE), receives docs from a client (the test) and forwards them to the sub-shards. But a sub-shard fails with the error shown above, and the failure is not bubbled up to the client; it's swallowed as okay. Changing the status code may fix the invalid-state case but wouldn't help with other general errors (e.g. a host going down suddenly). The result is data loss.
   
   I don't have a test to contribute for this, at least not yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1484:
URL: https://github.com/apache/solr/pull/1484#issuecomment-1505877485

   I would like to revisit one of the changes here, the `Request says it is coming from parent shard leader but we are in active state` flow. I am looking at this from the point of view of the ShardSplitTest failures, and I don't think a retry will help in any way.
   If, after a shard split, the new slice has reached the 'active' state and is still receiving some buffered updates from the old leader, it is not going to recover on retry, because 'active' is the final state (there are no more transitions; the slice goes through construction -> recovery -> active).
   The race window I am seeing is: on the old shard there is a call to `getSubShardLeaders`, which returns the new slice because at the moment of the call its state satisfies `state == Slice.State.CONSTRUCTION || state == Slice.State.RECOVERY`, but by the time the request reaches the new slice the state is 'active'. (For the sake of completeness, the params for the request are `update.distrib=FROMLEADER&distrib.from.parent=shard1&distrib.from=http://127.0.0.1:64511/collection1_shard1_replica_n2/&wt=javabin&version=2`.)
   
   I don't know what the best solution is here, but a retry with the same params will fail again; it probably needs to be a retry with a different set of params, removing all the 'leader' bits.
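
The race described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not Solr's code: `SliceState`, `parentWouldForward`, and `subShardAccepts` are hypothetical names, and the rule that a request without the parent marker is accepted by an active slice is an assumption of the model, not something this thread confirms.

```java
/** Simplified model of the slice lifecycle: construction -> recovery -> active. */
enum SliceState { CONSTRUCTION, RECOVERY, ACTIVE }

class SubShardRace {
  /** Parent side: a sub-shard is a forwarding target only while it is still being built. */
  static boolean parentWouldForward(SliceState seenByParent) {
    return seenByParent == SliceState.CONSTRUCTION || seenByParent == SliceState.RECOVERY;
  }

  /** Sub-shard side: a request marked as coming from the parent leader is rejected once active. */
  static boolean subShardAccepts(boolean fromParentLeader, SliceState currentState) {
    return !(fromParentLeader && currentState == SliceState.ACTIVE);
  }

  public static void main(String[] args) {
    // The parent reads the state at send time...
    SliceState atSend = SliceState.RECOVERY;
    // ...but the slice finishes its last transition before the request arrives.
    SliceState atArrival = SliceState.ACTIVE;

    System.out.println("parent forwards:      " + parentWouldForward(atSend));
    System.out.println("first attempt accepted: " + subShardAccepts(true, atArrival));
    // ACTIVE is final, so an identical retry sees the same state and fails the same way.
    System.out.println("same-params retry:    " + subShardAccepts(true, atArrival));
    // Assumption of this model only: dropping the parent marker lets the request through.
    System.out.println("retry without marker: " + subShardAccepts(false, atArrival));
  }
}
```

Because there is no transition out of the active state in this model, no number of identical retries changes the result; only a change to the request itself does.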


[GitHub] [solr] stillalex commented on pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1484:
URL: https://github.com/apache/solr/pull/1484#issuecomment-1496284849

   For shard split testing we have ShardSplitTest, but it's rather flaky. I am trying to unpack the failures in an effort to make it more stable (#1504). It does simple things: split shard + concurrent adds and deletes. If there are things in your fork's test that are not covered, we could bring them over to that test.


[GitHub] [solr] dsmiley merged pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley merged PR #1484:
URL: https://github.com/apache/solr/pull/1484


[GitHub] [solr] dsmiley commented on pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on PR #1484:
URL: https://github.com/apache/solr/pull/1484#issuecomment-1663060791

   I forgot about this one. I wanted to debug other failures in a test I had locally, but time got away from me and I can't easily reproduce the issue either. I'd like to commit it as it is, marking it as an "improvement". WDYT @stillalex?


[GitHub] [solr] stillalex commented on pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1484:
URL: https://github.com/apache/solr/pull/1484#issuecomment-1663248841

   > I'd like to commit it as it is, marking it as an "improvement". WDYT @stillalex ?
   
   I agree to commit. I also dropped the ball here, but if time permits I would still like to revisit this code to make it a bit more testable. In an ideal world we could verify that the state transitions are safe and what impact this can have on the client side.
   


[GitHub] [solr] dsmiley commented on pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on PR #1484:
URL: https://github.com/apache/solr/pull/1484#issuecomment-1482259248

   CHANGES.txt suggestion as an Improvement (JIRA also shows as Improvement):
   
   * SOLR-11685: When SolrCloud shard leaders change while indexing updates arrive, Solr could fail and return
     an HTTP 503 status.  Switched to 510 so that CloudSolrClient will auto-retry it and probably succeed.
   
   Based on the errors from some rare flapping tests, I believe this can be just an improvement. But to be honest, I have not encountered the issue in this way; I see it in a serious bug form that I might describe as follows:
   
   * SOLR-11685: When SolrCloud shard leaders change while indexing updates arrive, Solr could return
     a success to a client when it actually failed to accept it.
   
   In the first (just an improvement), it's likely the initial Solr node had the leader flip confusion, but in the second (a bug) it happens when the initial Solr node has to forward the message to another node that is the leader (but doesn't quite know it yet). I'm debugging more to clarify the impact of the bug with only this change. There is very likely another bug for a more general case, which would probably deserve another JIRA, unless we fold it into this one to clarify the messaging to users.
   
   I could imagine a test we could beast that induces ZooKeeper session losses, and thus Solr-side shard leadership changes, while indexing is coming in, constantly checking whether each doc _actually_ makes it. Some of the chaos tests show how to do the session loss trick.


[GitHub] [solr] dsmiley commented on pull request #1484: SOLR-11685: use INVALID_STATE for ZK state issue in DistributedZkUpdateProcessor

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on PR #1484:
URL: https://github.com/apache/solr/pull/1484#issuecomment-1506297931

   On this issue here (but maybe not ShardSplitTest?), the different error code resulted in the error bubbling all the way back to the client. My test (not here) was using CloudSolrClient, which retried, and it worked.
   
   What you say is interesting, though, because I'd imagine it'd be useful for the parent shard, which is the middle-man in our scenarios, to retry this. But since the retries there are done in SolrCmdDistributor specifically, based on your observation, they could never succeed. At least not for this scenario. And it didn't retry a 510 code anyway (I saw this empirically).

