You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by slackhappy <gi...@git.apache.org> on 2018/02/06 04:12:45 UTC

[GitHub] lucene-solr pull request #321: SOLR-11932 Retry ZkOperation on SessionExpire...

GitHub user slackhappy opened a pull request:

    https://github.com/apache/lucene-solr/pull/321

    SOLR-11932 Retry ZkOperation on SessionExpired

    https://issues.apache.org/jira/browse/SOLR-11932
    
    We are seeing situations where an operation, such as changing a replica's state to active after a recovery, fails because the zk session has expired.
    
    However, these operations seem like they are retryable, because the ZookeeperConnect receives an event that the session expired and tries to reconnect.
    
    That makes the SessionExpired handling scenario seem very similar to the ConnectionLoss handling scenario, so the ZkCmdExecutor seems like it could handle them in the same way.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/slackhappy/lucene-solr SOLR-11932

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucene-solr/pull/321.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #321
    
----
commit a33db254badce91b886a5973e743d3a251d2d760
Author: John Gallagher <jg...@...>
Date:   2018-02-06T04:08:53Z

    SOLR-11932 Retry ZkOperation on SessionExpired

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #321: SOLR-11932 Retry ZkOperation on SessionExpire...

Posted by slackhappy <gi...@git.apache.org>.
Github user slackhappy commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/321#discussion_r166183931
  
    --- Diff: solr/core/src/test/org/apache/solr/cloud/ZkSolrClientTest.java ---
    @@ -188,7 +188,7 @@ public void testReconnect() throws Exception {
         }
       }
       
    -  public void testZkCmdExectutor() throws Exception {
    +  public void testZkCmdExecutor() throws Exception {
    --- End diff --
    
    spelling


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #321: SOLR-11932 Retry ZkOperation on SessionExpire...

Posted by slackhappy <gi...@git.apache.org>.
Github user slackhappy commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/321#discussion_r166183887
  
    --- Diff: solr/core/src/java/org/apache/solr/cloud/Overseer.java ---
    @@ -303,7 +303,7 @@ private void checkIfIamStillLeader() {
           String path = OVERSEER_ELECT + "/leader";
           byte[] data;
           try {
    -        data = zkClient.getData(path, null, stat, true);
    +        data = zkClient.getData(path, null, stat, false);
    --- End diff --
    
    i changed the `retryOnConnLoss` to false here because `checkIfIamStillLeader` is the teardown hook.  The retry was causing tests to fail due to the com.carrotsearch.randomizedtesting.ThreadLeakControl check seeing threads that were waiting on retry, and also, it seems a little weird to do this retrying on a teardown hook anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #321: SOLR-11932 Retry ZkOperation on SessionExpire...

Posted by slackhappy <gi...@git.apache.org>.
Github user slackhappy commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/321#discussion_r166184126
  
    --- Diff: solr/core/src/test/org/apache/solr/cloud/ZkSolrClientTest.java ---
    @@ -205,14 +205,14 @@ public void testZkCmdExectutor() throws Exception {
           try {
           zkCmdExecutor.retryOperation(() -> {
             if (System.nanoTime() - start > TimeUnit.NANOSECONDS.convert(timeout, TimeUnit.MILLISECONDS)) {
    -          throw new KeeperException.SessionExpiredException();
    --- End diff --
    
    My understanding was that SessionExpired was just being used as an alternate (non-caught) exception type, SessionExpired itself wasn't really being tested here, just something besides ConnectionLoss.  So I believe the intent of this test was to check:
    1) that caught exceptions are retried
    2) that non-caught exceptions are passed through.
    
    so i picked a new exception type for 2) that preserves the checks of this test.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #321: SOLR-11932 Retry ZkOperation on SessionExpire...

Posted by slackhappy <gi...@git.apache.org>.
Github user slackhappy commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/321#discussion_r166184184
  
    --- Diff: solr/solrj/src/java/org/apache/solr/common/cloud/SolrZkClient.java ---
    @@ -241,7 +241,7 @@ public void delete(final String path, final int version, boolean retryOnConnLoss
           throws InterruptedException, KeeperException {
         if (retryOnConnLoss) {
           zkCmdExecutor.retryOperation(() -> {
    -        keeper.delete(path, version);
    +        getSolrZooKeeper().delete(path, version);
    --- End diff --
    
    switched all retry operations to resolve `keeper` at call time (so a retry will re-resolve the object)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org