Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/05/10 21:17:20 UTC

[GitHub] [accumulo] EdColeman opened a new issue, #2689: Examine multi-thread correctness in ZooPropstore.get()

EdColeman opened a new issue, #2689:
URL: https://github.com/apache/accumulo/issues/2689

   From a PR comment on the ZooPropStore.get() method - this can be reviewed as a follow-on to the single-node prop store changes.
   
   @keith-turner said:
   
   Tracing through the code, I think the following sequence of events could happen. Does this seem correct? I want to make sure I am following the code correctly.
   
   1. Thread 1 calls checkZkConnection() and does not block because the connection is currently ok.
   2. Thread 2 sees a connection loss event in the watcher, clears the ready state, and queues up an event to clear the cache.
   3. Thread 3 clears the cache (running in an executor on behalf of the event queued by thread 2).
   4. Thread 1 executes cache.get()... because the cache was cleared, a loader is executed in a thread pool.
   5. Thread 4 executes the code to load data from ZK... the connection is still lost, so what happens here? This thread is executing on behalf of the cache.get() initiated by thread 1.
   I am not sure exactly what happens in step 5 above; does the load from ZK retry when it sees a connection-lost exception? If it does, then checkZkConnection() is redundant. If it does not, then checkZkConnection() does not always prevent problems in the case of connection loss, and maybe it should retry in order to eliminate the race condition.
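
   To make the interleaving concrete, here is a minimal, self-contained sketch of the check-then-act shape described above. `ReadyGate`, `RaceSketch`, and `loadFromZk` are illustrative stand-ins, not the real `ReadyMonitor`/`ZooPropStore`/`ZooPropLoader` classes:
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   
   // Illustrative sketch only; none of these names are the actual Accumulo classes.
   public class RaceSketch {
   
     // Stands in for the ReadyMonitor: a flag the watcher clears on disconnect.
     static class ReadyGate {
       private volatile boolean ready = true;
       void check() { while (!ready) Thread.onSpinWait(); } // step 1 passes if ready
       void clear() { ready = false; }                      // step 2: watcher clears it
       void set()   { ready = true; }                       // set again on reconnect
     }
   
     static final ReadyGate gate = new ReadyGate();
     static final Map<String, String> cache = new ConcurrentHashMap<>();
   
     static String get(String key) {
       gate.check(); // step 1: the connection looked ok at this instant
       // <-- window: steps 2 and 3 can interleave here (watcher clears the
       //     gate, an executor thread clears the cache)
       return cache.computeIfAbsent(key, // step 4: now a miss, so the loader runs
           RaceSketch::loadFromZk);      // step 5: load may see a lost connection
     }
   
     static String loadFromZk(String key) {
       // In the real code this is ZooPropLoader -> ZooReaderWriter.getData,
       // which retries and may return null once retries are exhausted.
       return "value-for-" + key;
     }
   }
   ```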
   




[GitHub] [accumulo] dlmarion commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1164614427

   I think the test most relevant to this issue is `ZooBasedConfigIT.expireTest`. It may be possible to derive a new test based on this test that shuts down ZooKeeper during the load/reload operation.
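
   A rough shape for such a test might be the following hedged sketch. The fixture names (`zkTestServer`, `propStore`, `cache`, `sysPropKey`) are illustrative stand-ins for whatever `ZooBasedConfigIT` actually provides, and the stop/restart helpers are assumed, not confirmed:
   
   ```java
   import static org.junit.jupiter.api.Assertions.assertEquals;
   import static org.junit.jupiter.api.Assertions.assertNotNull;
   
   import java.util.Map;
   
   import org.junit.jupiter.api.Test;
   
   // Hedged sketch only; fixture names and the stop/restart helpers are
   // assumed, not taken from the actual ZooBasedConfigIT.
   @Test
   public void reloadDuringZkOutageTest() throws Exception {
     propStore.create(sysPropKey, Map.of("table.custom.p1", "v1"));
     assertNotNull(propStore.get(sysPropKey)); // warm the cache
   
     zkTestServer.stop();      // ZooKeeper goes away...
     cache.remove(sysPropKey); // ...and the next get() must reload from ZK
     zkTestServer.restart();   // connection loss and recovery during the load
   
     assertEquals("v1", propStore.get(sysPropKey).asMap().get("table.custom.p1"));
   }
   ```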




[GitHub] [accumulo] dlmarion commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1175179031

   The `PropStoreWatcher` will clear the cache when the connection is disconnected or closed (it fires twice), and then the `ZooPropLoader` calls `ZooReaderWriter.getData`, which runs in a retry loop. The retry loop may return null after exhausting its retries, so I think the `ReadyMonitor` is an attempt to prevent that (although there is a race condition) by not calling `ZooReaderWriter.getData` until the `PropStoreWatcher` gets an event saying the connection is re-established.
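
   For reference, the failure mode being described has roughly this shape. This is a hedged sketch, not the actual `ZooReaderWriter` code; only the `ZooKeeper.getData` call is the real client API:
   
   ```java
   import org.apache.zookeeper.KeeperException;
   import org.apache.zookeeper.ZooKeeper;
   
   // Hedged sketch of a bounded-retry read: once the retries are exhausted
   // while disconnected, the caller sees null.
   static byte[] getDataWithRetries(ZooKeeper zk, String path, int maxRetries)
       throws KeeperException, InterruptedException {
     for (int attempt = 0; attempt <= maxRetries; attempt++) {
       try {
         return zk.getData(path, false, null); // real ZooKeeper client call
       } catch (KeeperException.ConnectionLossException e) {
         Thread.sleep(250L * (attempt + 1)); // back off, then retry
       }
     }
     return null; // retries exhausted while the connection was down
   }
   ```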




[GitHub] [accumulo] dlmarion commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1172619799

   > If what I did is correct,
   
   It wasn't. I was closing a different client's ZK session. I believe that I now have a test that implements the order of operations from the original description. I had to move one line of code in `ZooPropStore.get` so that the watcher is registered *before* the call to `checkZkConnection`. Here is the test output:
   
   ```
   2022-07-01T18:38:54,281 [impl.PropStoreMetrics] DEBUG: Creating PropStore metrics
   2022-07-01T18:38:54,317 [metrics.MetricsUtil] INFO : Metric producer PropStoreMetrics initialize
   2022-07-01T18:38:54,319 [impl.ZooPropStore] DEBUG: Have a ZooKeeper connection and found instance node: 2a0af30d-5918-425d-b94e-78b66a466ccd
   2022-07-01T18:38:54,326 [impl.TestZooPropStore] INFO : super.checkZkConnection() called, calling runCloseConnectionTask
   2022-07-01T18:38:54,329 [server.ZooKeeperServer] INFO : Closing session 0x10001954cc20002
   2022-07-01T18:38:54,433 [impl.PropStoreWatcher] DEBUG: ZooKeeper disconnected event received
   2022-07-01T18:38:54,435 [impl.ZooPropStore] TRACE: connectionEvent, clearing cache
   2022-07-01T18:38:55,945 [server.ZooKeeperServer] INFO : Invalid session 0x10001954cc20002 for client /127.0.0.1:56280, probably expired
   2022-07-01T18:38:55,946 [zookeeper.ZooSession] DEBUG: Session expired; WatchedEvent state:Expired type:None path:null
   2022-07-01T18:38:55,947 [impl.PropStoreWatcher] INFO : ZooKeeper connection closed event received
   2022-07-01T18:38:55,947 [impl.ZooPropStore] TRACE: connectionEvent, clearing cache
   2022-07-01T18:38:56,327 [impl.TestZooPropStore] INFO : runCloseConnectionTask called, waiting for connection event to release latch
   2022-07-01T18:38:56,327 [impl.TestZooPropStore] INFO : latch released
   2022-07-01T18:38:56,327 [impl.PropCacheCaffeineImpl] TRACE: Called get() for SystemPropKey{InstanceId=2a0af30d-5918-425d-b94e-78b66a466ccd'}
   2022-07-01T18:38:56,335 [impl.ZooPropLoader] TRACE: load called for SystemPropKey{InstanceId=2a0af30d-5918-425d-b94e-78b66a466ccd'}
   2022-07-01T18:38:56,335 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to 127.0.0.1:39327
   2022-07-01T18:38:56,335 [zookeeper.ZooSession] DEBUG: Connecting to 127.0.0.1:39327 with timeout 5000 with auth
   2022-07-01T18:38:56,437 [server.ZooKeeperServer] INFO : got auth packet /127.0.0.1:56282
   2022-07-01T18:38:56,437 [server.ZooKeeperServer] INFO : Session 0x10001954cc20003: auth success for scheme digest and address /127.0.0.1:56282
   2022-07-01T18:38:56,441 [impl.PropStoreMetrics] TRACE: Load time: 106
   2022-07-01T18:38:56,441 [impl.PropStoreMetrics] TRACE: Load count: 0 time:0.0
   ```
   
   Note that the connection event is received twice, once for `disconnect` and once for `close`, so the cache is cleared twice. The connection to ZooKeeper is re-established so that the `ZooPropLoader` can load the properties.
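
   For clarity, the reordering amounts to something like the following. This is an illustrative sketch of the shape of the change, not the literal `ZooPropStore.get` body, and `registerListener` is an assumed method name:
   
   ```java
   // Illustrative sketch of the one-line move; method names are assumed.
   public VersionedProperties get(PropStoreKey<?> key) {
     propStoreWatcher.registerListener(key, this); // moved: register the watcher first
     checkZkConnection();   // then block on the ReadyMonitor until "ready"
     return cache.get(key); // a cache miss here triggers the ZooPropLoader
   }
   ```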




[GitHub] [accumulo] EdColeman commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1175371533

   With infinite retries we'd need to handle a closed state and just quit. Not an issue, just something that would need to be addressed.
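
   As a hedged sketch of that point (the `closed` flag is illustrative; the real prop store would track its own lifecycle state):
   
   ```java
   import java.util.concurrent.atomic.AtomicBoolean;
   
   import org.apache.zookeeper.KeeperException;
   import org.apache.zookeeper.ZooKeeper;
   
   // With infinite retries the loop never gives up on its own, so it needs an
   // explicit exit for a closed/terminal state. The closed flag is illustrative.
   static byte[] getDataInfinite(ZooKeeper zk, String path, AtomicBoolean closed)
       throws KeeperException, InterruptedException {
     while (true) {
       if (closed.get()) {
         throw new IllegalStateException("prop store closed; abandoning read of " + path);
       }
       try {
         return zk.getData(path, false, null);
       } catch (KeeperException.ConnectionLossException e) {
         Thread.sleep(250); // keep retrying until reconnected or closed
       }
     }
   }
   ```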




[GitHub] [accumulo] dlmarion commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1172456873

   I wrote a test that closes the connection on the ZooKeeper server-side using the client sessionId. If what I did is correct, then I don't believe that the ZooPropStore is getting connection events. I'm going to push a branch and draft PR for this shortly.
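
   For reference, the standard client-side way to force this is the session-duplication trick: open a second `ZooKeeper` client with the victim's session id and password, then close it, which makes the server expire the original session. A hedged sketch, using only public ZooKeeper client APIs (the actual test in the draft PR may differ):
   
   ```java
   import org.apache.zookeeper.ZooKeeper;
   
   // Force-expire a client's session: connect a duplicate client with the same
   // session id/password, then close it; the server expires the original session.
   static void expireSession(ZooKeeper victim, String connectString) throws Exception {
     ZooKeeper imposter = new ZooKeeper(connectString, 30_000, event -> {},
         victim.getSessionId(), victim.getSessionPasswd());
     imposter.close();
   }
   ```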




[GitHub] [accumulo] dlmarion commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1172621933

   > I am not sure exactly what happens in step 5 above; does the load from ZK retry when it sees a connection-lost exception?
   
   It appears that the ZooKeeper client will re-establish the connection.
   
   > If it does then the checkZkConnection() is redundant.
   
   That may be. We could easily modify this test to comment out that method call and see if the result is the same.




[GitHub] [accumulo] dlmarion commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1175358689

   > There is an issue that code could get past the barrier, and then the ZooKeeper connection is lost, and I don't think that can ever be prevented. It should be a small window, but it will always be there. The issue would then be whether, if it gets past the barrier and the connection is lost, it is handling it correctly. If the code proceeds past the barrier and calls get() and the connection fails, the retry will eventually return null. On the null, that should trigger an error / exception.
   
   If we used a different RetryFactory, one that uses infiniteRetries, instead of the default one in ZooReader, then in this case it would not return null and would not trigger an exception.
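
   For illustration, such a factory might be built roughly like this, assuming Accumulo's `Retry` builder (the exact package and builder steps may differ by version, so treat this as a sketch):
   
   ```java
   import static java.util.concurrent.TimeUnit.MILLISECONDS;
   import static java.util.concurrent.TimeUnit.MINUTES;
   import static java.util.concurrent.TimeUnit.SECONDS;
   
   import org.apache.accumulo.fate.util.Retry;
   
   // Assumed builder chain: an infinite-retry policy for the ZooKeeper read
   // path instead of the bounded default used by ZooReader.
   Retry.RetryFactory factory = Retry.builder()
       .infiniteRetries()              // never gives up on its own
       .retryAfter(250, MILLISECONDS)
       .incrementBy(250, MILLISECONDS)
       .maxWait(30, SECONDS)
       .backOffFactor(1.5)
       .logInterval(3, MINUTES)
       .createFactory();
   ```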




[GitHub] [accumulo] EdColeman commented on issue #2689: Examine multi-thread correctness in ZooPropstore.get()

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #2689:
URL: https://github.com/apache/accumulo/issues/2689#issuecomment-1175345108

   I am not sure that clearing twice is an issue.  The state transitions of ZooKeeper seemed to indicate that you can get a disconnect and then reconnect without a close.  Close is a terminal state.
   
   As far as the ready monitor itself - it is meant to block without holding locks - it functions as a barrier so that, when we are known to be disconnected, everyone is not banging on ZooKeeper with retries.  There is an issue that code could get past the barrier, and then the ZooKeeper connection is lost, and I don't think that can ever be prevented.  It should be a small window, but it will always be there.  The issue would then be whether, if it gets past the barrier and the connection is lost, it is handling it correctly.
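
   To make that concrete, here is a hedged sketch of a barrier in that spirit. `ReadyBarrier` is illustrative, not the actual `ReadyMonitor`; note the unsynchronized fast path, which is exactly where the small window lives:
   
   ```java
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;
   import java.util.concurrent.locks.Condition;
   import java.util.concurrent.locks.ReentrantLock;
   
   // Illustrative barrier: callers block (with a timeout) while the connection
   // is known to be down, so they are not all hammering ZooKeeper with retries.
   class ReadyBarrier {
     private final ReentrantLock lock = new ReentrantLock();
     private final Condition readyCondition = lock.newCondition();
     private volatile boolean ready = false;
     private final long timeoutNanos;
   
     ReadyBarrier(long timeoutMillis) {
       this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
     }
   
     // Returns only when ready (or times out); the caller holds no lock after
     // returning, so a disconnect can still land before its next ZK call.
     void waitUntilReady() throws TimeoutException, InterruptedException {
       if (ready) return; // fast path: no locking
       lock.lock();
       try {
         long remaining = timeoutNanos;
         while (!ready) {
           if (remaining <= 0) throw new TimeoutException("connection not ready");
           remaining = readyCondition.awaitNanos(remaining);
         }
       } finally {
         lock.unlock();
       }
     }
   
     void setReady() { // watcher calls this on (re)connect
       lock.lock();
       try { ready = true; readyCondition.signalAll(); } finally { lock.unlock(); }
     }
   
     void clearReady() { ready = false; } // watcher calls this on disconnect
   }
   ```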
   
   If the code proceeds past the barrier and calls get() and the connection fails, the retry will eventually return null.  On the null, that should trigger an error / exception.
   
   Currently I think that we do not handle connection loss as cleanly as we could - if a cluster tries to do a rolling ZK restart, a lot of tservers drop offline.  This does not fully fix that, but it should get things closer, and then the other problem areas can be addressed separately.
   

