Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/06/28 23:06:16 UTC

[GitHub] [bookkeeper] dlg99 opened a new pull request, #3374: Shut down ReplicationWorker and Auditor on non-recoverable ZK error

dlg99 opened a new pull request, #3374:
URL: https://github.com/apache/bookkeeper/pull/3374

   Descriptions of the changes in this PR:
   
   Shut down Replication Worker and Auditor on non-recoverable ZK error
   
   ### Motivation
   
   Some ZooKeeper errors can only be recovered from by re-creating the ZK client.
   Currently BK and its underlying components cannot do that transparently.
   When AutoRecovery runs as a separate service, it does not seem to crash in such cases. It keeps running while unable to do anything useful, and an operator has to restart the service manually.
   
   In such cases the logs contain messages like:
   ```
   ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments
   org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:610) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:264) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230) [com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.76.Final.jar:4.1.76.Final]
   	at java.lang.Thread.run(Thread.java:829) [?:?]
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/ledgers/0000/0001/0ebb
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2589) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$3701(ZooKeeperClient.java:70) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$27.call(ZooKeeperClient.java:1251) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$27.call(ZooKeeperClient.java:1245) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient.getChildren(ZooKeeperClient.java:1245) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager$1.getChildren(ZkLedgerUnderreplicationManager.java:147) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.util.SubTreeCache.getChildren(SubTreeCache.java:118) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:550) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:562) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:562) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:562) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:603) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   ```
   
   ```
   2022-06-24T17:15:55,405 [ZkLedgerManagerScheduler-11-1] ERROR org.apache.bookkeeper.replication.Auditor - Underreplication manager unavailable running periodic check
   org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:731) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.replication.Auditor.lambda$checkAllLedgers$7(Auditor.java:1254) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.lambda$operationComplete$0(AbstractZkLedgerManager.java:573) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
   	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
   	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
   	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.76.Final.jar:4.1.76.Final]
   	at java.lang.Thread.run(Thread.java:829) [?:?]
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/disable
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2021) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2049) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   	at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:726) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
   ```
   
   and other similar errors.
   
   ### Changes
   
   Replication Worker and Auditor now shut down on such errors, which makes their error state visible and lets k8s or a service monitor restart them.
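
As a rough illustration of the shut-down-on-fatal-error behavior described above, here is a minimal, self-contained Java sketch. It is not the code from this PR: `ShutdownOnFatalErrorSketch`, `NonRecoverableException`, `Task`, and `Worker` are hypothetical names invented for the example, and the real ReplicationWorker and Auditor are considerably more involved. The sketch only shows the general idea of a worker loop that stops itself when it hits an error it cannot recover from, instead of logging the error and looping forever.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownOnFatalErrorSketch {

    /** Hypothetical stand-in for a metadata-store failure the client cannot recover from. */
    static class NonRecoverableException extends Exception {
        NonRecoverableException(String message) {
            super(message);
        }
    }

    /** Hypothetical unit of work, e.g. one replication or audit pass. */
    interface Task {
        void runOnce() throws NonRecoverableException;
    }

    /** Hypothetical worker that keeps running passes until it is shut down. */
    static class Worker implements Runnable {
        private final Task task;
        private final AtomicBoolean running = new AtomicBoolean(true);

        Worker(Task task) {
            this.task = task;
        }

        @Override
        public void run() {
            while (running.get()) {
                try {
                    task.runOnce();
                } catch (NonRecoverableException e) {
                    // Stop the loop so the failure is visible to whatever supervises
                    // the process, rather than retrying with a broken client.
                    System.err.println("Non-recoverable error, shutting down: " + e.getMessage());
                    shutdown();
                }
            }
        }

        void shutdown() {
            running.set(false);
        }

        boolean isRunning() {
            return running.get();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // The third pass simulates a failure that requires a new client.
        Worker worker = new Worker(() -> {
            if (++calls[0] == 3) {
                throw new NonRecoverableException("simulated connection loss");
            }
        });
        Thread thread = new Thread(worker, "worker");
        thread.start();
        thread.join();
        System.out.println("running=" + worker.isRunning() + " after " + calls[0] + " passes");
    }
}
```

In this sketch the worker thread exits after the failing pass. A supervisor such as k8s can only restart the service once the process itself terminates or reports itself unhealthy; how that is wired up is outside the scope of the example.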
   
   Added tests.
   
   Removed KeeperException from some interfaces/implementations to prevent raw ZK exceptions from sneaking through, but a few others remain; see https://github.com/apache/bookkeeper/issues/3373


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli merged pull request #3374: Shut down ReplicationWorker and Auditor on non-recoverable ZK error

Posted by GitBox <gi...@apache.org>.
eolivelli merged PR #3374:
URL: https://github.com/apache/bookkeeper/pull/3374

