You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/09/13 16:58:06 UTC
[GitHub] [pulsar] lhotari opened a new pull request, #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
lhotari opened a new pull request, #17620:
URL: https://github.com/apache/pulsar/pull/17620
### Motivation
In a Pulsar 2.10.x test environment we ran into an issue where there were 12.3 million ZK WatchEvents in the heap dump.
The ZK Event thread was blocked and the ZK session had also been expired.
### Modifications
Run callbacks on a separate thread.
### Documentation
<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
- [ ] `doc-required`
(Your PR needs to update docs and you will update later)
- [x] `doc-not-needed`
(Please explain why)
- [ ] `doc`
(Your PR contains doc changes)
- [ ] `doc-complete`
(Docs have been already added)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] eolivelli commented on a diff in pull request #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
Posted by GitBox <gi...@apache.org>.
eolivelli commented on code in PR #17620:
URL: https://github.com/apache/pulsar/pull/17620#discussion_r969877749
##########
pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java:
##########
@@ -60,6 +64,9 @@ public PulsarRegistrationClient(MetadataStore store,
this.bookieAllRegistrationPath = ledgersRootPath + "/" + COOKIE_NODE;
this.bookieReadonlyRegistrationPath = this.bookieRegistrationPath + "/" + READONLY;
+ this.executor = Executors
Review Comment:
Please shutdown this threadpool
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] eolivelli merged pull request #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
Posted by GitBox <gi...@apache.org>.
eolivelli merged PR #17620:
URL: https://github.com/apache/pulsar/pull/17620
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] lhotari commented on a diff in pull request #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
Posted by GitBox <gi...@apache.org>.
lhotari commented on code in PR #17620:
URL: https://github.com/apache/pulsar/pull/17620#discussion_r969881601
##########
pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java:
##########
@@ -60,6 +64,9 @@ public PulsarRegistrationClient(MetadataStore store,
this.bookieAllRegistrationPath = ledgersRootPath + "/" + COOKIE_NODE;
this.bookieReadonlyRegistrationPath = this.bookieRegistrationPath + "/" + READONLY;
+ this.executor = Executors
Review Comment:
done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] lhotari commented on pull request #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17620:
URL: https://github.com/apache/pulsar/pull/17620#issuecomment-1246303451
> @lhotari Do you have the stack trace for the deadlock to attach here for context?
@merlimat I only had a heap dump of the problem so I'm not sure about the deadlock details. The heap dump includes a thread dump, but without lock info. Here's the full thread dump:
https://gist.github.com/lhotari/b9513c1eeb7582b25f60f1c89119789d
(line numbers refer to DS [fork of BK 4.14.5](https://github.com/datastax/bookkeeper/tree/ds-4.14.5.1.0.2) and [fork of Pulsar 2.10.x](https://github.com/datastax/pulsar/tree/ls210_1.5))
Here's a subset of threads related to metadata store and ZK. There might be several issues that are causing the issue.
I noticed that LeaderElectionImpl.close is also blocked. I guess the ZK session has expired and the recovery for this doesn't work since it's blocked.
Here are the metadata & ZK related threads [from the full thread dump](https://gist.github.com/lhotari/b9513c1eeb7582b25f60f1c89119789d):
```
AuditorBookie-bookie9.testcluster.example.com:3181 Waiting Thread ID: 85
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager.getLostBookieRecoveryDelay(PulsarLedgerUnderreplicationManager.java:785)
org.apache.bookkeeper.replication.Auditor$9.run(Auditor.java:545)
org.apache.bookkeeper.util.SafeRunnable$1.safeRun(SafeRunnable.java:43)
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.lang.Thread.run(Thread.java:834)
AuditorElector-bookie9.testcluster.example.com:3181 Waiting Thread ID: 83
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager.notifyLostBookieRecoveryDelayChanged(PulsarLedgerUnderreplicationManager.java:804)
org.apache.bookkeeper.replication.Auditor.start(Auditor.java:699)
org.apache.bookkeeper.replication.AuditorElector$3.run(AuditorElector.java:187)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.lang.Thread.run(Thread.java:834)
main-EventThread Blocked Thread ID: 59
org.apache.bookkeeper.replication.Auditor.submitAuditTask(Auditor.java:533)
org.apache.bookkeeper.replication.Auditor.lambda$watchBookieChanges$0(Auditor.java:1051)
org.apache.bookkeeper.replication.Auditor$$Lambda$340.onBookiesChanged()
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$216.accept()
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714)
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
org.apache.pulsar.metadata.impl.ZKMetadataStore.handleGetChildrenResult(ZKMetadataStore.java:260)
org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$6(ZKMetadataStore.java:191)
org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$156.processResult()
org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:490)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:712)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:553)
metadata-store-20-1 Waiting Thread ID: 57
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
metadata-store-3-1 Waiting Thread ID: 17
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
metadata-store-coordination-service-22-1 Waiting Thread ID: 86
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1170)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
metadata-store-coordination-service-5-1 Waiting Thread ID: 90
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.lambda$handleSessionEvent$3(LockManagerImpl.java:139)
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl$$Lambda$343.run()
org.apache.bookkeeper.common.util.SafeRunnable$1.safeRun(SafeRunnable.java:60)
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
metadata-store-zk-session-watcher-21-1 Waiting Thread ID: 60
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
java.util.concurrent.CompletableFuture.join(CompletableFuture.java:2043)
org.apache.pulsar.metadata.coordination.impl.LeaderElectionImpl.close(LeaderElectionImpl.java:233)
org.apache.pulsar.metadata.bookkeeper.PulsarLedgerAuditorManager.close(PulsarLedgerAuditorManager.java:96)
org.apache.bookkeeper.replication.AuditorElector.shutdown(AuditorElector.java:234)
org.apache.bookkeeper.replication.AutoRecoveryMain.shutdown(AutoRecoveryMain.java:154)
org.apache.bookkeeper.replication.AutoRecoveryMain.lambda$new$0(AutoRecoveryMain.java:99)
org.apache.bookkeeper.replication.AutoRecoveryMain$$Lambda$222.onSessionExpired()
org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver.lambda$setSessionStateListener$0(PulsarMetadataClientDriver.java:68)
org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver$$Lambda$223.accept()
org.apache.pulsar.metadata.impl.AbstractMetadataStore.lambda$receivedSessionEvent$11(AbstractMetadataStore.java:297)
org.apache.pulsar.metadata.impl.AbstractMetadataStore$$Lambda$342.accept()
java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:807)
org.apache.pulsar.metadata.impl.AbstractMetadataStore.receivedSessionEvent(AbstractMetadataStore.java:295)
org.apache.pulsar.metadata.impl.ZKMetadataStore.receivedSessionEvent(ZKMetadataStore.java:152)
org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$148.accept()
org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkState(ZKSessionWatcher.java:150)
org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:110)
org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$149.run()
org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:54)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
metadata-store-zk-session-watcher-4-1 Waiting Thread ID: 20
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1798)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1868)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:104)
org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$149.run()
org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:54)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
ReplicationWorker Waiting Thread ID: 63
java.lang.Object.wait(Object.java)
org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager.getLedgerToRereplicate(PulsarLedgerUnderreplicationManager.java:572)
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:264)
org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:834)
ZKC-connect-executor-0 Waiting Thread ID: 169
jdk.internal.misc.Unsafe.park(Unsafe.java)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.lang.Thread.run(Thread.java:834)
ZKC-connect-executor-0-EventThread Blocked Thread ID: 823
org.apache.pulsar.metadata.impl.ZKSessionWatcher.process(ZKSessionWatcher.java:120)
org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$0(ZKMetadataStore.java:101)
org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$361.accept()
java.util.Optional.ifPresent(Optional.java:183)
org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$1(ZKMetadataStore.java:101)
org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$139.process()
org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.notifyEvent(ZooKeeperWatcherBase.java:180)
org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.process(ZooKeeperWatcherBase.java:146)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:578)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:553)
ZKC-connect-executor-0-SendThread(zk01.testcluster.example.com:2181) Runnable Thread ID: 822
sun.nio.ch.EPoll.wait(EPoll.java)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:332)
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] merlimat commented on pull request #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
Posted by GitBox <gi...@apache.org>.
merlimat commented on PR #17620:
URL: https://github.com/apache/pulsar/pull/17620#issuecomment-1246092800
@lhotari Do you have the stack trace for the deadlock to attach here for context?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] lhotari commented on pull request #17620: [fix][metadata] Don't execute Bookkeeper metadata callbacks on Zookeeper event thread
Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17620:
URL: https://github.com/apache/pulsar/pull/17620#issuecomment-1246342458
I created a separate PR about handling session events in a separate thread: #17638
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org