Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/06/14 08:00:26 UTC

[GitHub] [incubator-druid] viongpanzi opened a new issue #7893: Will some PathChildrenCacheEvent be missed after the connection to zk disconnected

viongpanzi opened a new issue #7893: Will some PathChildrenCacheEvent be missed after the connection to zk disconnected
URL: https://github.com/apache/incubator-druid/issues/7893
 
 
   Hi all,
   
   We have a problem!
   
   The information about our prod cluster:
   
   version: 0.13.0
   number of segments: more than 6 million
   GC: G1 (a single full GC takes more than 120 seconds)
   incremental poll is enabled
   
   
   After each full GC (which takes more than 120 seconds), one coordinator's connection to ZooKeeper is dropped due to a timeout. The other coordinator soon becomes the leader, and another full GC happens after it polls all data segments from the metadata store. Its connection to ZooKeeper is then dropped as well, and the two coordinators are trapped in a loop. However, if we restart both coordinator services, they work well for days.
   
   To find the cause, we used MAT (Eclipse Memory Analyzer Tool) to analyze a heap dump from one of the two coordinators. It reported the following:
   
   ![image](https://user-images.githubusercontent.com/8834263/59492148-eba48600-8eba-11e9-91c6-30e0d0a465f7.png)
   
   
   After tracing the call stack to zNodes and checking the coordinator logs, we found some log entries about ZooKeeper node events that look suspicious:
   
   ```
   09/Jun/2019 20:49:42,970 [ServerInventoryView-0] WARN  org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Exception while getting data for node /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
           at org.apache.zookeeper.KeeperException.create(KeeperException.java:114) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
           at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
           at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1215) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[curator-client-4.1.0.jar:?]
           at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[curator-client-4.1.0.jar:?]
           at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:107) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:67) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.druid.curator.inventory.CuratorInventoryManager.getZkDataForNode(CuratorInventoryManager.java:177) [druid-server-0.13.0-ad.jar:0.13.0-ad]
           at org.apache.druid.curator.inventory.CuratorInventoryManager.access$400(CuratorInventoryManager.java:58) [druid-server-0.13.0-ad.jar:0.13.0-ad]
           at org.apache.druid.curator.inventory.CuratorInventoryManager$ContainerCacheListener$InventoryCacheListener.childEvent(CuratorInventoryManager.java:402) [druid-server-0.13.0-ad.jar:0.13.0-ad]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435) [curator-client-4.1.0.jar:?]
           at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808) [curator-recipes-4.1.0.jar:4.1.0]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
           at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
   09/Jun/2019 20:49:42,970 [ServerInventoryView-0] INFO  org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Ignoring event: Type - CHILD_UPDATED , Path - /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60 , Version - 4
   ```
   
   Can some PathChildrenCacheEvents be missed after the connection to ZooKeeper is lost? If not, how can we explain the exception above, in which the coordinator attempts to update a node that does not exist?
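
To make the question concrete, here is a toy model in plain Java of one possible reading of the log. It is not Curator or Druid code, and every class and method name in it is hypothetical. It assumes only what the stack trace suggests: the listener reads the node's data when it handles a CHILD_UPDATED event (`getZkDataForNode` is called from `childEvent`), not when the event was created. Under that assumption, an update event that is handled late can hit NoNode because the node was deleted in the meantime, even though no event was lost. Whether this is what actually happens in our cluster is unconfirmed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model (hypothetical, not Curator): a node store plus a queue of events handled later. */
public class StaleEventSketch {
    enum Type { CHILD_ADDED, CHILD_UPDATED, CHILD_REMOVED }

    static final class Event {
        final Type type;
        final String path;
        Event(Type type, String path) { this.type = type; this.path = path; }
    }

    // Stands in for the ZooKeeper tree; in-memory only.
    private final Map<String, String> store = new HashMap<>();
    // Events wait here until the listener gets a chance to run.
    private final Queue<Event> pending = new ArrayDeque<>();

    void put(String path, String data) {
        Type type = store.put(path, data) == null ? Type.CHILD_ADDED : Type.CHILD_UPDATED;
        pending.add(new Event(type, path));
    }

    void delete(String path) {
        if (store.remove(path) != null) {
            pending.add(new Event(Type.CHILD_REMOVED, path));
        }
    }

    /** Handles queued events in order, reading the current data for each add/update. */
    List<String> drain() {
        List<String> log = new ArrayList<>();
        Event e;
        while ((e = pending.poll()) != null) {
            if (e.type == Type.CHILD_REMOVED) {
                log.add("removed " + e.path);
                continue;
            }
            // The read happens now, not at the time the event was queued.
            String data = store.get(e.path);
            if (data == null) {
                log.add("ignored " + e.type + " " + e.path + " (NoNode)");
            } else {
                log.add("applied " + e.type + " " + e.path);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        StaleEventSketch s = new StaleEventSketch();
        s.put("/segments/a", "v1");
        s.drain();
        // The listener stalls (for example during a long pause) while changes pile up.
        s.put("/segments/a", "v2");
        s.delete("/segments/a");
        s.drain().forEach(System.out::println);
    }
}
```

In this model the late CHILD_UPDATED is ignored with NoNode and the CHILD_REMOVED event still arrives afterwards, so the exception alone would not prove that an event was missed. It also does not rule it out, which is why we are asking.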
