Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/01/06 02:22:55 UTC

[GitHub] [pinot] dang-stripe opened a new issue #7976: Brokers getting into stuck state when interrupted during OFFLINE -> ONLINE state transition

dang-stripe opened a new issue #7976:
URL: https://github.com/apache/pinot/issues/7976


   We've noticed that brokers get stuck when they receive a SIGTERM while the broker resource is transitioning from OFFLINE to ONLINE. The broker then stays stuck indefinitely and ignores subsequent SIGTERMs, so we end up having to kill the process with SIGKILL to recover it. Will Pinot/Helix retry state transitions on errors like this?
   
   Here's a log we found before this happened:
   ```
   2022/01/06 00:15:47.311 ERROR [BrokerResourceOnlineOfflineStateModelFactory] [HelixTaskExecutor-message_handle_thread] Caught exception while processing transition from OFFLINE to ONLINE for table: test_table_REALTIME
   org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
           at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1202) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1328) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:320) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:390) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.store.zk.AutoFallbackPropertyStore.get(AutoFallbackPropertyStore.java:101) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.pinot.common.metadata.ZKMetadataProvider.getTableConfig(ZKMetadataProvider.java:184) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.pinot.broker.routing.RoutingManager.buildRouting(RoutingManager.java:296) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.pinot.broker.broker.helix.BrokerResourceOnlineOfflineStateModelFactory$BrokerResourceOnlineOfflineStateModel.onBecomeOnlineFromOffline(BrokerResourceOnlineOfflineStateModelFactory.java:80) [pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   Caused by: java.lang.InterruptedException
   	at java.lang.Object.wait(Native Method) ~[?:?]
   	at java.lang.Object.wait(Object.java:328) ~[?:?]
   	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2129) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2160) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.helix.manager.zk.zookeeper.ZkConnection.readData(ZkConnection.java:136) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1340) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1190) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   	... 20 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #7976: Brokers getting into stuck state when interrupted during OFFLINE -> ONLINE state transition

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #7976:
URL: https://github.com/apache/pinot/issues/7976#issuecomment-1012681447


   Can you please take a thread dump after sending the SIGTERM and see which thread is preventing the broker from shutting down? I suspect the interruption might be getting swallowed by some thread.
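
   To illustrate what a swallowed interruption looks like in general, here is a minimal, self-contained sketch. It is not Pinot or Helix code; the class `InterruptDemo`, the methods `startWorker` and `stopsAfterInterrupt`, and the never-released latch are all hypothetical names made up for this example. The worker blocks in a retry loop: one variant catches `InterruptedException` and just retries (swallowing it), the other restores the interrupt status and returns.

   ```java
   import java.util.concurrent.CountDownLatch;

   public class InterruptDemo {

       // Hypothetical stand-in for a resource that never becomes available.
       private static final CountDownLatch NEVER_RELEASED = new CountDownLatch(1);

       // Starts a worker that blocks in a retry loop. If `restore` is false,
       // the InterruptedException is swallowed and the loop blocks again.
       static Thread startWorker(boolean restore) {
           Thread worker = new Thread(() -> {
               while (true) {
                   try {
                       NEVER_RELEASED.await(); // blocks until interrupted
                       return;
                   } catch (InterruptedException e) {
                       if (restore) {
                           // Re-set the flag so callers can see the interrupt, then stop.
                           Thread.currentThread().interrupt();
                           return;
                       }
                       // Swallowed: the interrupt status was cleared when the
                       // exception was thrown, so the loop simply blocks again.
                   }
               }
           }, restore ? "restoring-worker" : "swallowing-worker");
           worker.setDaemon(true); // daemon only so this demo JVM can exit
           worker.start();
           return worker;
       }

       // Interrupts the worker and reports whether it terminated within 500 ms.
       static boolean stopsAfterInterrupt(boolean restore) throws InterruptedException {
           Thread worker = startWorker(restore);
           Thread.sleep(100); // give the worker time to block
           worker.interrupt();
           worker.join(500);
           return !worker.isAlive();
       }

       public static void main(String[] args) throws InterruptedException {
           System.out.println("swallowing worker stopped: " + stopsAfterInterrupt(false));
           System.out.println("restoring worker stopped: " + stopsAfterInterrupt(true));
       }
   }
   ```

   In this sketch the swallowing worker stays alive after the interrupt while the restoring one exits. If the worker were a non-daemon thread, the swallowing variant would keep the JVM from exiting, which is the kind of thread a thread dump should reveal. Whether anything like this is actually happening in the broker is only a guess until we see the dump.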

