You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/14 07:27:41 UTC

[GitHub] [druid] imply-cheddar opened a new pull request, #13087: Don't wait for graceful shutdown on failed start

imply-cheddar opened a new pull request, #13087:
URL: https://github.com/apache/druid/pull/13087

   We saw a task with this log in it:
   
   ```
   2022-09-08T17:01:20,448 ERROR [main-EventThread] org.apache.druid.curator.CuratorModule - Unhandled error in Curator, stopping server.
   java.lang.NullPointerException: null
   	at org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185) ~[curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206) ~[curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50) ~[curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150) ~[curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) [curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) [curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) [curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222) [curator-framework-4.3.0.jar:4.3.0]
   	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598) [zookeeper-3.5.9.jar:3.5.9]
   	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.9.jar:3.5.9]
   ```
   
   But the server wasn't actually stopped for almost 1.5 hrs.  There's perhaps some sort of race with the startup going on, I'm not sure, but there's also not really a reason why we need to wait for a graceful shutdown in this instance.  The process is only going to be sad so might as well kill with reckless abandon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a diff in pull request #13087: Don't wait for graceful shutdown on failed start

Posted by GitBox <gi...@apache.org>.
gianm commented on code in PR #13087:
URL: https://github.com/apache/druid/pull/13087#discussion_r970847285


##########
server/src/main/java/org/apache/druid/curator/CuratorModule.java:
##########
@@ -97,6 +99,21 @@ public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, Curat
 
     framework.getUnhandledErrorListenable().addListener((message, e) -> {
       log.error(e, "Unhandled error in Curator, stopping server.");
+      final Thread halter = new Thread(
+          () -> {
+            try {
+              Threads.sleepFor(10, TimeUnit.SECONDS);
+            }
+            catch (InterruptedException ex) {
+
+            }
+            log.info("Normal shutdown didn't work, halting things NOW!!!!");

Review Comment:
   About the log message:
   
   1) Seems like more of a WARN than an INFO, especially given the original was an ERROR
   2) I'd write the message less colloquially, since people will see this message when they are debugging some sad thing, and a somber tone is called for
   
   Putting those together, how about:
   
   ```
   log.warn("Could not stop server within %d seconds after unhandled Curator error. Halting immediately.");
   ```



##########
server/src/main/java/org/apache/druid/curator/CuratorModule.java:
##########
@@ -97,6 +99,21 @@ public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, Curat
 
     framework.getUnhandledErrorListenable().addListener((message, e) -> {
       log.error(e, "Unhandled error in Curator, stopping server.");
+      final Thread halter = new Thread(
+          () -> {
+            try {
+              Threads.sleepFor(10, TimeUnit.SECONDS);

Review Comment:
   At first I thought 10s seemed short. But now I'm not so sure. I suppose it's fine. We're trying to shut down the server quickly; the only reason we're waiting any time at all is to give shutdown hooks a chance to run, clean up temporary files, close upstream connections cleanly, finish currently-executing queries, etc.
   
   Due to the "finish currently-executing queries" thing, maybe we should go with 30s?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] FrankChen021 commented on a diff in pull request #13087: Don't wait for graceful shutdown on failed start

Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on code in PR #13087:
URL: https://github.com/apache/druid/pull/13087#discussion_r990616037


##########
server/src/main/java/org/apache/druid/curator/CuratorModule.java:
##########
@@ -96,7 +98,26 @@ public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, Curat
     final CuratorFramework framework = createCurator(config);
 
     framework.getUnhandledErrorListenable().addListener((message, e) -> {
+      final long startTime = System.currentTimeMillis();
       log.error(e, "Unhandled error in Curator, stopping server.");
+      final Thread halter = new Thread(
+          () -> {
+            try {
+              Threads.sleepFor(30, TimeUnit.SECONDS);
+            }
+            catch (InterruptedException ex) {

Review Comment:
   ```suggestion
               catch (InterruptedException ignored) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] Don't wait for graceful shutdown on failed start (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 closed pull request #13087: Don't wait for graceful shutdown on failed start
URL: https://github.com/apache/druid/pull/13087


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on a diff in pull request #13087: Don't wait for graceful shutdown on failed start

Posted by GitBox <gi...@apache.org>.
imply-cheddar commented on code in PR #13087:
URL: https://github.com/apache/druid/pull/13087#discussion_r971411537


##########
server/src/main/java/org/apache/druid/curator/CuratorModule.java:
##########
@@ -97,6 +99,21 @@ public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, Curat
 
     framework.getUnhandledErrorListenable().addListener((message, e) -> {
       log.error(e, "Unhandled error in Curator, stopping server.");
+      final Thread halter = new Thread(
+          () -> {
+            try {
+              Threads.sleepFor(10, TimeUnit.SECONDS);

Review Comment:
   Generally speaking, this failure mode should only happen at startup.  I suppose it's possible for unhandled background exceptions to make it out later on in the lifetime of the process, but I'm only familiar with these happening at startup and indicating problems with ZK connectivity/configurations.  As such, I don't think that "finish currently executing queries" is really a thing.  That said, if you feel strongly about it, 30s is better than nothing existing so I'll do it to get it merged :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] Don't wait for graceful shutdown on failed start (druid)

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #13087:
URL: https://github.com/apache/druid/pull/13087#issuecomment-1784896579

   Closing this since #13258 was merged. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org