You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by GitBox <gi...@apache.org> on 2020/12/05 01:50:35 UTC

[GitHub] [samza] mynameborat commented on a change in pull request #1452: SAMZA-2611: [AM-HA] heartbeat reestablish causes container's heartbeat thread to die

mynameborat commented on a change in pull request #1452:
URL: https://github.com/apache/samza/pull/1452#discussion_r536478041



##########
File path: samza-core/src/main/java/org/apache/samza/container/ContainerHeartbeatMonitor.java
##########
@@ -92,8 +93,13 @@ public void start() {
       if (!response.isAlive()) {
         if (isApplicationMasterHighAvailabilityEnabled) {
           LOG.warn("Failed to establish connection with {}. Checking for new AM", coordinatorUrl);
-          if (checkAndEstablishConnectionWithNewAM()) {
-            return;
+          try {
+            if (checkAndEstablishConnectionWithNewAM()) {
+              return;
+            }
+          } catch (Exception e) {
+            LOG.warn("Exception trying to connect with new AM", e);

Review comment:
       Should be a `LOG.error` as this is fatal and should be treat as such.

##########
File path: samza-core/src/main/java/org/apache/samza/container/ContainerHeartbeatMonitor.java
##########
@@ -92,8 +93,13 @@ public void start() {
       if (!response.isAlive()) {
         if (isApplicationMasterHighAvailabilityEnabled) {
           LOG.warn("Failed to establish connection with {}. Checking for new AM", coordinatorUrl);
-          if (checkAndEstablishConnectionWithNewAM()) {
-            return;
+          try {
+            if (checkAndEstablishConnectionWithNewAM()) {
+              return;
+            }
+          } catch (Exception e) {
+            LOG.warn("Exception trying to connect with new AM", e);
+            throw new SamzaException(e);

Review comment:
       It will kill the thread and recreate one. I'd suggest instead to proceed with exit code as below except log the error of the exception too.
   
   Recreating the thread doesn't guarantee that the subsequent thread can recover from the previous error and potentially run into this in a loop of some sorts without any concrete action.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org