You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by "homatthew (via GitHub)" <gi...@apache.org> on 2023/06/27 05:06:03 UTC

[GitHub] [gobblin] homatthew opened a new pull request, #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

homatthew opened a new pull request, #3711:
URL: https://github.com/apache/gobblin/pull/3711

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [X] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-1847] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1847
   
   
   ### Description
   - [X] Here are some details about my PR, including screenshots (if applicable):
   If the JobLauncher encounters any errors, Gobblin code cleans up the working directory that contains Gobblin state. But we do not clean up the helix workflows. This can cause us to spin up helix workflows that do not have valid Gobblin State on HDFS / S3. 
   
   When this occurs, it causes rapid container recycling, which puts tremendous load on Zookeeper due to Helix API calls, causing a peak in ZK callbacks
   
   ![image](https://github.com/apache/gobblin/assets/35702680/cafba644-bc69-4603-a5c8-3f2ba5ed9e77)
   
   ### Tests
   - [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   ![image](https://github.com/apache/gobblin/assets/35702680/48db7f5e-38b4-4789-88d5-7163f0b7068b)
   
   
   ### Commits
   - [X] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244392764


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   If the workflow is complete correctly without exception throw out, doesn't that mean helix already marks the workflow as complete as it's in a clean state? what's the point to call cancel in that case? I believe in ingestion cases, We should never exit with complete, if we don't hit an exception, that means we already mark the workflow as deleted on the helix side and exit gracefully.  Removing work unit states make sense as no matter a workflow is complete or failed or canceled, we should do the housekeeping. 
   So I still recommend doing that in the exception block.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244373093


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   My question to you is what reason wouldn't clean up the state on Helix side after the completion of a job? 
   
   As of right now, we are doing all deletion of workflows on startup. When the AM shuts down it doesn't cancel anything on the Helix side (error or not). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244376380


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   Your approach does work though. If we don't want a flag, and GaaS fliptop also wants to cancel the workflow on error (I have a feeling this is not the case though), then yes we can do your approach without a flag.
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1243158430


##########
gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobLauncherTest.java:
##########
@@ -299,6 +301,25 @@ public void testTimeout() throws Exception {
     Assert.assertThrows(JobException.class, () -> gobblinHelixJobLauncher.launchJobImpl(null));
   }
 
+  public void testExitCancelsJob() throws Exception {
+    final ConcurrentHashMap<String, Boolean> runningMap = new ConcurrentHashMap<>();
+
+    final Properties props = generateJobProperties(this.baseConfig, "testTimeoutTest", "_12345");
+    props.setProperty(GobblinClusterConfigurationKeys.HELIX_WORKFLOW_SUBMISSION_TIMEOUT_SECONDS, "0");
+    props.setProperty(GobblinClusterConfigurationKeys.HELIX_WORKFLOW_CANCEL_ON_EXIT, "true");
+
+    final GobblinHelixJobLauncher gobblinHelixJobLauncher = this.closer.register(
+        new GobblinHelixJobLauncher(props, this.helixManager, this.appWorkDir, ImmutableList.<Tag<?>>of(), runningMap,
+            java.util.Optional.empty()));
+
+    // The launchJob will throw an exception (see testTimeout test) and we expect the launcher to swallow the exception,
+    // then call still properly call cancel. We use the listener to confirm the cancel hook was correctly called once
+    JobListener mockListener = Mockito.mock(JobListener.class);
+    gobblinHelixJobLauncher.launchJob(mockListener);
+    Mockito.verify(mockListener).onJobCancellation(Mockito.any(JobContext.class));

Review Comment:
   Fun fact: verify without a parameter verifies the method is called exactly once! 😃



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] codecov-commenter commented on pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#issuecomment-1610202659

   ## [Codecov](https://app.codecov.io/gh/apache/gobblin/pull/3711?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   > Merging [#3711](https://app.codecov.io/gh/apache/gobblin/pull/3711?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (7c42987) into [master](https://app.codecov.io/gh/apache/gobblin/commit/fc508ca272cc2ea0a9f5cdd28e72f60ce3c7912b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (fc508ca) will **decrease** coverage by `2.14%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3711      +/-   ##
   ============================================
   - Coverage     46.87%   44.73%   -2.14%     
   + Complexity    10771     2098    -8673     
   ============================================
     Files          2138      411    -1727     
     Lines         84139    17737   -66402     
     Branches       9357     2163    -7194     
   ============================================
   - Hits          39436     7934   -31502     
   + Misses        41102     8943   -32159     
   + Partials       3601      860    -2741     
   ```
   
   
   | [Impacted Files](https://app.codecov.io/gh/apache/gobblin/pull/3711?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [...bblin/cluster/GobblinClusterConfigurationKeys.java](https://app.codecov.io/gh/apache/gobblin/pull/3711?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJDb25maWd1cmF0aW9uS2V5cy5qYXZh) | `0.00% <ø> (ø)` | |
   | [...pache/gobblin/cluster/GobblinHelixJobLauncher.java](https://app.codecov.io/gh/apache/gobblin/pull/3711?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4Sm9iTGF1bmNoZXIuamF2YQ==) | `66.54% <100.00%> (+0.38%)` | :arrow_up: |
   
   ... and [1734 files with indirect coverage changes](https://app.codecov.io/gh/apache/gobblin/pull/3711/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1243158430


##########
gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinHelixJobLauncherTest.java:
##########
@@ -299,6 +301,25 @@ public void testTimeout() throws Exception {
     Assert.assertThrows(JobException.class, () -> gobblinHelixJobLauncher.launchJobImpl(null));
   }
 
+  public void testExitCancelsJob() throws Exception {
+    final ConcurrentHashMap<String, Boolean> runningMap = new ConcurrentHashMap<>();
+
+    final Properties props = generateJobProperties(this.baseConfig, "testTimeoutTest", "_12345");
+    props.setProperty(GobblinClusterConfigurationKeys.HELIX_WORKFLOW_SUBMISSION_TIMEOUT_SECONDS, "0");
+    props.setProperty(GobblinClusterConfigurationKeys.HELIX_WORKFLOW_CANCEL_ON_EXIT, "true");
+
+    final GobblinHelixJobLauncher gobblinHelixJobLauncher = this.closer.register(
+        new GobblinHelixJobLauncher(props, this.helixManager, this.appWorkDir, ImmutableList.<Tag<?>>of(), runningMap,
+            java.util.Optional.empty()));
+
+    // The launchJob will throw an exception (see testTimeout test) and we expect the launcher to swallow the exception,
+    // then call still properly call cancel. We use the listener to confirm the cancel hook was correctly called once
+    JobListener mockListener = Mockito.mock(JobListener.class);
+    gobblinHelixJobLauncher.launchJob(mockListener);
+    Mockito.verify(mockListener).onJobCancellation(Mockito.any(JobContext.class));

Review Comment:
   Fun fact: verify without a parameter verifies the method is called exactly once :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244374291


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   For example, we always clean up the Gobblin state at the end of the job in the finally block. It's not like there's a point to keeping this helix job around. In the edge case scenario this Helix flow runs, we will cause issues with ZK / Helix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244327913


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   I'm not getting the idea to put this under the final block and add a flag to determine the logic. Can we consider putting it under the catch block and check the value of isLaunched to determine whether we need to cancel the job or not? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244537029


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   Addressed in latest iteration to avoid unnecessary delete / cancel calls to Helix as we discussed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 merged pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 merged PR #3711:
URL: https://github.com/apache/gobblin/pull/3711


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3711: [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3711:
URL: https://github.com/apache/gobblin/pull/3711#discussion_r1244376380


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -462,10 +466,14 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       }
 
       // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call.
-      // One potential way to improve the error handling is to make this error swallowing conifgurable
+      // One potential way to improve the error handling is to make this error swallowing configurable
     } catch (Throwable t) {
       errorInJobLaunching = t;
     } finally {
+      if (isCancelWorkflowOnExitEnabled) {
+        cancelJob(jobListener);

Review Comment:
   Your approach does work though. If we don't want a flag, and GaaS fliptop also wants to cancel the workflow on error (I doubt this is the case), then yes we can do your approach without a flag.
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org