You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by "homatthew (via GitHub)" <gi...@apache.org> on 2023/04/12 20:01:27 UTC

[GitHub] [gobblin] homatthew opened a new pull request, #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

homatthew opened a new pull request, #3677:
URL: https://github.com/apache/gobblin/pull/3677

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-1813] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1813
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if applicable):
   
   #### Problem Statement:
    When Helix is under tremendous load, Helix workflows can easily exceed the 5 minute timeout. When the timeout is exceeded, our jobs clean up the Gobblin job state. Gobblin considers these workflows effectively dead. 
   
   If Helix were to schedule these workflows, the workflows will spin up and the tasks assigned will immediately have a `FileNotFoundException`, causing an immediate termination of the container. This ironically puts a greater load on ZK / Helix and can further increase the delays in future workflow submissions.
   
   #### Solution:
   The workflow timeouts need to be not hardcoded so that users can adjust the timeout based on their workflow needs and helix / zk cluster size. 
   
   This is just a single change as part of a series of changes for improving how Gobblin uses Helix. There are other issues that need to be addressed like:
   - Offline helix instance purging should occur before workflow submission
   - Helix lost node fencing should occur on each new ApplicationMaster, since Application masters do not hand over containers when an AM dies and is replaced
   - Helix workflows should have a configurable safeguard for how often a workflow can be resubmitted / replaced in order to prevent multiple external systems from doing unsafe concurrent mutations to a running workflow. A replanner should not occur too frequently because the work is redundant and the time between workflows is not sufficient for anything to change. 
   
   #### Caveats:
   This code is shared between GaaS and Kafka ETL code. The default timeout remains unchanged, and is backwards compatible. 
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       3. Subject is limited to 50 characters
       4. Subject does not end with a period
       5. Subject uses the imperative mood ("add", not "adding")
       6. Body wraps at 72 characters
       7. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] codecov-commenter commented on pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#issuecomment-1506117616

   ## [Codecov](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#3677](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8ec4d42) into [master](https://codecov.io/gh/apache/gobblin/commit/c50b527caeba98785e552966016f15dbc614b4da?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c50b527) will **increase** coverage by `2.52%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3677      +/-   ##
   ============================================
   + Coverage     46.85%   49.38%   +2.52%     
   + Complexity    10757     9291    -1466     
   ============================================
     Files          2140     1759     -381     
     Lines         84053    68451   -15602     
     Branches       9336     7803    -1533     
   ============================================
   - Hits          39385    33804    -5581     
   + Misses        41086    31506    -9580     
   + Partials       3582     3141     -441     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...bblin/cluster/GobblinClusterConfigurationKeys.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJDb25maWd1cmF0aW9uS2V5cy5qYXZh) | `0.00% <ø> (ø)` | |
   | [...er/GobblinHelixDistributeJobExecutionLauncher.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4RGlzdHJpYnV0ZUpvYkV4ZWN1dGlvbkxhdW5jaGVyLmphdmE=) | `59.37% <100.00%> (+0.97%)` | :arrow_up: |
   | [...pache/gobblin/cluster/GobblinHelixJobLauncher.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4Sm9iTGF1bmNoZXIuamF2YQ==) | `66.03% <100.00%> (+0.66%)` | :arrow_up: |
   | [.../gobblin/cluster/HelixRetriggeringJobCallable.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhSZXRyaWdnZXJpbmdKb2JDYWxsYWJsZS5qYXZh) | `64.41% <100.00%> (+0.44%)` | :arrow_up: |
   | [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `46.62% <100.00%> (+2.18%)` | :arrow_up: |
   
   ... and [406 files with indirect coverage changes](https://codecov.io/gh/apache/gobblin/pull/3677/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164658958


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixRetriggeringJobCallable.java:
##########
@@ -326,7 +327,10 @@ private void runJobExecutionLauncher() throws JobException {
         // make sure the planning job is initialized (or visible) to other parallel running threads,
         // so that the same critical section check (querying Helix for job completeness)
         // can be applied.
-        HelixUtils.waitJobInitialization(planningJobHelixManager, newPlanningId, newPlanningId);
+        Duration submissionTimeout = Duration.ofSeconds(PropertiesUtils

Review Comment:
   The planning job (i.e. distributed mode) is used in GaaS. This change is usable by GaaS in the event that GaaS also sees similar Helix / ZK bottlenecks. 
   
   The default is the same as the previous 5 minute wait, so it is backwards compatible



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164730248


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -447,11 +455,11 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
       if (this.runningMap.replace(this.jobContext.getJobName(), false, true)) {
         LOGGER.info("Job {} will be executed, add into running map.", this.jobContext.getJobId());
         isLaunched = true;
-        super.launchJob(jobListener);
+        launchJobImpl(jobListener);
       } else {
         LOGGER.warn("Job {} will not be executed because other jobs are still running.", this.jobContext.getJobId());
       }
-      // TODO: Better error handling
+      // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call
     } catch (Throwable t) {

Review Comment:
   Add todo here to make the behavior of swallowing exception configurable



##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -112,22 +117,11 @@ public static String getHelixInstanceName(
     return namePrefix + "_" + instanceId;
   }
 
-  // We have switched from Helix JobQueue to WorkFlow based job execution.
-  @Deprecated
-  public static void submitJobToQueue(
-      JobConfig.Builder jobConfigBuilder,
-      String queueName,
-      String jobName,
-      TaskDriver helixTaskDriver,
-      HelixManager helixManager,
-      long jobQueueDeleteTimeoutSeconds) throws Exception {
-    submitJobToWorkFlow(jobConfigBuilder, queueName, jobName, helixTaskDriver, helixManager, jobQueueDeleteTimeoutSeconds);
-  }
-
   static void waitJobInitialization(
       HelixManager helixManager,
       String workFlowName,
-      String jobName) throws Exception {
+      String jobName,
+      Duration timeout) throws Exception {

Review Comment:
   Just curious here why do we choose to use duration here? Why not use long directly to represent timeout mills?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164767309


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -112,22 +117,11 @@ public static String getHelixInstanceName(
     return namePrefix + "_" + instanceId;
   }
 
-  // We have switched from Helix JobQueue to WorkFlow based job execution.
-  @Deprecated
-  public static void submitJobToQueue(
-      JobConfig.Builder jobConfigBuilder,
-      String queueName,
-      String jobName,
-      TaskDriver helixTaskDriver,
-      HelixManager helixManager,
-      long jobQueueDeleteTimeoutSeconds) throws Exception {
-    submitJobToWorkFlow(jobConfigBuilder, queueName, jobName, helixTaskDriver, helixManager, jobQueueDeleteTimeoutSeconds);
-  }
-
   static void waitJobInitialization(
       HelixManager helixManager,
       String workFlowName,
-      String jobName) throws Exception {
+      String jobName,
+      Duration timeout) throws Exception {

Review Comment:
   There's nothing inherently wrong with representing as a long. The downside is there's no safeguards for passing in the wrong value when going from one form of time (e.g. `Second` to `Millisecond`).
   
   The resulting boiler plate becomes readable and "slightly" less error prone. I will admit It's all nits at the end of the day
   
   Let's consider this example
   ```
   long timeoutSecond = ConfigUtil.getLong(...)
   
   // option 1: multiply by magic number where the conversion isn't enforced by the compiler
   foobar(timeoutSecond * 1000) 
   
   // option 2: use native Duration java library which wouldn't let you forget to do the conversion. And the conversion is implicit
   foobarUsingDuration(Duration.ofSecond(timeoutSecond))
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164769313


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -235,7 +229,8 @@ public static void submitJobToWorkFlow(JobConfig.Builder jobConfigBuilder,
       String jobName,
       TaskDriver helixTaskDriver,
       HelixManager helixManager,
-      long workFlowExpiryTime) throws Exception {
+      long workFlowExpiryTime,

Review Comment:
   Leaving this as a long is inconsistent. I'll change this to duration



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164659387


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -57,7 +49,20 @@
 import org.apache.helix.task.WorkflowContext;
 import org.apache.helix.tools.ClusterSetup;
 
-import static org.apache.helix.task.TaskState.*;
+import com.google.common.collect.Lists;
+
+import lombok.extern.slf4j.Slf4j;
+
+import org.apache.gobblin.configuration.ConfigurationKeys;
+import org.apache.gobblin.metrics.Tag;
+import org.apache.gobblin.metrics.event.TimingEvent;
+import org.apache.gobblin.runtime.JobException;
+import org.apache.gobblin.runtime.JobState;
+import org.apache.gobblin.runtime.listeners.JobListener;
+import org.apache.gobblin.util.Id;
+import org.apache.gobblin.util.JobLauncherUtils;
+
+import static org.apache.helix.task.TaskState.STOPPED;

Review Comment:
   Imports reordered to follow Gobblin OSS standards



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 merged pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 merged PR #3677:
URL: https://github.com/apache/gobblin/pull/3677


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable

Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164660571


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -112,22 +117,11 @@ public static String getHelixInstanceName(
     return namePrefix + "_" + instanceId;
   }
 
-  // We have switched from Helix JobQueue to WorkFlow based job execution.
-  @Deprecated
-  public static void submitJobToQueue(
-      JobConfig.Builder jobConfigBuilder,
-      String queueName,
-      String jobName,
-      TaskDriver helixTaskDriver,
-      HelixManager helixManager,
-      long jobQueueDeleteTimeoutSeconds) throws Exception {
-    submitJobToWorkFlow(jobConfigBuilder, queueName, jobName, helixTaskDriver, helixManager, jobQueueDeleteTimeoutSeconds);
-  }
-
   static void waitJobInitialization(
       HelixManager helixManager,
       String workFlowName,
-      String jobName) throws Exception {
+      String jobName,
+      Duration timeout) throws Exception {

Review Comment:
   I also considered having an overloaded method with a default timeout. But I think it makes more sense to enforce all callers of this method to specify a timeout / have that timeout be configurable. Code changes are a slow way to fix prod issues, and config is preferred



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org