You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by "homatthew (via GitHub)" <gi...@apache.org> on 2023/04/12 20:01:27 UTC
[GitHub] [gobblin] homatthew opened a new pull request, #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
homatthew opened a new pull request, #3677:
URL: https://github.com/apache/gobblin/pull/3677
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
### JIRA
- [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-1813] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1813
### Description
- [ ] Here are some details about my PR, including screenshots (if applicable):
#### Problem Statement:
When Helix is under tremendous load, Helix workflows can easily exceed the 5 minute timeout. When the timeout is exceeded, our jobs clean up the Gobblin job state. Gobblin considers these workflows effectively dead.
If Helix were to schedule these workflows, the workflows will spin up and the tasks assigned will immediately have a `FileNotFoundException`, causing an immediate termination of the container. This ironically puts a greater load on ZK / Helix and can further increase the delays in future workflow submissions.
#### Solution:
The workflow timeouts need to be not hardcoded so that users can adjust the timeout based on their workflow needs and helix / zk cluster size.
This is just a single change as part of a series of changes for improving how Gobblin uses Helix. There are other issues that need to be addressed like:
- Offline helix instance purging should occur before workflow submission
- Helix lost node fencing should occur on each new ApplicationMaster, since Application masters do not hand over containers when an AM dies and is replaced
- Helix workflows should have a configurable safeguard for how often a workflow can be resubmitted / replaced in order to prevent multiple external systems from doing unsafe concurrent mutations to a running workflow. A replanner should not occur too frequently because the work is redundant and the time between workflows is not sufficient for anything to change.
#### Caveats:
This code is shared between GaaS and Kafka ETL code. The default timeout remains unchanged, and is backwards compatible.
### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
### Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
3. Subject is limited to 50 characters
4. Subject does not end with a period
5. Subject uses the imperative mood ("add", not "adding")
6. Body wraps at 72 characters
7. Body explains "what" and "why", not "how"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] codecov-commenter commented on pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#issuecomment-1506117616
## [Codecov](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#3677](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8ec4d42) into [master](https://codecov.io/gh/apache/gobblin/commit/c50b527caeba98785e552966016f15dbc614b4da?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c50b527) will **increase** coverage by `2.52%`.
> The diff coverage is `100.00%`.
```diff
@@ Coverage Diff @@
## master #3677 +/- ##
============================================
+ Coverage 46.85% 49.38% +2.52%
+ Complexity 10757 9291 -1466
============================================
Files 2140 1759 -381
Lines 84053 68451 -15602
Branches 9336 7803 -1533
============================================
- Hits 39385 33804 -5581
+ Misses 41086 31506 -9580
+ Partials 3582 3141 -441
```
| [Impacted Files](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [...bblin/cluster/GobblinClusterConfigurationKeys.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJDb25maWd1cmF0aW9uS2V5cy5qYXZh) | `0.00% <ø> (ø)` | |
| [...er/GobblinHelixDistributeJobExecutionLauncher.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4RGlzdHJpYnV0ZUpvYkV4ZWN1dGlvbkxhdW5jaGVyLmphdmE=) | `59.37% <100.00%> (+0.97%)` | :arrow_up: |
| [...pache/gobblin/cluster/GobblinHelixJobLauncher.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4Sm9iTGF1bmNoZXIuamF2YQ==) | `66.03% <100.00%> (+0.66%)` | :arrow_up: |
| [.../gobblin/cluster/HelixRetriggeringJobCallable.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhSZXRyaWdnZXJpbmdKb2JDYWxsYWJsZS5qYXZh) | `64.41% <100.00%> (+0.44%)` | :arrow_up: |
| [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/gobblin/pull/3677?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `46.62% <100.00%> (+2.18%)` | :arrow_up: |
... and [406 files with indirect coverage changes](https://codecov.io/gh/apache/gobblin/pull/3677/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
:mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164658958
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixRetriggeringJobCallable.java:
##########
@@ -326,7 +327,10 @@ private void runJobExecutionLauncher() throws JobException {
// make sure the planning job is initialized (or visible) to other parallel running threads,
// so that the same critical section check (querying Helix for job completeness)
// can be applied.
- HelixUtils.waitJobInitialization(planningJobHelixManager, newPlanningId, newPlanningId);
+ Duration submissionTimeout = Duration.ofSeconds(PropertiesUtils
Review Comment:
The planning job (i.e. distributed mode) is used in GaaS. This change is usable by GaaS in the event that GaaS also sees similar Helix / ZK bottlenecks.
The default is the same as the previous 5 minute wait, so it is backwards compatible
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164730248
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java:
##########
@@ -447,11 +455,11 @@ public void launchJob(@Nullable JobListener jobListener) throws JobException {
if (this.runningMap.replace(this.jobContext.getJobName(), false, true)) {
LOGGER.info("Job {} will be executed, add into running map.", this.jobContext.getJobId());
isLaunched = true;
- super.launchJob(jobListener);
+ launchJobImpl(jobListener);
} else {
LOGGER.warn("Job {} will not be executed because other jobs are still running.", this.jobContext.getJobId());
}
- // TODO: Better error handling
+ // TODO: Better error handling. The current impl swallows exceptions for jobs that were started by this method call
} catch (Throwable t) {
Review Comment:
Add todo here to make the behavior of swallowing exception configurable
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -112,22 +117,11 @@ public static String getHelixInstanceName(
return namePrefix + "_" + instanceId;
}
- // We have switched from Helix JobQueue to WorkFlow based job execution.
- @Deprecated
- public static void submitJobToQueue(
- JobConfig.Builder jobConfigBuilder,
- String queueName,
- String jobName,
- TaskDriver helixTaskDriver,
- HelixManager helixManager,
- long jobQueueDeleteTimeoutSeconds) throws Exception {
- submitJobToWorkFlow(jobConfigBuilder, queueName, jobName, helixTaskDriver, helixManager, jobQueueDeleteTimeoutSeconds);
- }
-
static void waitJobInitialization(
HelixManager helixManager,
String workFlowName,
- String jobName) throws Exception {
+ String jobName,
+ Duration timeout) throws Exception {
Review Comment:
Just curious here why do we choose to use duration here? Why not use long directly to represent timeout mills?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164767309
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -112,22 +117,11 @@ public static String getHelixInstanceName(
return namePrefix + "_" + instanceId;
}
- // We have switched from Helix JobQueue to WorkFlow based job execution.
- @Deprecated
- public static void submitJobToQueue(
- JobConfig.Builder jobConfigBuilder,
- String queueName,
- String jobName,
- TaskDriver helixTaskDriver,
- HelixManager helixManager,
- long jobQueueDeleteTimeoutSeconds) throws Exception {
- submitJobToWorkFlow(jobConfigBuilder, queueName, jobName, helixTaskDriver, helixManager, jobQueueDeleteTimeoutSeconds);
- }
-
static void waitJobInitialization(
HelixManager helixManager,
String workFlowName,
- String jobName) throws Exception {
+ String jobName,
+ Duration timeout) throws Exception {
Review Comment:
There's nothing inherently wrong with representing as a long. The downside is there's no safeguards for passing in the wrong value when going from one form of time (e.g. `Second` to `Millisecond`).
The resulting boiler plate becomes readable and "slightly" less error prone. I will admit It's all nits at the end of the day
Let's consider this example
```
long timeoutSecond = ConfigUtil.getLong(...)
// option 1: multiply by magic number where the conversion isn't enforced by the compiler
foobar(timeoutSecond * 1000)
// option 2: use native Duration java library which wouldn't let you forget to do the conversion. And the conversion is implicit
foobarUsingDuration(Duration.ofSecond(timeoutSecond))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164769313
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -235,7 +229,8 @@ public static void submitJobToWorkFlow(JobConfig.Builder jobConfigBuilder,
String jobName,
TaskDriver helixTaskDriver,
HelixManager helixManager,
- long workFlowExpiryTime) throws Exception {
+ long workFlowExpiryTime,
Review Comment:
Leaving this as a long is inconsistent. I'll change this to duration
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164659387
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -57,7 +49,20 @@
import org.apache.helix.task.WorkflowContext;
import org.apache.helix.tools.ClusterSetup;
-import static org.apache.helix.task.TaskState.*;
+import com.google.common.collect.Lists;
+
+import lombok.extern.slf4j.Slf4j;
+
+import org.apache.gobblin.configuration.ConfigurationKeys;
+import org.apache.gobblin.metrics.Tag;
+import org.apache.gobblin.metrics.event.TimingEvent;
+import org.apache.gobblin.runtime.JobException;
+import org.apache.gobblin.runtime.JobState;
+import org.apache.gobblin.runtime.listeners.JobListener;
+import org.apache.gobblin.util.Id;
+import org.apache.gobblin.util.JobLauncherUtils;
+
+import static org.apache.helix.task.TaskState.STOPPED;
Review Comment:
Imports reordered to follow Gobblin OSS standards
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] ZihanLi58 merged pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 merged PR #3677:
URL: https://github.com/apache/gobblin/pull/3677
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] homatthew commented on a diff in pull request #3677: [GOBBLIN-1813] Helix workflows submission timeouts are configurable
Posted by "homatthew (via GitHub)" <gi...@apache.org>.
homatthew commented on code in PR #3677:
URL: https://github.com/apache/gobblin/pull/3677#discussion_r1164660571
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java:
##########
@@ -112,22 +117,11 @@ public static String getHelixInstanceName(
return namePrefix + "_" + instanceId;
}
- // We have switched from Helix JobQueue to WorkFlow based job execution.
- @Deprecated
- public static void submitJobToQueue(
- JobConfig.Builder jobConfigBuilder,
- String queueName,
- String jobName,
- TaskDriver helixTaskDriver,
- HelixManager helixManager,
- long jobQueueDeleteTimeoutSeconds) throws Exception {
- submitJobToWorkFlow(jobConfigBuilder, queueName, jobName, helixTaskDriver, helixManager, jobQueueDeleteTimeoutSeconds);
- }
-
static void waitJobInitialization(
HelixManager helixManager,
String workFlowName,
- String jobName) throws Exception {
+ String jobName,
+ Duration timeout) throws Exception {
Review Comment:
I also considered having an overloaded method with a default timeout. But I think it makes more sense to enforce all callers of this method to specify a timeout / have that timeout be configurable. Code changes are a slow way to fix prod issues, and config is preferred
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org