You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by "Will-Lo (via GitHub)" <gi...@apache.org> on 2023/03/14 20:11:57 UTC
[GitHub] [gobblin] Will-Lo opened a new pull request, #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed
Will-Lo opened a new pull request, #3661:
URL: https://github.com/apache/gobblin/pull/3661
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
### JIRA
- [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1800
### Description
- [x] Here are some details about my PR, including screenshots (if applicable):
There are 2 SLAs in GaaS flows:
1. Job Start SLA, which measures the time in between when GaaS orchestrates a job to when it receives its first event from the job execution. This can catch scenarios such as container bootstrap time taking longer, event emission through Kafka failing, unresponsive executors.
2. Flow SLA, which measures the e2e time of a flow from start to end. If the job is stuck or taking too long to run past this setting, it fails to save resources on the executor side.
When a user defines `job.maxAttempts` in their GaaS flow, the expected behavior should be that these flows are retried to increase resiliency.
### Tests
- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
Added unit tests in `DagManagerFlowTest.java`
### Commits
- [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed
Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1140794314
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -853,27 +852,40 @@ private boolean killJobIfOrphaned(DagNode<JobExecutionPlan> node, JobStatus jobS
DagManagerUtils.getFullyQualifiedDagName(node),
timeOutForJobStart);
dagManagerMetrics.incrementCountsStartSlaExceeded(node);
- cancelDagNode(node);
String dagId = DagManagerUtils.generateDagId(node).toString();
- this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED);
+ this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);
Review Comment:
This one seems like a discrepancy, if we want to retry because job start sla exceeded, we will set dagNode with pending_retry but flow status to be flow_cancelled?
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -911,9 +923,8 @@ private boolean slaKillIfNeeded(DagNode<JobExecutionPlan> node) throws Execution
node.getValue().getJobSpec().getConfig().getString(ConfigurationKeys.FLOW_NAME_KEY), flowSla,
node.getValue().getJobSpec().getConfig().getString(ConfigurationKeys.JOB_NAME_KEY));
dagManagerMetrics.incrementExecutorSlaExceeded(node);
- cancelDagNode(node);
- this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED);
+ this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);
Review Comment:
Same here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed
Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1136218532
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -687,7 +687,6 @@ private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws Exe
Future future = dagNodeToCancel.getValue().getJobFuture().get();
String serializedFuture = DagManagerUtils.getSpecProducer(dagNodeToCancel).serializeAddSpecResponse(future);
props.put(ConfigurationKeys.SPEC_PRODUCER_SERIALIZED_FUTURE, serializedFuture);
- sendCancellationEvent(dagNodeToCancel.getValue());
Review Comment:
Have concerns on this line. After removing this, are we able to handle normal cancel requests?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] codecov-commenter commented on pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed
Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#issuecomment-1468788776
## [Codecov](https://codecov.io/gh/apache/gobblin/pull/3661?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#3661](https://codecov.io/gh/apache/gobblin/pull/3661?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4c43028) into [master](https://codecov.io/gh/apache/gobblin/commit/be6d46c2a117b1248d88d22cc64db5a12b802d16?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (be6d46c) will **decrease** coverage by `2.18%`.
> The diff coverage is `n/a`.
```diff
@@ Coverage Diff @@
## master #3661 +/- ##
============================================
- Coverage 46.85% 44.67% -2.18%
+ Complexity 10749 2090 -8659
============================================
Files 2138 411 -1727
Lines 83989 17689 -66300
Branches 9331 2155 -7176
============================================
- Hits 39353 7903 -31450
+ Misses 41054 8931 -32123
+ Partials 3582 855 -2727
```
[see 1730 files with indirect coverage changes](https://codecov.io/gh/apache/gobblin/pull/3661/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
:mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] Will-Lo commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed
Posted by "Will-Lo (via GitHub)" <gi...@apache.org>.
Will-Lo commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1136337532
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -687,7 +687,6 @@ private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws Exe
Future future = dagNodeToCancel.getValue().getJobFuture().get();
String serializedFuture = DagManagerUtils.getSpecProducer(dagNodeToCancel).serializeAddSpecResponse(future);
props.put(ConfigurationKeys.SPEC_PRODUCER_SERIALIZED_FUTURE, serializedFuture);
- sendCancellationEvent(dagNodeToCancel.getValue());
Review Comment:
crap good catch, I didn't realize it was also used for manual kill requests as well. Will add a parameter to not send the event on SLA kill and let the pollAndAdvanceDag() make the decision there
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [gobblin] umustafi commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed
Posted by "umustafi (via GitHub)" <gi...@apache.org>.
umustafi commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1175638739
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -681,18 +681,20 @@ private void cancelDag(DagId dagId) throws ExecutionException, InterruptedExcept
clearUpDagAction(dagId);
}
- private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws ExecutionException, InterruptedException {
+ private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel, boolean shouldSendCancellationEvent) throws ExecutionException, InterruptedException {
Review Comment:
can you add a comment to explain why you added this boolean and what case we don't send cancellation event even though this function is called?
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -853,27 +852,40 @@ private boolean killJobIfOrphaned(DagNode<JobExecutionPlan> node, JobStatus jobS
DagManagerUtils.getFullyQualifiedDagName(node),
timeOutForJobStart);
dagManagerMetrics.incrementCountsStartSlaExceeded(node);
- cancelDagNode(node);
String dagId = DagManagerUtils.generateDagId(node).toString();
- this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED);
+ this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);
Review Comment:
we're not calling `cancelDagNode` but still updating flow status also
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org