You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by "Will-Lo (via GitHub)" <gi...@apache.org> on 2023/03/14 20:11:57 UTC

[GitHub] [gobblin] Will-Lo opened a new pull request, #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed

Will-Lo opened a new pull request, #3661:
URL: https://github.com/apache/gobblin/pull/3661

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1800
   
   ### Description
   - [x] Here are some details about my PR, including screenshots (if applicable):
   
   There are 2 SLAs in GaaS flows:
   1. Job Start SLA, which measures the time in between when GaaS orchestrates a job to when it receives its first event from the job execution. This can catch scenarios such as container bootstrap time taking longer, event emission through Kafka failing, unresponsive executors.
   2. Flow SLA, which measures the e2e time of a flow from start to end. If the job is stuck or taking too long to run past this setting, it fails to save resources on the executor side.
   
   When a user defines `job.maxAttempts` in their GaaS flow, the expected behavior should be that these flows are retried to increase resiliency. 
   
   
   
   ### Tests
   - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   Added unit tests in `DagManagerFlowTest.java`
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1140794314


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -853,27 +852,40 @@ private boolean killJobIfOrphaned(DagNode<JobExecutionPlan> node, JobStatus jobS
             DagManagerUtils.getFullyQualifiedDagName(node),
             timeOutForJobStart);
         dagManagerMetrics.incrementCountsStartSlaExceeded(node);
-        cancelDagNode(node);
 
         String dagId = DagManagerUtils.generateDagId(node).toString();
-        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED);
+        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);

Review Comment:
   This one seems like a discrepancy, if we want to retry because job start sla exceeded, we will set dagNode with pending_retry but flow status to be flow_cancelled? 



##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -911,9 +923,8 @@ private boolean slaKillIfNeeded(DagNode<JobExecutionPlan> node) throws Execution
             node.getValue().getJobSpec().getConfig().getString(ConfigurationKeys.FLOW_NAME_KEY), flowSla,
             node.getValue().getJobSpec().getConfig().getString(ConfigurationKeys.JOB_NAME_KEY));
         dagManagerMetrics.incrementExecutorSlaExceeded(node);
-        cancelDagNode(node);
 
-        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED);
+        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);

Review Comment:
   Same here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1136218532


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -687,7 +687,6 @@ private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws Exe
         Future future = dagNodeToCancel.getValue().getJobFuture().get();
         String serializedFuture = DagManagerUtils.getSpecProducer(dagNodeToCancel).serializeAddSpecResponse(future);
         props.put(ConfigurationKeys.SPEC_PRODUCER_SERIALIZED_FUTURE, serializedFuture);
-        sendCancellationEvent(dagNodeToCancel.getValue());

Review Comment:
   Have concerns on this line. After removing this, are we able to handle normal cancel requests?  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] codecov-commenter commented on pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#issuecomment-1468788776

   ## [Codecov](https://codecov.io/gh/apache/gobblin/pull/3661?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#3661](https://codecov.io/gh/apache/gobblin/pull/3661?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4c43028) into [master](https://codecov.io/gh/apache/gobblin/commit/be6d46c2a117b1248d88d22cc64db5a12b802d16?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (be6d46c) will **decrease** coverage by `2.18%`.
   > The diff coverage is `n/a`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3661      +/-   ##
   ============================================
   - Coverage     46.85%   44.67%   -2.18%     
   + Complexity    10749     2090    -8659     
   ============================================
     Files          2138      411    -1727     
     Lines         83989    17689   -66300     
     Branches       9331     2155    -7176     
   ============================================
   - Hits          39353     7903   -31450     
   + Misses        41054     8931   -32123     
   + Partials       3582      855    -2727     
   ```
   
   
   [see 1730 files with indirect coverage changes](https://codecov.io/gh/apache/gobblin/pull/3661/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] Will-Lo commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed

Posted by "Will-Lo (via GitHub)" <gi...@apache.org>.
Will-Lo commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1136337532


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -687,7 +687,6 @@ private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws Exe
         Future future = dagNodeToCancel.getValue().getJobFuture().get();
         String serializedFuture = DagManagerUtils.getSpecProducer(dagNodeToCancel).serializeAddSpecResponse(future);
         props.put(ConfigurationKeys.SPEC_PRODUCER_SERIALIZED_FUTURE, serializedFuture);
-        sendCancellationEvent(dagNodeToCancel.getValue());

Review Comment:
   crap good catch, I didn't realize it was also used for manual kill requests as well. Will add a parameter to not send the event on SLA kill and let the pollAndAdvanceDag() make the decision there



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] umustafi commented on a diff in pull request #3661: [GOBBLIN-1800] Fix bug where retries would not consider jobs that are SLA killed

Posted by "umustafi (via GitHub)" <gi...@apache.org>.
umustafi commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1175638739


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -681,18 +681,20 @@ private void cancelDag(DagId dagId) throws ExecutionException, InterruptedExcept
       clearUpDagAction(dagId);
     }
 
-    private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws ExecutionException, InterruptedException {
+    private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel, boolean shouldSendCancellationEvent) throws ExecutionException, InterruptedException {

Review Comment:
   can you add a comment to explain why you added this boolean and what case we don't send cancellation event even though this function is called?



##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -853,27 +852,40 @@ private boolean killJobIfOrphaned(DagNode<JobExecutionPlan> node, JobStatus jobS
             DagManagerUtils.getFullyQualifiedDagName(node),
             timeOutForJobStart);
         dagManagerMetrics.incrementCountsStartSlaExceeded(node);
-        cancelDagNode(node);
 
         String dagId = DagManagerUtils.generateDagId(node).toString();
-        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED);
+        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);

Review Comment:
   we're not calling `cancelDagNode` but still updating flow status also



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org