You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2020/11/17 18:39:39 UTC

[GitHub] [incubator-gobblin] yukuai518 commented on a change in pull request #3155: [GOBBLIN-1319] fix cancellation in gobblin cluster jobs

yukuai518 commented on a change in pull request #3155:
URL: https://github.com/apache/incubator-gobblin/pull/3155#discussion_r525397700



##########
File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java
##########
@@ -367,6 +369,33 @@ public void handleDeleteJobConfigArrival(DeleteJobConfigArrivalEvent deleteJobAr
     }
   }
 
+  @Subscribe
+  public void handleCancelJobConfigArrival(CancelJobConfigArrivalEvent cancelJobArrival)
+      throws InterruptedException {
+    String jobUri = cancelJobArrival.getJoburi();
+    LOGGER.info("Received cancel for job configuration of job " + jobUri);
+    Optional<String> planningJob = Optional.empty();
+    Optional<String> actualJob = Optional.empty();
+
+    try {
+      planningJob = this.jobsMapping.getPlanningJobId(jobUri);
+      actualJob = this.jobsMapping.getActualJobId(jobUri);
+    } catch (IOException e) {
+      LOGGER.warn("Planning and actual jobs could not be retrieved for job {}", jobUri);
+      return;
+    }
+
+    if (planningJob.isPresent()) {
+      LOGGER.info("Cancelling planning job helix workflow: {}", planningJob.get());
+      new TaskDriver(this.taskDriverHelixManager.get()).waitToStop(planningJob.get(), this.helixJobStopTimeoutMillis);
+    }
+
+    if (actualJob.isPresent()) {
+      LOGGER.info("Cancelling actual job helix workflow: {}", actualJob.get());
+      new TaskDriver(this.jobHelixManager).waitToStop(actualJob.get(), this.helixJobStopTimeoutMillis);
+    }

Review comment:
       I have several questions:
   1. What is the difference between this logic vs cancelJobIfRequired() method, can we reuse some part of the logic?
   2. I don't know if it is safe to stop the actual job twice. When you cancel the planningJob, the GobblinHelixJobLauncher on the task driver node will receive a cancellation message and invoke its own executeCancellation() method, which will further cancel the actual job on the worker node. If we cancel the actual job directly here, do we have any conflict? Will there be any race condition needs some scrutify?
   3. Can you provide some comments on which method will be used by GaaS and which one will be the opposite since you mentioned you want a separate path?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org