You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by shuai-xu <gi...@git.apache.org> on 2017/02/22 06:31:33 UTC

[GitHub] flink pull request #3385: [FLINK-5501] JM use running job registry to determ...

GitHub user shuai-xu opened a pull request:

    https://github.com/apache/flink/pull/3385

    [FLINK-5501] JM use running job registry to determine whether is the first running

    This pr if for jira-#[5501](https://issues.apache.org/jira/browse/FLINK-5501).
    
    The main changes are:
    1. Add interface isJobFinished() and clearJob() to RunningJobRegistry and implement them.
    2. After grantLeadership, JMRunner will first check whether the job is finished, if finished, it means that other JM has finished the job, it only need to exist.
    3. Then JMRunner will check whether the job is running, if running, it means other JM has run it, but not succeeded, so it need to recover it.
    4. If the job is not running, it mean the first running, the JMRunner will setJobRunning in RunningJobRegistry.
    5. After job finished, will clear the job state from RunningJobRegistry 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shuai-xu/flink jira-5501

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3385.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3385
    
----
commit 7c5068e7ea0592f3ba0527d3d363c7cf4653713d
Author: shuai.xus <sh...@alibaba-inc.com>
Date:   2017-02-22T06:15:43Z

    [FLINK-5501] JM use running job registry to determine whether is the first running

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by shuai-xu <gi...@git.apache.org>.
Github user shuai-xu commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    @StephanEwen , Thank you very much, sorry for the test break, next time I will be more careful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    With the problem observed above, I think we should change the approach a bit:
    
      - The registry should have an enum that it returns: `getJobSchedulingStatus` or so, which can be `PENDING`, `RUNNING`, and `DONE`. That way there is only one access to the registry and we don't have the problem that the internal status is changed between checks.
    
      - The file-based registry would create one file for the transition to `RUNNING` and another for the transition to `DONE`. Important is that the transition to `DONE` does not remove the file for `RUNNING`. The status check checks backwards - first for the `DONE` file, then for the `RUNNING` file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    One issue I think can happen in practice is that the checks "isRunning" and "isFinished" are not atomic. Imagine this scenario:
    
      - job is running
      - JobManager checks "isFinished" -> false
      - job finishes
      - JobManager checks "isRunning" -> false
      - JobManager starts job = bug



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    One test case seemed to be failing in this PR:
    I have merged the PR to my local repository, fixed the test, and added some fixes/cleanups on top.
    Will merge back to Flink master tomorrow...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    Thanks!
    I think I can take this over now...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    I would like to merge this and make a few edits on top...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3385: [FLINK-5501] JM use running job registry to determ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/3385


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3385: [FLINK-5501] JM use running job registry to determ...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3385#discussion_r102993092
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/ZookeeperRegistry.java ---
    @@ -55,7 +59,7 @@ public void setJobRunning(JobID jobID) throws IOException {
     		try {
     			String zkPath = runningJobPath + jobID.toString();
     			this.client.newNamespaceAwareEnsurePath(zkPath).ensure(client.getZookeeperClient());
    -			this.client.setData().forPath(zkPath);
    +			this.client.setData().forPath(zkPath, RUNNING.getBytes());
    --- End diff --
    
    String to bytes conversion (and bytes to string) must always explicitly specify the encoding (Charset). Otherwise, there can be mismatches when different machines configure different default Charsets.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    PR looks like a good start, but I think we need to add a few things on top:
    
      - The file-based registry cannot distinguish between "job created but not running" and "job running". This distinction is important to decide whether to start reconciliation.
      - There are currently no tests for the extended functionality


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3385: [FLINK-5501] JM use running job registry to determine whe...

Posted by shuai-xu <gi...@git.apache.org>.
Github user shuai-xu commented on the issue:

    https://github.com/apache/flink/pull/3385
  
    hi @StephanEwen , thank for you review, I modify it according to your comments, add getJobSchedulingStatus  to it and add tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---