You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by EronWright <gi...@git.apache.org> on 2017/08/18 01:50:39 UTC

[GitHub] flink pull request #4560: Flink 7077

GitHub user EronWright opened a pull request:

    https://github.com/apache/flink/pull/4560

    Flink 7077

    ## Description
    
    _This PR extends #4555 - please disregard commits from Aug 16, 2017._ 
    
    [FLINK-7077] [mesos] Implement task release to support dynamic scaling
        
        - SlotManager: fix for idleness tracking (`markIdle` shouldn't reset `idleSince` on every call)
        - ResourceManager: change `stopWorker` method to use `ResourceID`
        - ResourceManager: schedule callbacks from `ResourceManagerActions` onto main thread
        - MesosResourceManager: implement `stopWorker`
        - MesosResourceManager: fix for message routing from child actors to RM
    
    This change added tests and can be verified as follows:
    - `MesosResourceManagerTest::testStopWorker`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): **no**
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
      - The serializers: **no**
      - The runtime per-record code paths (performance sensitive): **no**
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
    
    ## Documentation
    
      - Does this pull request introduce a new feature? **no**


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/EronWright/flink FLINK-7077

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4560.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4560
    
----
commit 7d257db84f4b9bf2a02d1375a04ff64516266186
Author: Wright, Eron <er...@emc.com>
Date:   2017-08-16T21:30:24Z

    [FLINK-6630] Implement FLIP-6 MesosAppMasterRunner
    [FLINK-6631] Implement FLIP-6 MesosTaskExecutorRunner
    
    - bin: new entrypoints scripts for flip-6
    - ClusterEntrypoint: Refactor the shutdown method
    - ClusterEntrypoint: Install default FileSystem (for parity with legacy entrypoints)
    - ClusterEntrypoint: new MesosJobClusterEntrypoint, MesosSessionClusterEntrypoint, MesosEntrypointUtils, MesosTaskExecutorRunner
    - MesosServices: enhanced with artifactServer, localActorSystem
    - MesosResourceManager: Fallback to old TM params when UNKNOWN resource profile is provided
    - MesosResourceManager: config setting for taskmanager startup script (mesos.resourcemanager.tasks.taskmanager-cmd)
    - test: added a 'noop' job graph for testing purposes

commit 4cbcde3095774f4ea6484a4cfd07df613fe08d30
Author: Wright, Eron <er...@emc.com>
Date:   2017-08-16T22:40:47Z

    [FLINK-6630] [Mesos] Implement FLIP-6 MesosAppMasterRunner
    
    - removed `streaming-noop-3.graph` since it can be generated using `StreamingNoop` program.

commit 691d04efb0d1a44db8b0ef11a504355e6e3d49aa
Author: Wright, Eron <er...@emc.com>
Date:   2017-08-18T01:22:55Z

    [FLINK-7077] [mesos] Implement task release to support dynamic scaling
    
    - SlotManager: fix for idleness tracking (`markIdle` shouldn't reset `idleSince` on every call)
    - ResourceManager: change `stopWorker` method to use `ResourceID`
    - ResourceManager: schedule callbacks from `ResourceManagerActions` onto main thread
    - MesosResourceManager: implement `stopWorker`
    - MesosResourceManager: fix for message routing from child actors to RM

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #4560: Flink 7077

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4560#discussion_r133980216
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java ---
    @@ -960,12 +965,26 @@ public void handleError(final Exception exception) {
     
     		@Override
     		public void releaseResource(InstanceID instanceId) {
    -			stopWorker(instanceId);
    +			runAsync(new Runnable() {
    +				@Override
    +				public void run() {
    +					for (Map.Entry<ResourceID, WorkerRegistration<WorkerType>> entry : taskExecutors.entrySet()) {
    +						if (entry.getValue().getInstanceID().equals(instanceId)) {
    +							stopWorker(entry.getKey());
    --- End diff --
    
    In the future we should make these ids being composed of each other. Then we should easily obtain the `ResourceID` from the `InstanceID`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #4560: Flink 7077

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4560#discussion_r133980386
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskManagerRegistration.java ---
    @@ -68,7 +68,9 @@ public boolean isIdle() {
     	}
     
     	public void markIdle() {
    -		idleSince = System.currentTimeMillis();
    +		if (!isIdle()) {
    +			idleSince = System.currentTimeMillis();
    +		}
    --- End diff --
    
    good catch!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4560: Flink 7077

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/4560
  
    Thanks for you contribution @EronWright. `MesosResourceManagerTest` seems to fail on Travis.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #4560: Flink 7077

Posted by EronWright <gi...@git.apache.org>.
Github user EronWright commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4560#discussion_r134004745
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskManagerRegistration.java ---
    @@ -68,7 +68,9 @@ public boolean isIdle() {
     	}
     
     	public void markIdle() {
    -		idleSince = System.currentTimeMillis();
    +		if (!isIdle()) {
    +			idleSince = System.currentTimeMillis();
    +		}
    --- End diff --
    
    The reason that the SM idleness unit tests didn't catch this is, `markIdle` is called when a slot report is received, which in practice occurs continuously but is tough to simulate.  Just making clear that there are idle tests but I didn't see an obvious way to improve them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #4560: Flink 7077

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/4560


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #4560: Flink 7077

Posted by EronWright <gi...@git.apache.org>.
Github user EronWright commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4560#discussion_r134003593
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java ---
    @@ -960,12 +965,26 @@ public void handleError(final Exception exception) {
     
     		@Override
     		public void releaseResource(InstanceID instanceId) {
    -			stopWorker(instanceId);
    +			runAsync(new Runnable() {
    +				@Override
    +				public void run() {
    +					for (Map.Entry<ResourceID, WorkerRegistration<WorkerType>> entry : taskExecutors.entrySet()) {
    +						if (entry.getValue().getInstanceID().equals(instanceId)) {
    +							stopWorker(entry.getKey());
    --- End diff --
    
    I was thinking along those same lines - the slot manager deals with `InstanceID` mostly, and its log lines are tough to correlate with lower-level resource manager information that is `ResourceID` based.   It would be a nice improvement to make `InstanceID` a composite key that included `ResourceID`.
    
    Regardless, I think the `stopWorker` method should use `ResourceID`, because `InstanceID` isn't a concept exposed to the RM subclasses.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4560: Flink 7077

Posted by EronWright <gi...@git.apache.org>.
Github user EronWright commented on the issue:

    https://github.com/apache/flink/pull/4560
  
    @tillrohrmann fixed the test issue and other noted issues.  Once #4555 is in, this is good to go.
    
    The test issue was a race in RM acquiring leadership vs RPC call.   Added some code to `TestLeaderElectionService` to wait for leader ack and updated the base RM and Mesos RM to use it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---