Posted to issues@flink.apache.org by kl0u <gi...@git.apache.org> on 2016/02/04 19:15:36 UTC
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
GitHub user kl0u opened a pull request:
https://github.com/apache/flink/pull/1588
FLINK-2213 Makes the number of vcores per YARN container configurable.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kl0u/flink vcores_param
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1588.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1588
----
commit 91d2dc905e5a82b9812dbbe172c9a267eff27ad6
Author: Kostas Kloudas <kk...@gmail.com>
Date: 2016-02-04T14:01:58Z
Makes the YARN_VCORES configurable.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-180789530
The fallback behavior here is now different than the original behavior.
I think it would be good to keep the fallback the same as before: unless the number of vcores is configured explicitly, use the number of slots as the number of vcores if possible, and otherwise use one vcore.
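A minimal sketch of that fallback order, for illustration only. The class and method names are hypothetical and not taken from the Flink code base; the only name from this thread is the `yarn.containers.vcores` setting, represented here by the first parameter.

```java
import java.util.OptionalInt;

/** Hypothetical helper; names are illustrative, not Flink's actual implementation. */
final class VcoreFallback {

    private VcoreFallback() {}

    /**
     * Resolves the number of vcores to request per YARN container.
     *
     * @param configuredVcores    explicit setting (e.g. yarn.containers.vcores), empty if unset
     * @param slotsPerTaskManager number of slots per TaskManager, empty if unset
     */
    static int resolveVcores(OptionalInt configuredVcores, OptionalInt slotsPerTaskManager) {
        // 1. An explicit configuration always wins.
        if (configuredVcores.isPresent()) {
            return configuredVcores.getAsInt();
        }
        // 2. Otherwise fall back to the number of slots, if a usable value is known.
        if (slotsPerTaskManager.isPresent() && slotsPerTaskManager.getAsInt() > 0) {
            return slotsPerTaskManager.getAsInt();
        }
        // 3. Last resort: a single vcore.
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(resolveVcores(OptionalInt.of(4), OptionalInt.of(2)));     // prints 4
        System.out.println(resolveVcores(OptionalInt.empty(), OptionalInt.of(2)));   // prints 2
        System.out.println(resolveVcores(OptionalInt.empty(), OptionalInt.empty())); // prints 1
    }
}
```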
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-186396847
In my private tests and here, `YARNSessionFIFOITCase.testQueryCluster` failed with a timeout. Something has made this test unstable.
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on a diff in the pull request:
https://github.com/apache/flink/pull/1588#discussion_r53296170
--- Diff: flink-yarn-tests/src/main/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java ---
@@ -436,7 +430,7 @@ public void perJobYarnCluster() {
runWithArgs(new String[]{"run", "-m", "yarn-cluster",
"-yj", flinkUberjar.getAbsolutePath(), "-yt", flinkLibFolder.getAbsolutePath(),
"-yn", "1",
- "-ys", "2", //test that the job is executed with a DOP of 2
+ "-ys", "1", //test that the job is executed with a DOP of 2
--- End diff --
Same here.
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-180753220
I hope it's a coincidence that the YARN tests failed in this PR. If they fail again after your next push, we have to check whether your changes caused the failure.
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-186835009
The expected output of the test is the following:
```
Test testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) is running.
--------------------------------------------------------------------------------
20:12:25,692 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at testing-worker-linux-docker-8281db7b-3371-linux-16/172.17.6.245:8032
20:12:25,701 INFO org.apache.hadoop.yarn.webapp.WebApps - Registered webapp guice modules
20:12:25,712 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Registering with RM using finished containers :[]
20:12:25,712 INFO org.apache.hadoop.yarn.util.RackResolver - Resolved testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org to /default-rack
20:12:25,712 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService - NodeManager from node testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org(cmPort: 59877 httpPort: 39611) registered with capability: <memory:4096, vCores:666>, assigned nodeId testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org:59877
20:12:25,712 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl - testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org:59877 Node Transitioned from NEW to RUNNING
20:12:25,712 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager - Rolling master-key for container-tokens, got key with id 1069868518
20:12:25,712 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM - Rolling master-key for nm-tokens, got key with id :-1168902475
20:12:25,712 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Registered with ResourceManager as testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org:59877 with total resource of <memory:4096, vCores:666>
20:12:25,712 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Notifying ContainerManager to unblock new container-requests
20:12:25,939 INFO org.apache.flink.yarn.YARNSessionFIFOITCase - Starting testQueryCluster()
20:12:25,939 INFO org.apache.flink.yarn.YarnTestBase - Running with args [-q]
20:12:25,993 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
20:12:26,940 INFO org.apache.flink.yarn.YarnTestBase - Found expected output in redirected streams
20:12:26,940 INFO org.apache.flink.yarn.YarnTestBase - RunWithArgs: request runner to stop
20:12:26,940 WARN org.apache.flink.yarn.YarnTestBase - RunWithArgs runner stopped.
20:12:26,940 INFO org.apache.flink.yarn.YarnTestBase - Sending stdout content through logger:
NodeManagers in the Cluster 2|Property |Value
+---------------------------------------+
|NodeID |testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org:59877
|Memory |4096 MB
|vCores |666
|HealthReport |
|Containers |0
+---------------------------------------+
|NodeID |testing-worker-linux-docker-8281db7b-3371-linux-16.prod.travis-ci.org:44161
|Memory |4096 MB
|vCores |666
|HealthReport |
|Containers |0
+---------------------------------------+
Summary: totalMemory 8192 totalCores 1332
20:12:26,940 INFO org.apache.flink.yarn.YarnTestBase - Sending stderr content through logger:
20:12:26,940 INFO org.apache.flink.yarn.YarnTestBase - Test was successful
20:12:26,940 INFO org.apache.flink.yarn.YARNSessionFIFOITCase - Finished testQueryCluster()
20:12:27,443 INFO org.apache.flink.yarn.YARNSessionFIFOITCase -
```
but when it's failing, it outputs:
```
Test testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) is running.
--------------------------------------------------------------------------------
21:50:20,684 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at testing-worker-linux-docker-1305e497-3358-linux-10/172.17.4.146:8032
21:50:22,041 INFO org.apache.flink.yarn.YARNSessionFIFOITCase - Starting testQueryCluster()
21:50:22,041 INFO org.apache.flink.yarn.YarnTestBase - Running with args [-q]
21:50:22,212 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
21:50:22,559 INFO org.mortbay.log - Started HttpServer2$SelectChannelConnectorWithSafeStartup@testing-worker-linux-docker-1305e497-3358-linux-10:45772
21:50:22,576 INFO org.apache.hadoop.yarn.webapp.WebApps - Web app /node started at 45772
21:50:22,577 INFO org.mortbay.log - Started HttpServer2$SelectChannelConnectorWithSafeStartup@testing-worker-linux-docker-1305e497-3358-linux-10:58616
21:50:22,577 INFO org.apache.hadoop.yarn.webapp.WebApps - Web app /node started at 58616
21:50:22,664 INFO org.apache.hadoop.yarn.webapp.WebApps - Registered webapp guice modules
21:50:22,665 INFO org.apache.hadoop.yarn.webapp.WebApps - Registered webapp guice modules
21:50:22,691 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Sending out 0 NM container statuses: []
21:50:22,696 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Registering with RM using containers :[]
21:50:22,746 INFO org.apache.hadoop.yarn.util.RackResolver - Resolved testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org to /default-rack
21:50:22,753 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl - testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org:42997 Node Transitioned from NEW to RUNNING
21:50:22,758 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Sending out 0 NM container statuses: []
21:50:22,758 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Registering with RM using containers :[]
21:50:22,758 INFO org.apache.hadoop.yarn.util.RackResolver - Resolved testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org to /default-rack
21:50:22,753 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService - NodeManager from node testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org(cmPort: 42997 httpPort: 58616) registered with capability: <memory:4096, vCores:666>, assigned nodeId testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org:42997
21:50:22,758 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl - testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org:50661 Node Transitioned from NEW to RUNNING
21:50:22,759 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager - Rolling master-key for container-tokens, got key with id -1878662703
21:50:22,758 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService - NodeManager from node testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org(cmPort: 50661 httpPort: 45772) registered with capability: <memory:4096, vCores:666>, assigned nodeId testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org:50661
21:50:22,760 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager - Rolling master-key for container-tokens, got key with id -1878662703
21:50:22,760 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM - Rolling master-key for nm-tokens, got key with id :-748258147
21:50:22,761 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Registered with ResourceManager as testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org:42997 with total resource of <memory:4096, vCores:666>
21:50:22,761 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Notifying ContainerManager to unblock new container-requests
21:50:22,761 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM - Rolling master-key for nm-tokens, got key with id :-748258147
21:50:22,762 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Registered with ResourceManager as testing-worker-linux-docker-1305e497-3358-linux-10.prod.travis-ci.org:50661 with total resource of <memory:4096, vCores:666>
21:50:22,762 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl - Notifying ContainerManager to unblock new container-requests
21:50:23,077 INFO org.apache.flink.yarn.YarnTestBase - Runner stopped earlier than expected with return value = 0
21:50:23,077 INFO org.apache.flink.yarn.YarnTestBase - Sending stdout content through logger:
NodeManagers in the Cluster 0|Property |Value
+---------------------------------------+
Summary: totalMemory 0 totalCores 0
21:50:23,077 INFO org.apache.flink.yarn.YarnTestBase - Sending stderr content through logger:
21:50:23,580 ERROR org.apache.flink.yarn.YARNSessionFIFOITCase -
--------------------------------------------------------------------------------
Test testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) failed with:
java.lang.AssertionError: During the timeout period of 180 seconds the expected string did not show up
```
Looking through the logs, I think the issue is the following:
When `testQueryCluster()` is executed, the NodeManagers are not yet registered with YARN. That's why the number of NodeManagers in the test is 0.
The issue occurs after your change because the test execution order changed. Before your change, `testQueryCluster()` was executed after other tests, so the NMs were always registered.
Since you removed many tests from the FIFOITCase, `testQueryCluster()` is now the first test to be executed. Apparently, the test setup does not wait until all NMs are connected.
I think you can solve the issue using `waitForNodeManagersToConnect` from the MiniYARNCluster.
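To illustrate the idea of blocking the test setup until the NodeManagers have registered, here is a generic polling sketch. It is a stand-in written for this thread, not the `MiniYARNCluster` method itself and it does not call any Hadoop API; the class name, method name, and parameters are all hypothetical.

```java
import java.util.function.IntSupplier;

/** Hypothetical polling helper; a generic stand-in that does not use Hadoop classes. */
final class NodeManagerWait {

    private NodeManagerWait() {}

    /**
     * Polls the supplied counter until it reports at least {@code expected}
     * registered NodeManagers or the timeout elapses.
     *
     * @return true if the expected count was reached, false on timeout or interruption
     */
    static boolean waitForCount(IntSupplier registeredCount, int expected,
                                long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (registeredCount.getAsInt() >= expected) {
                return true;
            }
            // Overflow-safe comparison against the deadline.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test setup, the supplier would report the number of NodeManagers currently known to the ResourceManager, and the setup would fail fast if the method returns false instead of letting the first test run against an empty cluster.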
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on a diff in the pull request:
https://github.com/apache/flink/pull/1588#discussion_r52101633
--- Diff: docs/setup/config.md ---
@@ -211,6 +211,8 @@ The parameters define the behavior of tasks that create result files.
yarn.application-master.env.LD_LIBRARY_PATH: "/usr/lib/native"
+- `yarn.containers.vcores` The number of virtual cores (vcores) per YARN container. By default the number of `vcores` is set equal to the maximum between the number of slots per TaskManager, and the number of cores available to the Java runtime.
--- End diff --
What's the rationale for using max(slots, #cpus)?
I think in most cases users use fewer slots than CPU cores available on the physical machine.
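A small sketch of why this matters, assuming the default in the documented line is computed as described there. The class and method names are hypothetical; `Runtime.getRuntime().availableProcessors()` is the standard Java call for "the number of cores available to the Java runtime".

```java
/** Hypothetical illustration of the proposed default; not Flink's actual code. */
final class VcoreDefaults {

    private VcoreDefaults() {}

    /** Default as described in the documentation line under review: max(slots, #cpus). */
    static int maxOfSlotsAndCpus(int slotsPerTaskManager, int availableCpus) {
        return Math.max(slotsPerTaskManager, availableCpus);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int slots = 2;
        // When slots < cpus, the request is driven by the machine size, not by the slots.
        System.out.println("slots=" + slots + ", cpus=" + cpus
                + ", requested vcores=" + maxOfSlotsAndCpus(slots, cpus));
    }
}
```

With 2 slots on a 16-core machine, for example, this default would request 16 vcores per container, which is the case the question above is about.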
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/flink/pull/1588
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on a diff in the pull request:
https://github.com/apache/flink/pull/1588#discussion_r53296134
--- Diff: flink-yarn-tests/src/main/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java ---
@@ -114,9 +108,9 @@ public void testClientStartup() {
"-n", "1",
"-jm", "768",
"-tm", "1024",
- "-s", "2" // Test that 2 slots are started on the TaskManager.
+ "-s", "1" // Test that 1 slots are started on the TaskManager.
},
- "Number of connected TaskManagers changed to 1. Slots available: 2", null, RunTypes.YARN_SESSION, 0);
+ "Number of connected TaskManagers changed to 1. Slots available: 1", null, RunTypes.YARN_SESSION, 0);
--- End diff --
Can you set the number of slots back to 2 and move the test to the other class as well?
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by kl0u <gi...@git.apache.org>.
Github user kl0u commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-180847040
Thanks a lot for the comments, @rmetzger and @StephanEwen.
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-187606854
Merging ...
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-180753150
I didn't test this myself, but this diff could be sufficient for testing your change:
```diff
diff --git a/flink-yarn-tests/src/main/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java b/flink-yarn-tests/src/main/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java
index 8c9a9c7..999b5be 100644
--- a/flink-yarn-tests/src/main/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java
+++ b/flink-yarn-tests/src/main/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java
@@ -180,6 +180,7 @@ public class YARNSessionFIFOITCase extends YarnTestBase {
"-n", "1",
"-jm", "768",
"-tm", "1024",
+ "-s", "3", // set the slots 3 to check if the vCores are set properly!
"-nm", "customName",
"-Dfancy-configuration-value=veryFancy",
"-Dyarn.maximum-failed-containers=3"},
@@ -268,6 +269,7 @@ public class YARNSessionFIFOITCase extends YarnTestBase {
String command = Joiner.on(" ").join(entry.getValue().getLaunchContext().getCommands());
if(command.contains(YarnTaskManagerRunner.class.getSimpleName())) {
taskManagerContainer = entry.getKey();
+ Assert.assertEquals(3,entry.getValue().getResource().getVirtualCores());
nodeManager = nm;
nmIdent = new NMTokenIdentifier(taskManagerContainer.getApplicationAttemptId(), null, "",0);
// allow myself to do stuff with the container
```
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by rmetzger <gi...@git.apache.org>.
Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-186364227
I started some tests on my travis as well to see whether the one YARN test failure is a coincidence: https://travis-ci.org/rmetzger/flink/builds/110454166
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by kl0u <gi...@git.apache.org>.
Github user kl0u commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-187700884
Thanks @rmetzger
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by kl0u <gi...@git.apache.org>.
Github user kl0u commented on the pull request:
https://github.com/apache/flink/pull/1588#issuecomment-186832973
@rmetzger I am trying to reproduce it but so far I cannot. I will keep you posted.
---
[GitHub] flink pull request: FLINK-2213 Makes the number of vcores per YARN...
Posted by kl0u <gi...@git.apache.org>.
Github user kl0u commented on a diff in the pull request:
https://github.com/apache/flink/pull/1588#discussion_r52106077
--- Diff: docs/setup/config.md ---
@@ -211,6 +211,8 @@ The parameters define the behavior of tasks that create result files.
yarn.application-master.env.LD_LIBRARY_PATH: "/usr/lib/native"
+- `yarn.containers.vcores` The number of virtual cores (vcores) per YARN container. By default the number of `vcores` is set equal to the maximum between the number of slots per TaskManager, and the number of cores available to the Java runtime.
--- End diff --
This was meant as a fallback strategy in case the slots parameter is not set. But @StephanEwen's comment probably solves it. The fallback will be set to the previous strategy where vcores=1.
---