Posted to yarn-issues@hadoop.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2019/02/18 17:10:00 UTC

[jira] [Commented] (YARN-8821) GPU hierarchy/topology scheduling support

    [ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771241#comment-16771241 ] 

Hadoop QA commented on YARN-8821:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 15s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 57s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 57s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 3 new + 26 unchanged - 0 fixed = 29 total (was 26) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 23s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 17 new + 0 unchanged - 0 fixed = 17 total (was 0) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 33s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 5 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m  5s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 30s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m 30s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959138/YARN-8821-trunk.006.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 805cf9fa2ee6 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f2fb653 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| javac | https://builds.apache.org/job/PreCommit-YARN-Build/23437/artifact/out/diff-compile-javac-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/23437/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/23437/artifact/out/whitespace-tabs.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23437/testReport/ |
| asflicense | https://builds.apache.org/job/PreCommit-YARN-Build/23437/artifact/out/patch-asflicense-problems.txt |
| Max. process+thread count | 412 (vs. ulimit of 10000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23437/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> GPU hierarchy/topology scheduling support
> -----------------------------------------
>
>                 Key: YARN-8821
>                 URL: https://issues.apache.org/jira/browse/YARN-8821
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8821-trunk.001.patch, YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, YARN-8821-trunk.006.patch
>
>
> h2. Background
> GPU topology affects performance. It was first discussed in YARN-7481, but we'd like to move the related discussion here.
> Please note that YARN-8851 will provide a pluggable device framework that lets a plugin supply a custom scheduler. Based on that framework, the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements the topology algorithm as follows:
> *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" to build a hash map whose keys are pairs of GPUs and whose values are the communication cost between the two GPUs. The map looks like \{"0 - 1"=> 2, "0 - 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The cost is set based on the connection type.
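
Step 1 can be sketched roughly as follows. This is not the patch's code: the class name GpuPairCosts, the method names, and above all the numeric weights per connection type are assumptions made for illustration. The link labels (PIX, PXB, PHB, NODE, SYS, NV#) are taken from the legend that recent nvidia-smi versions print, and the sketch assumes GPU rows and columns are numbered contiguously from 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GpuPairCosts {

  // Hypothetical weights (assumption, not from the patch): lower means faster.
  static int costOf(String link) {
    if (link.startsWith("NV")) {
      return 1; // NVLink, printed as NV1, NV2, ...
    }
    switch (link) {
      case "PIX": return 2;
      case "PXB": return 4;
      case "PHB": return 8;
      case "NODE": return 16;
      case "SYS": return 32;
      default: throw new IllegalArgumentException("Unknown link type: " + link);
    }
  }

  // Parses the text of "nvidia-smi topo -m" into {"i - j" => cost} for i < j.
  static Map<String, Integer> parse(String topoOutput) {
    Map<String, Integer> costs = new LinkedHashMap<>();
    int gpuCount = -1; // unknown until the header row has been read
    for (String line : topoOutput.split("\\R")) {
      String[] tokens = line.trim().split("\\s+");
      if (!tokens[0].startsWith("GPU")) {
        continue; // legend, blank line, or a non-GPU device row
      }
      if (gpuCount < 0) {
        // Header row: count the GPU columns.
        gpuCount = 0;
        for (String t : tokens) {
          if (t.startsWith("GPU")) {
            gpuCount++;
          }
        }
        continue;
      }
      int i = Integer.parseInt(tokens[0].substring(3));
      // tokens[0] is the row label, so column j sits at tokens[j + 1].
      for (int j = i + 1; j < gpuCount; j++) {
        costs.put(i + " - " + j, costOf(tokens[j + 1]));
      }
    }
    return costs;
  }

  public static void main(String[] args) {
    String sample = String.join("\n",
        "        GPU0    GPU1    GPU2    CPU Affinity",
        "GPU0     X      PIX     PHB     0-15",
        "GPU1    PIX      X      PHB     0-15",
        "GPU2    PHB     PHB      X      0-15");
    System.out.println(parse(sample));
  }
}
```

Only the upper triangle of the matrix is read, since the cost of "0 - 1" and "1 - 0" is the same.
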
> *Step 2*. It then constructs a _+cost table+_ that caches every combination of GPUs together with its total communication cost. The cost table is a map structured like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}
> {code}
> The key of the outer map is the number of GPUs; its value is a map whose key is a combination of that many GPUs and whose value is the calculated communication cost of that combination. The cost is the sum of the costs of all distinct GPU pairs in the combination. For instance, the total cost of GPUs [0,1,2] is the sum of the costs of "0 - 1", "0 - 2" and "1 - 2". Each pair cost comes from the map built in step 1.
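
A minimal sketch of the step 2 cost table, assuming the pair map from step 1 uses "i - j" keys with i < j and contains every pair. The class name GpuCostTable and its methods are hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GpuCostTable {

  // Returns {k => {combination of k GPUs => summed pair cost}} for k = 2..gpuCount.
  static Map<Integer, Map<List<Integer>, Integer>> build(
      int gpuCount, Map<String, Integer> pairCosts) {
    Map<Integer, Map<List<Integer>, Integer>> table = new TreeMap<>();
    for (int k = 2; k <= gpuCount; k++) {
      table.put(k, new LinkedHashMap<>());
    }
    collect(0, gpuCount, new ArrayList<>(), 0, pairCosts, table);
    return table;
  }

  // Depth-first enumeration of GPU subsets in ascending index order.
  // 'cost' is the sum over all distinct pairs already in 'chosen'.
  private static void collect(
      int next, int gpuCount, List<Integer> chosen, int cost,
      Map<String, Integer> pairCosts,
      Map<Integer, Map<List<Integer>, Integer>> table) {
    if (chosen.size() >= 2) {
      table.get(chosen.size()).put(new ArrayList<>(chosen), cost);
    }
    for (int g = next; g < gpuCount; g++) {
      // Adding GPU g contributes one new pair per GPU already chosen.
      int added = 0;
      for (int c : chosen) {
        added += pairCosts.get(c + " - " + g); // throws if a pair is missing
      }
      chosen.add(g);
      collect(g + 1, gpuCount, chosen, cost + added, pairCosts, table);
      chosen.remove(chosen.size() - 1);
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> pairCosts = new HashMap<>();
    pairCosts.put("0 - 1", 2);
    pairCosts.put("0 - 2", 4);
    pairCosts.put("1 - 2", 4);
    System.out.println(build(3, pairCosts));
  }
}
```

The table holds every subset of size 2 or more, so it grows exponentially with the GPU count. That is manageable for the 8-GPU nodes discussed here (247 entries), and it is the reason the table is built once and cached.
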
> *Step 3*. After the cost table is built, when allocating GPUs based on topology, we provide two policies, which a container can select through the environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or "SPREAD". "PACK" prefers faster GPU-GPU communication. "SPREAD" prefers faster CPU-GPU communication (since the GPUs do not share the same bus to the CPU). The key difference between the two policies is the sort order of the inner map in the cost table. For instance, assume 2 GPUs are requested. costTable.get(2) returns a map containing all combinations of two GPUs and their costs. If the policy is "PACK", we sort the map by cost in ascending order, so the first entry is the combination with the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending order and take the first entry, which has the highest GPU-GPU cost and therefore the lowest CPU-GPU cost.
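
The policy choice in step 3 comes down to the sort direction, which the sketch below illustrates. The class TopoPolicyPicker is hypothetical; it takes the policy as a plain string (the value a container would set in NVIDIA_TOPO_POLICY) and does not model which GPUs are currently free.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopoPolicyPicker {

  // 'candidates' is one inner map of the cost table, e.g. costTable.get(2).
  static List<Integer> pick(Map<List<Integer>, Integer> candidates, String policy) {
    // PACK: ascending cost, so the cheapest GPU-GPU combination comes first.
    Comparator<Map.Entry<List<Integer>, Integer>> order = Map.Entry.comparingByValue();
    if ("SPREAD".equals(policy)) {
      // SPREAD: descending cost, so the most spread-out combination comes first.
      order = order.reversed();
    } else if (!"PACK".equals(policy)) {
      throw new IllegalArgumentException("Unknown policy: " + policy);
    }
    return candidates.entrySet().stream()
        .sorted(order)
        .map(Map.Entry::getKey)
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("No candidates"));
  }

  public static void main(String[] args) {
    Map<List<Integer>, Integer> pairs = new LinkedHashMap<>();
    pairs.put(Arrays.asList(0, 1), 2);
    pairs.put(Arrays.asList(0, 2), 4);
    pairs.put(Arrays.asList(1, 2), 8);
    System.out.println("PACK:   " + pick(pairs, "PACK"));
    System.out.println("SPREAD: " + pick(pairs, "SPREAD"));
  }
}
```

Sorting the whole inner map mirrors the description above; since only the first entry is used, a single pass for the minimum or maximum would give the same result.
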
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK policy) has been done, based on performance tests on an AWS EC2 instance with 8 GPU cards (P3). Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU combination achieves a *5% to 185%* *performance gain* across the test cases, which vary factors including CNN model, batch size, GPU subset, etc.
> 2. The "inception3" and "resnet50" networks do not seem to be topology sensitive. For them, topology scheduling can potentially get only *about 10%* speedup.
> 3. Our current version of the topology scheduling algorithm can achieve a *3% to 140%* *performance gain. And the algorithm's allocations match the fastest GPUs needed by "vgg16"*.
>     For "alexnet", although the fastest GPU subset is not the one the algorithm allocates, that subset ranks among the algorithm's first 5 candidates and has the same cost as the one the algorithm picks. We may improve this by selecting a random combination among the first 5 candidates, since they have the same cost.
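
The random tie-break suggested for "alexnet" could look roughly like the sketch below. It assumes the PACK policy and generalizes "the first 5 candidates" to every combination tied at the minimum cost; the class TieBreakingPicker is hypothetical and is not part of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class TieBreakingPicker {

  // Picks uniformly at random among the combinations that share the minimum cost.
  static List<Integer> pickRandomCheapest(
      Map<List<Integer>, Integer> candidates, Random random) {
    if (candidates.isEmpty()) {
      throw new IllegalArgumentException("No candidates");
    }
    int minCost = Collections.min(candidates.values());
    List<List<Integer>> cheapest = new ArrayList<>();
    for (Map.Entry<List<Integer>, Integer> e : candidates.entrySet()) {
      if (e.getValue() == minCost) {
        cheapest.add(e.getKey());
      }
    }
    return cheapest.get(random.nextInt(cheapest.size()));
  }

  public static void main(String[] args) {
    Map<List<Integer>, Integer> pairs = new LinkedHashMap<>();
    pairs.put(Arrays.asList(0, 1), 2);
    pairs.put(Arrays.asList(2, 3), 2);
    pairs.put(Arrays.asList(0, 2), 4);
    // Prints either [0, 1] or [2, 3]; [0, 2] is never chosen.
    System.out.println(pickRandomCheapest(pairs, new Random()));
  }
}
```

Randomizing among equal-cost subsets does not make any single allocation faster. It only avoids always handing out the same subset when the cost model cannot tell several subsets apart.
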
>  
> In summary, the GPU topology scheduling algorithm is effective and, after more optimization, can potentially achieve a 5% to 185% performance gain.
>  *That is at most about 3X (a 185% gain is 2.85X) compared with a random GPU scheduling algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
