Posted to yarn-issues@hadoop.apache.org by "Huangkaixuan (JIRA)" <ji...@apache.org> on 2017/03/06 07:54:32 UTC
[jira] [Comment Edited] (YARN-6289) yarn got little data locality
[ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896854#comment-15896854 ]
Huangkaixuan edited comment on YARN-6289 at 3/6/17 7:54 AM:
------------------------------------------------------------
Experiment1:
7 node Hadoop cluster (1 master, 6 data nodes/node managers)
Host:  Simple37  Simple27  Simple28  Simple30  Simple31  Simple32  Simple33
Role:  Master    Node1     Node2     Node3     Node4     Node5     Node6
Configure HDFS with a replication factor of 2
The file occupies a single block in HDFS
Configure Spark to use dynamic allocation
Configure YARN with both the MapReduce shuffle service and the Spark shuffle service (a configuration sketch follows this list)
Add a single small file (a few bytes) to HDFS
Run wordcount on the file (once with MapReduce, once with Spark)
Inspect whether the single map-stage task was scheduled on a node holding the data block
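For reference, a minimal sketch of the configuration and commands behind these steps. The property names are the standard ones for Hadoop 2.7.1 and Spark 1.6.2, but the file names, paths, and jar locations here are illustrative assumptions, not the actual settings from this cluster:

    hdfs-site.xml:
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>

    yarn-site.xml (enable both shuffle services on the NodeManagers):
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle,spark_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
        <value>org.apache.spark.network.yarn.YarnShuffleService</value>
      </property>

    spark-defaults.conf (dynamic allocation requires the external shuffle service):
      spark.dynamicAllocation.enabled  true
      spark.shuffle.service.enabled    true

    # put a small single-block file into HDFS (words.txt is a placeholder name)
    hdfs dfs -put words.txt /tmp/words.txt
    # MapReduce wordcount
    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
        wordcount /tmp/words.txt /tmp/wordcount-out
    # Spark wordcount on YARN
    spark-submit --master yarn --class org.apache.spark.examples.JavaWordCount \
        $SPARK_HOME/lib/spark-examples-*.jar /tmp/words.txt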
The results, taken from the web UI, are as follows:
Result1:
7-node cluster (1 master, 6 data nodes/node managers), 2x replication, single-block file
MapReduce wordcount:
Run  Data location  Scheduled node  Hit  Time
  1  Node3/Node4    Node6           No   20s
  2  Node5/Node3    Node6           No   17s
  3  Node3/Node5    Node1           No   21s
  4  Node2/Node3    Node6           No   18s
  5  Node1/Node2    Node1           Yes  15s
  6  Node4/Node5    Node3           No   19s
  7  Node2/Node3    Node2           Yes  14s
  8  Node1/Node4    Node5           No   16s
  9  Node1/Node6    Node6           Yes  15s
 10  Node3/Node5    Node4           No   17s
7-node cluster (1 master, 6 data nodes/node managers), 2x replication, single-block file
Spark wordcount:
Run  Data location  Scheduled node  Hit  Time
  1  Node3/Node4    Node4           Yes  24s
  2  Node2/Node3    Node5           No   30s
  3  Node3/Node5    Node4           No   35s
  4  Node2/Node3    Node2           Yes  24s
  5  Node1/Node2    Node4           No   26s
  6  Node4/Node5    Node2           No   25s
  7  Node2/Node3    Node4           No   27s
  8  Node1/Node4    Node1           Yes  22s
  9  Node1/Node6    Node2           No   23s
 10  Node1/Node2    Node4           No   33s
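A quick sanity check on those numbers (my arithmetic, not in the original comment): with 2 replicas spread over 6 data nodes, a scheduler that placed the single map task uniformly at random would be node-local with probability 2/6, about 33%. Both tables show 3 hits out of 10 runs (30%), so the observed hit rate is indistinguishable from random placement, which is exactly the complaint in this issue.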
> yarn got little data locality
> -----------------------------
>
> Key: YARN-6289
> URL: https://issues.apache.org/jira/browse/YARN-6289
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2 Hadoop-2.7.1
> Reporter: Huangkaixuan
> Priority: Minor
>
> When I ran this experiment with both the Spark and the MapReduce wordcount on the file, I noticed that the job did not get data locality every time. Task placement appeared random, even though no other job was running on the cluster. I expected the task to always be placed on a machine holding the data block, but that did not happen.
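An editorial pointer, not part of the original report: both schedulers deliberately trade locality for latency via delay scheduling, and a single-block job gives them almost nothing to wait for, so the behavior above is at least consistent with the default settings. The property names below are the real ones for Hadoop 2.7.1 and Spark 1.6.2; the values shown are, to the best of my knowledge, the shipped defaults, and whether they explain this particular cluster's behavior is an assumption, not a confirmed diagnosis:

    capacity-scheduler.xml:
      <property>
        <!-- missed scheduling opportunities the CapacityScheduler tolerates
             before relaxing a node-local request to rack-local -->
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <value>40</value>
      </property>

    spark-defaults.conf:
      # how long Spark's task scheduler waits for a node-local slot
      # before falling back to a less local one
      spark.locality.wait  3s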