Posted to yarn-issues@hadoop.apache.org by "Huangkaixuan (JIRA)" <ji...@apache.org> on 2017/03/06 08:04:33 UTC
[jira] [Issue Comment Deleted] (YARN-6289) yarn got little data locality
[ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Huangkaixuan updated YARN-6289:
-------------------------------
Comment: was deleted
(was: The experiment details:
7 node cluster (1 master, 6 data nodes/node managers)
HostName  Simple37  Simple27  Simple28  Simple30  Simple31  Simple32  Simple33
Role      Master    Node1     Node2     Node3     Node4     Node5     Node6
Configure HDFS with replication factor 2
File has a single block in HDFS
Configure Spark to use dynamic allocation
Configure Yarn for both mapreduce shuffle service and Spark shuffle service
Add a single small file (a few bytes) to HDFS
Run wordcount on the file (using Spark/MapReduce)
Inspect whether the single map-stage task was scheduled on one of the nodes holding the data
Results of experiment one (run 10 times):
7 node cluster (1 master, 6 data nodes/node managers), 2x replication, 1 block file, MapReduce wordcount

Round No.  Data location  Scheduled node  Hit  Time cost
1          Node3/Node4    Node6           No   20s
2          Node5/Node3    Node6           No   17s
3          Node3/Node5    Node1           No   21s
4          Node2/Node3    Node6           No   18s
5          Node1/Node2    Node1           Yes  15s
6          Node4/Node5    Node3           No   19s
7          Node2/Node3    Node2           Yes  14s
8          Node1/Node4    Node5           No   16s
9          Node1/Node6    Node6           Yes  15s
10         Node3/Node5    Node4           No   17s
7 node cluster (1 master, 6 data nodes/node managers), 2x replication, 1 block file, Spark wordcount

Round No.  Data location  Scheduled node  Hit  Time cost
1          Node3/Node4    Node4           Yes  24s
2          Node2/Node3    Node5           No   30s
3          Node3/Node5    Node4           No   35s
4          Node2/Node3    Node2           Yes  24s
5          Node1/Node2    Node4           No   26s
6          Node4/Node5    Node2           No   25s
7          Node2/Node3    Node4           No   27s
8          Node1/Node4    Node1           Yes  22s
9          Node1/Node6    Node2           No   23s
10         Node1/Node2    Node4           No   33s
)
> yarn got little data locality
> -----------------------------
>
> Key: YARN-6289
> URL: https://issues.apache.org/jira/browse/YARN-6289
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2 Hadoop-2.7.1
> Reporter: Huangkaixuan
> Priority: Minor
>
> When I ran this experiment with both Spark and MapReduce wordcount on the file, I noticed that the job did not achieve data locality every time. Task placement appeared random, even though no other job was running on the cluster. I expected the single map task to always be placed on one of the machines holding a replica of the data block, but that did not happen.
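> One knob worth checking, since the report does not state its value on this cluster (this is a possible explanation, not a confirmed diagnosis): the CapacityScheduler only insists on node-local placement for a configurable number of missed scheduling opportunities before relaxing to rack-local or off-switch assignment, controlled by `yarn.scheduler.capacity.node-locality-delay` in capacity-scheduler.xml. A single-task job generates very few scheduling opportunities, so a low setting makes off-node assignment likely. A sketch of the relevant property:
>
> ```xml
> <!-- capacity-scheduler.xml: number of missed scheduling opportunities the
>      CapacityScheduler tolerates before giving up on node-local placement.
>      The value on the reporter's cluster is not given in this report; a low
>      value (or 0) makes off-node assignment much more likely. -->
> <property>
>   <name>yarn.scheduler.capacity.node-locality-delay</name>
>   <value>40</value>
> </property>
> ```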
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org