Posted to yarn-issues@hadoop.apache.org by "Huangkaixuan (JIRA)" <ji...@apache.org> on 2017/03/06 08:04:33 UTC

[jira] [Issue Comment Deleted] (YARN-6289) yarn got little data locality

     [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huangkaixuan updated YARN-6289:
-------------------------------
    Comment: was deleted

(was: The experiment details:
7 node cluster (1 master, 6 data nodes/node managers)
HostName   Simple37   Simple27   Simple28   Simple30   Simple31   Simple32   Simple33
Role       Master     Node1      Node2      Node3      Node4      Node5      Node6

Configure HDFS with replication factor 2
File has a single block in HDFS
Configure Spark to use dynamic allocation
Configure Yarn for both mapreduce shuffle service and Spark shuffle service
Add a single small file (few bytes) to HDFS
Run wordcount on the file (using Spark/MapReduce; a minimal Spark sketch follows below)
Check whether the single map task was scheduled on a node holding the data
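
For reference, a minimal sketch of the Spark side of the experiment against the Spark 1.6 SparkContext API. The HDFS path and application name are placeholders, not values from the original runs; the replica locations of the block can be checked beforehand with "hdfs fsck <path> -files -blocks -locations".

import org.apache.spark.{SparkConf, SparkContext}

object SingleBlockWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-block-wordcount")
    val sc = new SparkContext(conf)

    // A file of a few bytes occupies a single HDFS block, so the map stage
    // has exactly one task; the question is which node that task lands on.
    val counts = sc.textFile("hdfs:///tmp/locality-test/small.txt")  // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}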

Results of experiment one (run 10 times):
7-node cluster (1 master, 6 data nodes/node managers), 2x replication, 1-block file, MapReduce wordcount

Round  Data location  Scheduled node  Hit  Time cost
1      Node3/Node4    Node6           No   20s
2      Node5/Node3    Node6           No   17s
3      Node3/Node5    Node1           No   21s
4      Node2/Node3    Node6           No   18s
5      Node1/Node2    Node1           Yes  15s
6      Node4/Node5    Node3           No   19s
7      Node2/Node3    Node2           Yes  14s
8      Node1/Node4    Node5           No   16s
9      Node1/Node6    Node6           Yes  15s
10     Node3/Node5    Node4           No   17s

Data-local in 3 of 10 rounds.
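
The Hit column reflects whether the scheduled node held a replica of the block. One way to confirm this for the MapReduce rounds (not necessarily how it was done here) is to read the job's locality counters after each run; a hedged Scala sketch against the Hadoop 2.7 client API, with the job id passed as an argument:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Cluster, JobCounter, JobID}

object CheckMapLocality {
  def main(args: Array[String]): Unit = {
    // args(0) is a job id string, e.g. "job_<timestamp>_0001" (placeholder)
    val cluster = new Cluster(new Configuration())
    val job = cluster.getJob(JobID.forName(args(0)))
    val counters = job.getCounters
    val dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue
    val rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue
    println(s"data-local maps: $dataLocal, rack-local maps: $rackLocal")
  }
}

The same counters ("Data-local map tasks" / "Rack-local map tasks") also appear in the Job Counters block printed at the end of a normal command-line run.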


7-node cluster (1 master, 6 data nodes/node managers), 2x replication, 1-block file, Spark wordcount

Round  Data location  Scheduled node  Hit  Time cost
1      Node3/Node4    Node4           Yes  24s
2      Node2/Node3    Node5           No   30s
3      Node3/Node5    Node4           No   35s
4      Node2/Node3    Node2           Yes  24s
5      Node1/Node2    Node4           No   26s
6      Node4/Node5    Node2           No   25s
7      Node2/Node3    Node4           No   27s
8      Node1/Node4    Node1           Yes  22s
9      Node1/Node6    Node2           No   23s
10     Node1/Node2    Node4           No   33s

Data-local in 3 of 10 rounds.
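
For the Spark rounds, the scheduled host and locality level of the single map task can be observed from the driver with a SparkListener; this is a hedged sketch of how the Scheduled node / Hit columns could be collected, not necessarily how they actually were:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs the host and locality level (e.g. NODE_LOCAL, RACK_LOCAL, ANY)
// of every finished task.
class LocalityListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    println(s"stage ${taskEnd.stageId} task ${info.taskId}: " +
      s"host=${info.host} locality=${info.taskLocality}")
  }
}

// Register on the driver before running the job:
//   sc.addSparkListener(new LocalityListener)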



)

> yarn got little data locality
> -----------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread 
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2, Hadoop-2.7.1
>            Reporter: Huangkaixuan
>            Priority: Minor
>
> When I ran this experiment with both the Spark and the MapReduce wordcount on the file, I noticed that the job did not achieve data locality every time. Task placement appeared random, even though no other job was running on the cluster. I expected the single map task to always be placed on the machine holding the data block, but that did not happen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org