Posted to yarn-issues@hadoop.apache.org by "Huangkaixuan (JIRA)" <ji...@apache.org> on 2017/03/06 07:54:32 UTC

[jira] [Comment Edited] (YARN-6289) yarn got little data locality

    [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896854#comment-15896854 ] 

Huangkaixuan edited comment on YARN-6289 at 3/6/17 7:54 AM:
------------------------------------------------------------

Experiment 1:
- 7-node Hadoop cluster (1 master, 6 DataNodes/NodeManagers):
  Host:	Simple37	Simple27	Simple28	Simple30	Simple31	Simple32	Simple33
  Role:	Master	Node1	Node2	Node3	Node4	Node5	Node6
- Configure HDFS with replication factor 2
- The file has a single block in HDFS
- Configure Spark to use dynamic allocation
- Configure YARN with both the MapReduce shuffle service and the Spark shuffle service
- Add a single small file (a few bytes) to HDFS
- Run wordcount on the file (using Spark/MapReduce)
- Check whether the single map-stage task was scheduled on a node holding the data (a sketch for checking the block's replica locations follows this list)
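
As a way to double-check the Data location column in the results below, the block's replica hosts can be read with the HDFS FileSystem API. The following is only a minimal sketch (the class name and default input path are placeholders), assuming the Hadoop 2.7.1 client jars and the cluster's configuration files are on the classpath:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
    public static void main(String[] args) throws Exception {
        // Placeholder path of the single-block wordcount input; pass the real path as args[0].
        Path file = new Path(args.length > 0 ? args[0] : "/user/test/wordcount-input.txt");

        // Picks up the cluster configuration (core-site.xml, hdfs-site.xml) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(file);
        // A single-block file with replication factor 2 yields one BlockLocation with two hosts.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + Arrays.toString(loc.getHosts()));
        }
        fs.close();
    }
}

For a quick check without code, "hdfs fsck <path> -files -blocks -locations" prints the same replica locations.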

The results, as shown in the web UI, are summarized below.

Result 1:
7-node cluster (1 master, 6 DataNodes/NodeManagers), 2x replication, single-block file
MapReduce wordcount

Run	Data location	Scheduled node	Locality hit	Job time
1	Node3/Node4	Node6	No	20s
2	Node5/Node3	Node6	No	17s
3	Node3/Node5	Node1	No	21s
4	Node2/Node3	Node6	No	18s
5	Node1/Node2	Node1	Yes	15s
6	Node4/Node5	Node3	No	19s
7	Node2/Node3	Node2	Yes	14s
8	Node1/Node4	Node5	No	16s
9	Node1/Node6	Node6	Yes	15s
10	Node3/Node5	Node4	No	17s
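
For reference, the hit rate above (3 of 10 runs) is roughly what locality-unaware placement would produce. As a back-of-the-envelope check, assuming the single map task's node is chosen uniformly at random among the 6 NodeManagers and the block has 2 replicas:

P(hit) = 2/6 = 1/3 ≈ 33%, versus the observed 3/10 = 30%.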

Result 2:
7-node cluster (1 master, 6 DataNodes/NodeManagers), 2x replication, single-block file
Spark wordcount

Run	Data location	Scheduled node	Locality hit	Job time
1	Node3/Node4	Node4	Yes	24s
2	Node2/Node3	Node5	No	30s
3	Node3/Node5	Node4	No	35s
4	Node2/Node3	Node2	Yes	24s
5	Node1/Node2	Node4	No	26s
6	Node4/Node5	Node2	No	25s
7	Node2/Node3	Node4	No	27s
8	Node1/Node4	Node1	Yes	22s
9	Node1/Node6	Node2	No	23s
10	Node1/Node2	Node4	No	33s
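
The Spark run also hits only 3 of 10 times, again close to the ~33% baseline estimated above. In the Spark web UI the same information appears as the per-task Locality Level on the stage page (NODE_LOCAL for the runs marked Yes, typically ANY otherwise).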

was (Author: huangkx6810):
The experiment details are as follows:
YARN experiments for data locality
I ran the experiments on a 7-node cluster (1 master, 6 DataNodes/NodeManagers) with 2x replication.
Hardware configuration:
CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz, 15M cache, 6 cores / 12 threads
Memory: 128GB (16 x 8GB) 1600MHz
Disk: 2 x 600GB 3.5-inch, RAID-1
Network bandwidth: 968Mb/s
Software configuration:
Spark 1.6.2, Hadoop 2.7.1

> yarn got little data locality
> -----------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz, 15M cache, 6 cores / 12 threads
> Memory: 128GB (16 x 8GB) 1600MHz
> Disk: 2 x 600GB 3.5-inch, RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark 1.6.2, Hadoop 2.7.1
>            Reporter: Huangkaixuan
>            Priority: Minor
>
> When I ran this experiment with both Spark and MapReduce wordcount on the file, I noticed that the job did not get data locality every time. Task placement appeared random, even though no other job was running on the cluster. I expected the single map task to always be placed on a machine holding the data block, but that did not happen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org