Posted to yarn-issues@hadoop.apache.org by "Huangkaixuan (JIRA)" <ji...@apache.org> on 2017/03/10 08:16:04 UTC

[jira] [Issue Comment Deleted] (YARN-6289) Fail to achieve data locality when running MapReduce and Spark on HDFS

     [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huangkaixuan updated YARN-6289:
-------------------------------
    Comment: was deleted

(was: Thanks [~leftnoteasy]
1. MR can get the locations of a block through FileSystem.getFileBlockLocations. MapReduce applications normally use FileSystem.getFileBlockLocations to compute splits, but I have not seen the default YARN scheduling policy (FIFO) make use of that information (see the sketch below).
2. All nodes in the experiment are in the same rack, and all tasks are rack-local, so rack awareness does not affect the experimental results.
3. The tasks failed to achieve data locality even though no other job was running on the cluster at the same time. It seems that YARN does not attempt to allocate containers with data locality under the default scheduling mode.
)
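A minimal sketch of the block-location lookup mentioned in item 1, assuming a reachable HDFS; the class name and the input-path argument are illustrative placeholders, not taken from the report:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Hypothetical helper: prints which datanodes hold each block of a file.
  // This is the same FileSystem.getFileBlockLocations call that
  // FileInputFormat consults when it computes input splits.
  public class BlockLocationsDump {
    public static void main(String[] args) throws Exception {
      Path input = new Path(args[0]);   // e.g. the wordcount input file
      FileSystem fs = FileSystem.get(new Configuration());
      FileStatus status = fs.getFileStatus(input);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation b : blocks) {
        System.out.println("offset=" + b.getOffset()
            + " length=" + b.getLength()
            + " hosts=" + String.join(",", b.getHosts()));
      }
    }
  }

For a two-way replicated single-block file, the hosts list should name exactly two datanodes; a data-local container would have to land on one of them.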

> Fail to achieve data locality when running MapReduce and Spark on HDFS
> ----------------------------------------------------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: distributed-scheduling
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread 
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2	Hadoop-2.7.1 
>            Reporter: Huangkaixuan
>         Attachments: Hadoop_Spark_Conf.zip, YARN-DataLocality.docx
>
>
> When running a simple wordcount experiment on YARN, I noticed that the task failed to achieve data locality, even though no other job was running on the cluster at the same time. The experiment was done on a 7-node cluster (1 master, 6 data nodes/node managers), and the input of the wordcount job (both Spark and MapReduce) is a single-block file in HDFS which is two-way replicated (replication factor = 2). I ran wordcount on YARN 10 times. The results show that only 30% of tasks achieved data locality, which looks like the result of random task placement. The experiment details are in the attachments; feel free to reproduce the experiments.
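Regarding the 30% figure above, one way to quantify locality per run is to read the standard MapReduce job counters DATA_LOCAL_MAPS, RACK_LOCAL_MAPS and OTHER_LOCAL_MAPS after each wordcount job. A sketch under that assumption (the helper class name is illustrative; the job id is passed in by the caller):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Cluster;
  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.JobCounter;
  import org.apache.hadoop.mapreduce.JobID;

  // Hypothetical helper: reports how many map tasks of a finished job ran
  // data-local, rack-local, or off-rack, using the standard job counters.
  public class LocalityCounters {
    public static void main(String[] args) throws Exception {
      JobID id = JobID.forName(args[0]);   // job id of a completed run
      Job job = new Cluster(new Configuration()).getJob(id);
      Counters counters = job.getCounters();
      System.out.println("data-local maps  = "
          + counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
      System.out.println("rack-local maps  = "
          + counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue());
      System.out.println("other-local maps = "
          + counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue());
    }
  }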



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org