Posted to mapreduce-dev@hadoop.apache.org by "Eric Payne (JIRA)" <ji...@apache.org> on 2016/01/11 18:18:40 UTC

[jira] [Resolved] (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved MAPREDUCE-2011.
-----------------------------------
    Resolution: Won't Fix

[~knoguchi], here are [~jlowe]'s comments from an offline discussion:
I think the distributed cache already behaves the way you want, at least in YARN. When a resource request arrives at the nodemanager, it tries to look up the local resource info for that request. If it finds it (i.e., a cache hit), it just increments the refcount of the resource; I don't see any attempt to stat HDFS to verify the file is still there. The only time I see the timestamp of the request compared with HDFS is when it tries to download the resource from HDFS.
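
The behaviour described above could be sketched roughly as follows. This is only an illustrative model, not the nodemanager's actual code: the class names (LocalResourceCacheSketch, FakeRemoteFs, LocalCache), the path-plus-timestamp cache key, and the simulate helper are all assumptions made for the example. It shows a cache hit bumping a refcount without touching the remote filesystem, and the timestamp being checked only on the download path.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalResourceCacheSketch {

    /** Hypothetical stand-in for HDFS that counts metadata lookups. */
    static class FakeRemoteFs {
        final Map<String, Long> modTimes = new HashMap<>();
        int statCalls = 0;

        long getModificationTime(String path) {
            statCalls++;
            Long t = modTimes.get(path);
            if (t == null) {
                throw new IllegalStateException("missing: " + path);
            }
            return t;
        }
    }

    /** Hypothetical per-node cache: one refcount per localized resource. */
    static class LocalCache {
        private final Map<String, Integer> refCounts = new HashMap<>();
        private final FakeRemoteFs fs;

        LocalCache(FakeRemoteFs fs) {
            this.fs = fs;
        }

        /** Returns the refcount after the request has been served. */
        int request(String path, long expectedTimestamp) {
            // Assumption: the cache key combines path and timestamp.
            String key = path + "@" + expectedTimestamp;
            Integer hit = refCounts.get(key);
            if (hit != null) {
                // Cache hit: only bump the refcount, no remote stat.
                refCounts.put(key, hit + 1);
                return hit + 1;
            }
            // Cache miss: the timestamp is checked only while downloading.
            long actual = fs.getModificationTime(path);
            if (actual != expectedTimestamp) {
                throw new IllegalStateException("resource changed: " + path);
            }
            refCounts.put(key, 1);
            return 1;
        }
    }

    /** Runs `tasks` tasks that each request the same `files` resources. */
    static int simulate(int tasks, int files) {
        FakeRemoteFs fs = new FakeRemoteFs();
        for (int f = 0; f < files; f++) {
            fs.modTimes.put("/cache/file" + f, 1000L + f);
        }
        LocalCache cache = new LocalCache(fs);
        for (int t = 0; t < tasks; t++) {
            for (int f = 0; f < files; f++) {
                cache.request("/cache/file" + f, 1000L + f);
            }
        }
        return fs.statCalls;
    }

    public static void main(String[] args) {
        int withCache = simulate(500, 20);
        System.out.println("remote stat calls with cache: " + withCache
                + " (one per task per file would be " + (500 * 20) + ")");
    }
}
```

In this toy model, 500 tasks that each request the same 20 resources on one node trigger 20 remote lookups, one per first-time download, rather than one per task per resource.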

> Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2011
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distributed-cache
>            Reporter: Koji Noguchi
>
> On our cluster, we had jobs with 20 distributed cache entries and very short-lived tasks. With 500 map tasks launched per second, this resulted in 10,000 getFileStatus calls per second to the namenode. The namenode can handle this, but we are asking whether the number can be reduced somehow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)