Posted to mapreduce-issues@hadoop.apache.org by "Vinod K V (JIRA)" <ji...@apache.org> on 2009/09/17 06:21:57 UTC

[jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces

    [ https://issues.apache.org/jira/browse/MAPREDUCE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756361#action_12756361 ] 

Vinod K V commented on MAPREDUCE-989:
-------------------------------------

Any use-cases? As of now, maps and reduces _can_ manage separate DistributedCache files/archives themselves. Is the suggestion only for a separate API in job submission? Or do we want to enforce the segregation on the TTs too - preventing maps' cache files from being accessible to reduces and vice versa?

> Allow segregation of DistributedCache for maps and reduces
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-989
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-989
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: Arun C Murthy
>
> Applications might have differing needs for files in the DistributedCache w.r.t. maps and reduces. We should allow them to be specified separately.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
Vinod

How do maps and reduces manage the caches "themselves" now?

If I understand this feature right, the user specifies the distributed
cache per job, and the tasktracker makes sure that those cache files
are present on the local disks before *any* task in that job executes.
By "any task," I mean setup, map, reduce, or cleanup tasks. Setup and
cleanup tasks are a particular concern because they are the gating
tasks which, according to Amdahl's law, restrict the scalability of
the parallel application.
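
For concreteness, here is a minimal sketch of job-level cache
registration, assuming the 0.20-era DistributedCache API (the class
name and file paths are hypothetical):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
      public static void registerDimensionTables(JobConf conf) throws Exception {
        // Registered at the job level: the tasktracker localizes these
        // files before it runs *any* task of the job - setup, map,
        // reduce, or cleanup.
        DistributedCache.addCacheFile(new URI("/dim/customers.dat"), conf);
        DistributedCache.addCacheFile(new URI("/dim/products.dat"), conf);
      }
    }

The registration here is job-wide; this issue asks for a way to scope
it to maps and reduces separately.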

The usage scenario is this: a star-join, where the map tasks filter,
project, and transform, and the reduce tasks join a large (1 TB) fact
table with several smaller (4 GB total) dimension tables. The
dimension tables evolve slowly, say, being updated once a month,
whereas the fact tables are updated hourly.

So, the fact tables are partitioned before the reduce stage, and the
reducers need the dimension tables locally; these dimension tables are
therefore fetched by the reducers to do the join. If individual
reducers fetch the dimension tables explicitly, it adds extra overhead
whenever more than one reducer executes on a single tasktracker. If,
instead, these tables are specified as a job-level cache, then the map
tasks, which do not need these tables, get blocked unnecessarily while
the tasktrackers fetch them into the cache.

I hope this explains the use case in enough detail.
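
For illustration, a rough sketch of the reduce side of this scenario,
assuming the old org.apache.hadoop.mapred API (the class name and the
in-memory join structures are hypothetical and omitted):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class StarJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      private Path[] localDims;

      public void configure(JobConf job) {
        try {
          // Local paths of the dimension tables on this tasktracker's
          // disk, resolved once per task rather than once per reduce()
          // call.
          localDims = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException e) {
          throw new RuntimeException("Could not locate cached dimension tables", e);
        }
      }

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Join each incoming fact record against in-memory lookup tables
        // built from localDims (the lookup structures themselves are
        // omitted here).
      }
    }

If only the reduce tasks could declare these files, the map tasks in
the same job would not have to wait for the 4 GB of dimension tables
to be localized.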