Posted to mapreduce-issues@hadoop.apache.org by "Vinod K V (JIRA)" <ji...@apache.org> on 2009/09/17 06:21:57 UTC
[jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces
[ https://issues.apache.org/jira/browse/MAPREDUCE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756361#action_12756361 ]
Vinod K V commented on MAPREDUCE-989:
-------------------------------------
Any use-cases? As of now, maps and reduces _can_ manage separate DistributedCache files/archives themselves. Is the suggestion only for a separate API in job submission? Or do we want to enforce the segregation on the TTs too, preventing maps' cache files from being accessible by reduces and vice versa?
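(Managing segregation in user code today presumably comes down to a naming convention: the job adds all files to the cache, and each task type opens only the ones meant for it. A minimal sketch of such a convention, independent of the Hadoop API; the "map/"/"reduce/" prefixes and the helper class are illustrative, not part of DistributedCache:)

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative helper: segregate a job's cached files by a path-name
 * convention rather than by API. Files meant for map tasks carry a
 * "map/" prefix, files for reduce tasks a "reduce/" prefix; each task
 * type filters the full cache listing down to its own subset.
 */
public class CachePartition {
    public static List<String> filesFor(String taskType, String[] cacheFiles) {
        List<String> mine = new ArrayList<>();
        for (String f : cacheFiles) {
            // Keep only entries whose path starts with this task type's prefix.
            if (f.startsWith(taskType + "/")) {
                mine.add(f);
            }
        }
        return mine;
    }
}
```

(Note this segregates *access* in user code only; the TT still localizes every file before any task of the job runs.)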
> Allow segregation of DistributedCache for maps and reduces
> ----------------------------------------------------------
>
> Key: MAPREDUCE-989
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-989
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Reporter: Arun C Murthy
>
> Applications might have differing needs for files in the DistributedCache wrt maps and reduces. We should allow them to specify these separately.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces
Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.
Vinod,
How do maps and reduces manage the caches "themselves" now?
If I understand this feature right, the user specifies the distributed cache per job, and the tasktracker makes sure that those cache files are present on the local disks before *any* task in that job executes. By "any task," I mean setup, map, reduce, or cleanup task. Setup and cleanup tasks are a particular concern because they are the gating tasks, which, according to Amdahl's law, restrict the scalability of the parallel application.
The usage scenario is this: a star-join, where the map task filters, projects, and/or transforms, and a reduce task joins a large (1 TB) fact table with several smaller (total 4 GB) dimension tables. The dimension tables evolve slowly, say, updated once a month, whereas the fact tables are updated hourly.
So the fact tables are partitioned before the reduce stage, and the reducers need the dimension tables locally. Therefore, these dimension tables are fetched by reducers to do the join. If individual reducers fetch the dimension tables explicitly, it is extra overhead whenever more than one reducer executes on a single tasktracker. If these tables are specified as a job-level cache, then map tasks, which do not need these tables, get blocked unnecessarily while the tasktrackers fetch them into the cache.
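(The join described here is a reduce-side hash join: each reducer builds an in-memory lookup table from the locally cached dimension table and probes it with incoming fact rows. A minimal sketch of just that join step, with the Hadoop I/O stripped out; the record layout and class names are invented for illustration:)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reduce-side hash join of fact rows against a small dimension table. */
public class StarJoin {
    /** Build the lookup side once per reducer from the locally cached dimension rows. */
    public static Map<String, String> loadDimension(List<String[]> dimRows) {
        Map<String, String> dim = new HashMap<>();
        for (String[] row : dimRows) {
            dim.put(row[0], row[1]); // join key -> dimension attribute
        }
        return dim;
    }

    /** Probe: emit each fact row widened with its dimension attribute, dropping misses. */
    public static List<String> join(Map<String, String> dim, List<String[]> factRows) {
        List<String> out = new ArrayList<>();
        for (String[] fact : factRows) {
            String attr = dim.get(fact[0]);
            if (attr != null) {
                out.add(fact[0] + "," + fact[1] + "," + attr);
            }
        }
        return out;
    }
}
```

(The build side is loaded once per reducer, which is why reducers want the dimension tables on the local disk; the overhead objected to above is paying that fetch per task, or blocking map tasks on it, instead of localizing once per tasktracker for just the tasks that need it.)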
I hope this detailed explanation really explains the use case.