You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2009/02/06 15:35:52 UTC

any doc on the map-reduce aspect of on operators

I am looking for a documentation which clearly specifies how each of the
operators use the mapreduce paradigm. For example foreach..generate may not
use any reducer at all ( I am assming ). cogoup,group could first split the
input data using some partitioning function, where key being the fields
passed by teh user and then reducer being used to generate the final tuple.

These are all intuitive but I am looking for a document which clearly
specifies how each operator does map/reduce with which
partitioning-function, combiner etc.

-Thanks,
Prasen

Re: any doc on the map-reduce aspect of on operators

Posted by Chris Olston <ol...@yahoo-inc.com>.
Our SIGMOD'08 paper gives the general idea, but doesn't give an exhaustive
list.

It's pretty much what you'd expect: each operator that requires a global
re-organization of the data (group, cogroup, join, sort) becomes a
map-reduce boundary. Other, non-blocking, operators (foreach, filter, etc.)
are formed into pipelines in the map function or the reduce function.

So for example:

load
foreach
filter
group
filter2
foreach2
store

becomes:

map(load, foreach, filter, partition by group key)
reduce(form groups, filter2, foreach2, store)

Another example:

load
foreach
group
foreach2
group2
store

becomes:

map(load, foreach, partition by group key)
reduce(form groups, foreach2)
map(partition by group key 2)
reduce(form groups 2, store)


Some other notes:

* sort kicks off an initial map-reduce job to sample the data distribution
to determine a balanced range partitioning to use
* with cogroup/join, the incoming map function will have N independent
pipelines, one for each of the tables being cogrouped/joined
* if there's a choice of putting a pipelinable operator (foreach, filter,
...) in reduce1 or map2, I believe pig always puts it in reduce1 -- although
there are certainly cases where it's better to put it in map2 (would depend
on operator selectivity, among other things)


-Chris






On 2/6/09 6:35 AM, "prasenjit mukherjee" <pr...@gmail.com> wrote:

> I am looking for a documentation which clearly specifies how each of the
> operators use the mapreduce paradigm. For example foreach..generate may not
> use any reducer at all ( I am assming ). cogoup,group could first split the
> input data using some partitioning function, where key being the fields
> passed by teh user and then reducer being used to generate the final tuple.
> 
> These are all intuitive but I am looking for a document which clearly
> specifies how each operator does map/reduce with which
> partitioning-function, combiner etc.
> 
> -Thanks,
> Prasen

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research