
Posted to user@crunch.apache.org by Narlin M <hp...@gmail.com> on 2013/08/14 21:59:27 UTC

Crunch DoFn vs Mapper/reducer

I have just recently started using Crunch, having been recommended to use
it instead of writing plain MapReduce jobs. As I was going through the
Crunch documentation, some questions came to mind. Am I correct in
saying that the DoFn family of functions will internally spawn MapReduce
jobs, so there is no need to write separate mapper or reducer classes? If
so, I agree that this will abstract some of the lower-level details from
the programmer, but at the same time, does it not reduce the programmer's
control over the processing logic?

Also, will there be situations when separate mapper / reducer classes will
be required in addition to the DoFn functions?

Thanks.

Re: Crunch DoFn vs Mapper/reducer

Posted by Narlin M <hp...@gmail.com>.

Thanks for the reply, Josh. I understand its function a bit better now.


Re: Crunch DoFn vs Mapper/reducer

Posted by Josh Wills <jw...@cloudera.com>.

Hey Narlin,

DoFns are similar to the Mapper and Reducer classes that you would write in
classic MapReduce jobs; they don't spawn MapReduce jobs themselves. The
Crunch planner will analyze the overall DAG of DoFns, groupByKeys, unions,
and combineValues operations and compile the DAG into one or more MapReduce
jobs, where each of the DoFns will be assigned to one of the Mappers or
Reducers in those jobs. Crunch has its own Mapper and Reducer
implementations (named CrunchMapper and CrunchReducer, naturally) that are
responsible for executing the DoFns that are assigned to each phase of the
job.
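
To make the point above concrete, here is a minimal plain-Java sketch (no
Crunch dependency) of the idea that several DoFn-style functions can be fused
into a single phase instead of each one launching a job. ToyDoFn, ToyEmitter,
and ToyPlannerDemo are hypothetical names invented for this illustration; they
are not Crunch classes, and the real planner is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration only; these are NOT Crunch's classes.
interface ToyEmitter<T> {
    void emit(T value);
}

interface ToyDoFn<S, T> {
    void process(S input, ToyEmitter<T> emitter);
}

final class ToyPlannerDemo {
    // Fuse two functions into one, so both run inside the same simulated phase.
    static <A, B, C> ToyDoFn<A, C> fuse(ToyDoFn<A, B> first, ToyDoFn<B, C> second) {
        return (input, emitter) ->
                first.process(input, mid -> second.process(mid, emitter));
    }

    // Simulated map phase: apply one (possibly fused) function to every record.
    static <S, T> List<T> runPhase(List<S> inputs, ToyDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    // Simulated shuffle plus reduce phase: group equal words and count them.
    static Map<String, Integer> countByKey(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ToyDoFn<String, String> splitWords = (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        };
        ToyDoFn<String, String> lowerCase =
                (word, emitter) -> emitter.emit(word.toLowerCase(Locale.ROOT));

        // Two user-written functions, but only one simulated map phase.
        ToyDoFn<String, String> mapPhase = fuse(splitWords, lowerCase);
        List<String> words = runPhase(
                Arrays.asList("Crunch plans jobs", "crunch runs DoFns"), mapPhase);

        // Prints {crunch=2, dofns=1, jobs=1, plans=1, runs=1}
        System.out.println(countByKey(words));
    }
}
```

The user only writes splitWords and lowerCase; deciding that both belong to
the same phase is left to the (here, trivial) planning step.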

In general, you should not need to use mapper and reducer classes when you
use Crunch, although if you have legacy Mapper and Reducer classes that you
would like to use in conjunction with the DoFns in a Crunch pipeline, there
is a collection of methods in org.apache.crunch.lib.Mapreduce in Crunch
0.7.0 that will wrap a given Mapper or Reducer class inside of a DoFn.
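
As a rough illustration of what "wrapping a Mapper inside a DoFn" means, the
following plain-Java sketch adapts a mapper-style callback to a DoFn-style
one. Every type here (SketchMapper, SketchContext, SketchDoFn, SketchEmitter,
MapperAdapterDemo) is a hypothetical stand-in made up for this example; it
does not reproduce the signatures of the Crunch wrapper methods or of Hadoop's
Mapper, so check the Crunch API docs before relying on any of it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT Hadoop's or Crunch's classes.
interface SketchContext<K, V> {
    void write(K key, V value);
}

interface SketchMapper<KI, VI, KO, VO> {
    void map(KI key, VI value, SketchContext<KO, VO> context);
}

interface SketchEmitter<T> {
    void emit(T value);
}

interface SketchDoFn<S, T> {
    void process(S input, SketchEmitter<T> emitter);
}

final class MapperAdapterDemo {
    // Adapter: every context.write(k, v) from the mapper becomes one emitted pair.
    static <KI, VI, KO, VO> SketchDoFn<Map.Entry<KI, VI>, Map.Entry<KO, VO>> wrap(
            SketchMapper<KI, VI, KO, VO> mapper) {
        return (input, emitter) -> mapper.map(
                input.getKey(),
                input.getValue(),
                (k, v) -> emitter.emit(new AbstractMap.SimpleImmutableEntry<>(k, v)));
    }

    public static void main(String[] args) {
        // A "legacy" mapper: for each word in the line, write (word, 1).
        SketchMapper<Long, String, String, Integer> legacy = (offset, line, context) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(word, 1);
                }
            }
        };

        SketchDoFn<Map.Entry<Long, String>, Map.Entry<String, Integer>> asDoFn = wrap(legacy);

        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        asDoFn.process(new AbstractMap.SimpleImmutableEntry<>(0L, "hello crunch"), out::add);

        // Prints [hello=1, crunch=1]
        System.out.println(out);
    }
}
```

The legacy mapper is left untouched; only the thin adapter knows about both
calling conventions, which is the general shape of such a wrapper.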

Hope that helps.

Best,
Josh

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>