Posted to user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/04/05 22:30:17 UTC

Difference between combiner and aggregator

Hi,
 I am trying to understand the difference between combiner and aggregator.

Based on my readings:
Wordcount example (mapper)

aggregator:
class Mapper
  method MAP
    H <-- Associative array
    for all term t in document:
        H{t} = H{t} + 1
    for all term t in H do
        EMIT(term t, count H{t})


combiner:
class Mapper
  method INITIALIZE
    H <-- Associative array
  method MAP
    for all term t in document:
        H{t} = H{t} + 1
  method CLOSE
    for all term t in H do
        EMIT(term t, count H{t})

So the second method is how a combiner is implemented, but the first one
seems much simpler.
What do I gain by using a combiner instead of local aggregation?
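
To make the two variants concrete, here is a minimal Python sketch of both.
It is plain Python, not the Hadoop API, and the names map_per_document and
InMapperCombiningMapper are hypothetical, chosen only for illustration. The
first variant aggregates inside a single map call; the second keeps its table
across map calls and emits once at the end, a pattern often called in-mapper
combining.

```python
from collections import Counter


def map_per_document(document):
    """Variant 1: count terms within one map call and emit once per document."""
    counts = Counter(document.split())
    return list(counts.items())


class InMapperCombiningMapper:
    """Variant 2: keep one table across all map calls and emit in close()."""

    def __init__(self):
        # Corresponds to INITIALIZE: the table outlives a single map call.
        self.counts = Counter()

    def map(self, document):
        # Nothing is emitted here; the counts are only accumulated.
        self.counts.update(document.split())

    def close(self):
        # Corresponds to CLOSE: emit one pair per distinct term seen by this mapper.
        return list(self.counts.items())


documents = ["a b a", "b c", "a c c"]

# Variant 1 emits a separate batch of pairs for every document.
per_document = [pair for doc in documents for pair in map_per_document(doc)]

# Variant 2 emits a single batch for everything the mapper has seen.
mapper = InMapperCombiningMapper()
for doc in documents:
    mapper.map(doc)
across_documents = mapper.close()

print(len(per_document), len(across_documents))
```

With these three toy documents the first variant emits more pairs than the
second, because a term that occurs in several documents is emitted once per
document rather than once per mapper. The trade-off is that the second variant
has to hold the whole table in memory until close is called.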

Re: Difference between combiner and aggregator

Posted by Jens Scheidtmann <je...@gmail.com>.

Dear jamal sasha,

The usual example goes like this:

class Mapper
  method MAP (Line l)
     document <- split l in Terms t
     for all Terms t in document
        EMIT(Term t, one)


class Combiner
  method REDUCE(Term t, List of Counts lc)
     cnt <- sum lc
     EMIT(Term t, Count cnt)

class Reducer
  method REDUCE(Term t, List of Counts lc)
     cnt <- sum lc
     EMIT(Term t, Count cnt)


The combiner runs node-locally on the mapper output (before the shuffle). Its
output is used as input to the reducers (after the shuffle). A combiner is
an I/O optimization. The framework gives no guarantee that a combiner is
called at all; it may run zero, one, or several times on the output.
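
Because the combiner may run any number of times, the job has to produce the
same result with or without it. The sketch below simulates this in plain
Python; it is not Hadoop code. The function names (map_line, reduce_counts,
group_by_key, run_job) and the combiner_passes parameter are assumptions made
up for this illustration.

```python
from collections import defaultdict


def map_line(line):
    """Mapper: emit (term, 1) for every term in the line."""
    return [(term, 1) for term in line.split()]


def reduce_counts(term, counts):
    """Summing logic shared by the combiner and the reducer."""
    return (term, sum(counts))


def group_by_key(pairs):
    """Group values by key, standing in for the framework's sort and group step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, combiner_passes):
    """Simulate one job; each split plays the role of one mapper's input.

    combiner_passes is a hypothetical knob for how often the combiner runs
    on each mapper's output (0 means it is never called).
    """
    shuffled = []
    for lines in splits:
        output = [pair for line in lines for pair in map_line(line)]
        for _ in range(combiner_passes):
            output = [reduce_counts(t, c) for t, c in group_by_key(output).items()]
        shuffled.extend(output)
    final = dict(reduce_counts(t, c) for t, c in group_by_key(shuffled).items())
    return final, len(shuffled)


splits = [["a b a", "b c"], ["a c c"]]
for passes in (0, 1, 2):
    counts, shuffled_pairs = run_job(splits, passes)
    print(passes, counts, shuffled_pairs)
```

The final counts are identical for every value of combiner_passes; only the
number of pairs that cross the shuffle changes. This works here because
summing is associative and commutative, so the reducer logic can be reused
as the combiner.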

Best regards,

Jens
