Posted to common-dev@hadoop.apache.org by Amogh Vasekar <va...@yahoo-inc.com> on 2008/11/21 06:48:46 UTC

combiner without reducer

Hi,
I believe a combiner is currently not run unless at least one reducer
is set.
Leaving aside the Hadoop-18 semantics of the combiner running on both
sides (the number of reducers is 0 anyway, so I guess the merge-combine
doesn't come into the picture at all), I have a use case where I would
like to run a combiner without a reducer.
Basically, the aggregation (a lookup sort of thing) I do depends on a
relatively small dataset and is independent of the records in the map
input data that form the input dataset, hence the motivation for
combine-without-reduce.
What I wanted to do was aggregate the similar records in the combiner
(or in a particular instance of the combiner) in a single shot, with
this forming my output. This would save me the intermediate I/O
involved in the shuffle-and-sort (S&S) phase, at some partial I/O cost
on the map + combine side, and I just wanted to try it out to see if
it's feasible at all.
Given that a combiner without a reducer is not supported, I was
thinking of doing it the way Hadoop itself would: create a buffer,
sort, and combine as I flush.
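
To make the buffer / sort / combine-on-flush idea concrete, here is a
rough sketch in plain Java. It is not Hadoop code: the class name
CombiningBuffer, the capacity parameter, and the combiner and sink
callbacks are all made up for illustration. In a real map-only job the
sink would presumably wrap whatever the mapper writes its output to.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

/**
 * Hypothetical helper (not a Hadoop class): buffers key/value pairs and,
 * on flush, sorts them by key and combines each run of equal keys.
 */
public class CombiningBuffer<K extends Comparable<K>, V> {
    private final List<Map.Entry<K, V>> buffer = new ArrayList<>();
    private final int capacity;               // assumed tuning knob: records held before a flush
    private final BinaryOperator<V> combiner; // merges two values that share a key
    private final BiConsumer<K, V> sink;      // receives each combined record

    public CombiningBuffer(int capacity, BinaryOperator<V> combiner, BiConsumer<K, V> sink) {
        this.capacity = capacity;
        this.combiner = combiner;
        this.sink = sink;
    }

    /** Buffers one record; flushes automatically once the buffer is full. */
    public void add(K key, V value) {
        buffer.add(new SimpleEntry<>(key, value));
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    /** Sorts the buffered records by key, combines equal keys, and emits the results. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        buffer.sort(Map.Entry.comparingByKey());
        K currentKey = buffer.get(0).getKey();
        V combined = buffer.get(0).getValue();
        for (int i = 1; i < buffer.size(); i++) {
            Map.Entry<K, V> entry = buffer.get(i);
            if (entry.getKey().compareTo(currentKey) == 0) {
                // Same key as the current run: fold the value in.
                combined = combiner.apply(combined, entry.getValue());
            } else {
                // Key changed: emit the finished run and start a new one.
                sink.accept(currentKey, combined);
                currentKey = entry.getKey();
                combined = entry.getValue();
            }
        }
        sink.accept(currentKey, combined);
        buffer.clear();
    }
}
```

The mapper would call add() for every record and flush() once more when
it finishes. Note that a key can be emitted more than once if it shows
up in more than one flush, so a true single-shot aggregation only
happens when the distinct keys seen by one map task fit in the buffer.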
Any thoughts on this would be really helpful.

Thanks,
Amogh

Re: combiner without reducer

Posted by Ian Swett <is...@yahoo.com>.



--- On Thu, 11/20/08, Amogh Vasekar <va...@yahoo-inc.com> wrote:

> From: Amogh Vasekar <va...@yahoo-inc.com>
> Subject: combiner without reducer
> To: core-dev@hadoop.apache.org, core-user@hadoop.apache.org
> Date: Thursday, November 20, 2008, 9:48 PM