Posted to common-user@hadoop.apache.org by David Alves <dr...@criticalsoftware.com> on 2008/02/07 18:35:01 UTC

Skip Reduce Phase

Hi All
	First of all, since this is my first post, I must say congrats on the
great piece of software (both Hadoop and HBase).
	I've been using Hadoop & HBase for a while and I have a question; let me
briefly explain my setup:

I have an HBase database that holds information that I want to process
in a Map/Reduce job, but that first needs some pre-processing.

So I built another Map/Reduce job that uses a specific (filtered)
TableInputFormat and then pre-processes the information in a Map phase.

As I don't need any of the intermediate phases (like the merge sort) and I
don't need to do anything in the reduce phase, I was wondering if I could
just save the Map phase output and start the second Map/Reduce job using
that as input (but still saving the splits to DFS for
backtracking/reliability reasons).

Is this possible?

Thanks in advance, and again great piece of software.
David Alves
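
The two-stage, map-only pipeline described above can be sketched in plain
Java. Everything here is a hypothetical stand-in for illustration -- the
"rows", the mapper functions, and the in-memory lists playing the role of
DFS output are not Hadoop or HBase API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java sketch of two chained map-only jobs: job 1 pre-processes
// each row and persists its mapper output directly (what a reduce count
// of 0 gives you); job 2 consumes job 1's saved output as its input.
public class MapOnlyPipeline {

    // A map-only "job": each record goes straight from the mapper to the
    // output, with no sort, shuffle, or reduce in between.
    static List<String> runMapOnlyJob(List<String> input, Function<String, String> mapper) {
        List<String> output = new ArrayList<>();
        for (String record : input) {
            output.add(mapper.apply(record));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> hbaseRows = List.of("row1:raw", "row2:raw");

        // Job 1: pre-process and "save the splits to DFS".
        List<String> preProcessed = runMapOnlyJob(hbaseRows, r -> r.replace("raw", "clean"));

        // Job 2: start from job 1's saved output.
        List<String> result = runMapOnlyJob(preProcessed, r -> r.toUpperCase());

        System.out.println(result); // [ROW1:CLEAN, ROW2:CLEAN]
    }
}
```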




Re: Skip Reduce Phase

Posted by Jason Venner <ja...@attributor.com>.
We set the reduce count to 0:
        conf.setNumReduceTasks(0); // no reduce

-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: Skip Reduce Phase

Posted by Vadim Zaliva <kr...@gmail.com>.
I am sorry, it was my fault: I had not updated the JAR.
Now it seems to be working as expected. Thanks!

Vadim


Re: Skip Reduce Phase

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
Sorry, this mail was intended for somebody else. Please disregard.






Re: Skip Reduce Phase

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
Just to clarify -- setting test.build.data on the command line to point to
some arbitrary directory in /tmp should work

ant -Dtestcase=TestMapReduceLocal -Dtest.output=yes -Dtest.build.data=/tmp/foo test-core

Jothi




Re: Skip Reduce Phase

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
If you had set the number of reduce tasks to 0, you should not see the
"reduce > sort" phase. How did you set the number of reducers?
You can do that with:

job.setNumReduceTasks(0);

Jothi




Re: Skip Reduce Phase

Posted by Vadim Zaliva <kr...@gmail.com>.
On Thu, Feb 7, 2008 at 10:07, Owen O'Malley <oo...@yahoo-inc.com> wrote:

> Setting it to 0 skips all of the buffering, sorting, merging, and shuffling.
> It passes the objects straight from the mapper to the output format, which
> writes it straight to hdfs.

I just tried to set the number of Reduce tasks to 0, but the Job Tracker
shows a Reduce task working, doing "reduce > sort". I have a big data set
and it takes a while. It would be good to find a way to skip it.

Vadim

Re: Skip Reduce Phase

Posted by David Alves <dr...@criticalsoftware.com>.
Great!

Thanks Owen, Ted and Jason



Re: Skip Reduce Phase

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 7, 2008, at 9:59 AM, Ted Dunning wrote:

> I think that setting the parameter to 0 skips most of the overhead of the
> later stages.

Setting it to 0 skips all of the buffering, sorting, merging, and
shuffling. It passes the objects straight from the mapper to the
output format, which writes them straight to HDFS.

-- Owen
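
The difference Owen describes can be illustrated with a toy sketch (plain
Java invented for the example, not Hadoop internals): with zero reduces,
records reach the output in the mapper's emission order; with reduces, the
map output is buffered and sorted by key before any reducer sees it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the two paths a mapper's output can take.
public class ReducePathDemo {

    // numReduceTasks == 0: records go straight from the mapper to the
    // output format, in the order the mapper emitted them.
    static List<String> mapOnly(List<String> mapped) {
        return new ArrayList<>(mapped);
    }

    // numReduceTasks > 0: map output is buffered and sorted by key
    // before an (identity) reducer writes it out.
    static List<String> withSortPhase(List<String> mapped) {
        TreeMap<String, String> sortedByKey = new TreeMap<>();
        for (String kv : mapped) {
            sortedByKey.put(kv.split("=")[0], kv);
        }
        return new ArrayList<>(sortedByKey.values());
    }

    public static void main(String[] args) {
        List<String> mapped = List.of("b=2", "a=1", "c=3");
        System.out.println(mapOnly(mapped));       // [b=2, a=1, c=3] -- emission order
        System.out.println(withSortPhase(mapped)); // [a=1, b=2, c=3] -- sorted by key
    }
}
```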

Re: Skip Reduce Phase

Posted by Ted Dunning <td...@veoh.com>.
I think that setting the parameter to 0 skips most of the overhead of the
later stages.

Also, if you REALLY want to lower overhead, you can write a "meta-mapper"
class that hooks together a list of mappers using a purpose-built output
collector.

That will avoid the disk storage overhead completely.
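
The meta-mapper idea can be sketched like this. The `Mapper` and
`Collector` interfaces below are simplified stand-ins made up for the
example, not the real Hadoop types: each stage's collector simply invokes
the next mapper, so intermediate records stay in memory instead of being
written to disk.

```java
import java.util.ArrayList;
import java.util.List;

// Runs several mapper stages inside one map task by wiring each stage's
// output collector to the next stage in the chain.
public class MetaMapper {

    interface Collector { void collect(String record); }
    interface Mapper { void map(String record, Collector out); }

    // Feed every input record through all stages in order.
    static void run(List<Mapper> stages, List<String> input, Collector finalOut) {
        for (String record : input) {
            feed(stages, 0, record, finalOut);
        }
    }

    // Stage i's collector recursively hands each record to stage i + 1;
    // past the last stage, records go to the real output collector.
    private static void feed(List<Mapper> stages, int i, String record, Collector finalOut) {
        if (i == stages.size()) {
            finalOut.collect(record);
        } else {
            stages.get(i).map(record, r -> feed(stages, i + 1, r, finalOut));
        }
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        List<Mapper> stages = List.of(
                (r, out) -> out.collect(r.trim()),          // stage 1: clean
                (r, out) -> out.collect(r.toUpperCase()));  // stage 2: transform
        run(stages, List.of("  foo ", " bar"), results::add);
        System.out.println(results); // [FOO, BAR]
    }
}
```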




Re: Skip Reduce Phase

Posted by David Alves <dr...@criticalsoftware.com>.
Hi Ted

But wouldn't that still go through the intermediate phases and do the
merge sort and copy to the local filesystem (which is the reduce input)?

Is there a way to provide the direct map output (saved onto DFS) to
another map task, or does your suggestion already do this and this is a
moot point?

David



Re: Skip Reduce Phase

Posted by Ted Dunning <td...@veoh.com>.
Set numReducers to 0.

