
Posted to common-user@hadoop.apache.org by Deepak Diwakar <dd...@gmail.com> on 2010/07/28 21:16:43 UTC

Multiple final reduced outputs

I have set up a 2-node cluster and run many jobs, including wordcount. In all
the output folders I am getting two mutually exclusive output files,
part-00000 and part-00001, instead of a single output. A merge should take
place to produce one single output file, but that is not occurring here.

Could someone point out where I am going wrong?

Thanks & regards
- Deepak Diwakar,

Re: Multiple final reduced outputs

Posted by VV <ch...@gmail.com>.

Hi Deepak,

AFAIK, the number of output files depends on the number of reduce tasks (I
hope I'm not missing any other factors). So, if a single output file is the
requirement, then setting the number of reduce tasks to 1 should work. Another
solution would be to run another job with these output files as input and
merge them.
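
To illustrate why the number of part files follows the number of reduce tasks, here is a small sketch, not Hadoop code itself. It borrows the arithmetic of Hadoop's default HashPartitioner (non-negative hash modulo the number of reduce tasks) and applies it to plain Java strings. The class and method names (PartitionSketch, partitionFor, assign) are hypothetical, and String.hashCode() is only a stand-in for the hash of a real Hadoop key type such as Text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: PartitionSketch and its methods are hypothetical names,
// not part of Hadoop.
final class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the reduce task (and hence the part-NNNNN file) that would receive them.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            int partition = partitionFor(key, numReduceTasks);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "banana", "cherry", "date");
        // Compare the 2-reducer layout from the question with the 1-reducer layout.
        for (int reducers : new int[] {2, 1}) {
            System.out.println(reducers + " reduce task(s):");
            assign(words, reducers).forEach((partition, keys) ->
                System.out.println(String.format("  part-%05d <- %s", partition, keys)));
        }
    }
}
```

Each key lands in exactly one partition, which is why the part files are mutually exclusive; with a single reduce task every key maps to partition 0, so only part-00000 is written.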

Hope this helps,
Chaitanya.

On Thu, Jul 29, 2010 at 12:46 AM, Deepak Diwakar <dd...@gmail.com> wrote:

> I have setup 2 node clusters and ran many jobs including wordcount.  In all
> the output folders i am getting two mutual exclusive output files as
> part-00000 and part-00001 instead of single output. A merging should take
> place to get into one single output file which is not occurring here .
>
> Could someone point me out where i am going wrong?
>
> Thanks & regards
> - Deepak Diwakar,
>

Re: Multiple final reduced outputs

Posted by Deepak Diwakar <dd...@gmail.com>.

Yep, Harsh. I was doing the same; I was just wondering why we don't have an
option at the master to combine them into a single file. That could be a
feature (and if it's already there, please let me know). Similar to setting
the reduce class on a job, we could set a merger/master-combiner class in the
code itself.

Also, thanks David and Chaitanya for your pointers. Actually, I was mostly
wondering about having a built-in option to merge after collecting all the
reduced outputs.

Thanks & regards
- Deepak Diwakar,

On 29 July 2010 01:12, Harsh J <qw...@gmail.com> wrote:

> Concatenating them is the easiest way to get the result back as a
> single file (its grouped/sorted anyway). For files that can't exactly
> be 'cat' together (headers, etc.), you may run your job with an
> explicit number of Reducers (or write special tools for such cases,
> cause else the limited number of reducers may impact the processing
> time).
>
> JobConf.setNumReduceTasks(int n); before submitting the job should do it.
>
> In case you've doubts about what 'merge' really means in the
> map-to-intermediate-to-reduce phases, this guide should explain it
> very well: http://wiki.apache.org/hadoop/HadoopMapReduce
>
> On Thu, Jul 29, 2010 at 12:57 AM, David Pellegrini
> <da...@datawebsystems.com> wrote:
> > Perhaps I'm missing some subtlety, but that's what I would expect.  2
> > reducer nodes -> 2 outputs.  If you need them in one big file, cat them
> > together.
> >
> > my 2 cents
> >
> > David
> >
> > On 07/28/2010 12:16 PM, Deepak Diwakar wrote:
> >>
> >> I have setup 2 node clusters and ran many jobs including wordcount.  In all
> >> the output folders i am getting two mutual exclusive output files as
> >> part-00000 and part-00001 instead of single output. A merging should take
> >> place to get into one single output file which is not occurring here .
> >>
> >> Could someone point me out where i am going wrong?
> >>
> >> Thanks&  regards
> >> - Deepak Diwakar,
> >>
> >>
> >
>
>
>
> --
> Harsh J
> www.harshj.com
>

Re: Multiple final reduced outputs

Posted by Harsh J <qw...@gmail.com>.

Concatenating them is the easiest way to get the result back as a
single file (it's grouped/sorted anyway). For files that can't simply
be 'cat' together (headers, etc.), you may run your job with an
explicit number of reducers (or write special tools for such cases,
because otherwise the limited number of reducers may impact the
processing time).
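
As a rough sketch of the concatenation approach, the following plain-Java helper appends every part-* file in a directory, in name order, to one merged file. It assumes the job output has already been copied to the local filesystem; it does not talk to HDFS. The class name PartFileMerger, the method merge, and the file name merged.txt are hypothetical. For output that still lives on HDFS, Hadoop's own `hadoop fs -getmerge` shell command performs a similar concatenation into a local file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: PartFileMerger is a hypothetical helper, not a Hadoop class.
final class PartFileMerger {

    // Appends all part-* files found in outputDir to merged, in file-name order,
    // and returns how many part files were concatenated.
    static int merge(Path outputDir, Path merged) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        // Directory listings are unordered, so sort to get part-00000, part-00001, ...
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams the file's bytes onto the end of merged
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage (assumed layout): java PartFileMerger <local-output-dir> <merged-file>
        Path outputDir = Paths.get(args.length > 0 ? args[0] : ".");
        Path merged = Paths.get(args.length > 1 ? args[1] : "merged.txt");
        System.out.println("Merged " + merge(outputDir, merged) + " part file(s) into " + merged);
    }
}
```

The merged file should be given a name that does not start with part- (or be written outside the output directory) so that a second run does not pick it up as an input.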

Calling JobConf.setNumReduceTasks(int n) before submitting the job should do it.

In case you have doubts about what 'merge' really means in the
map-to-intermediate-to-reduce phases, this guide should explain it
very well: http://wiki.apache.org/hadoop/HadoopMapReduce

On Thu, Jul 29, 2010 at 12:57 AM, David Pellegrini
<da...@datawebsystems.com> wrote:
> Perhaps I'm missing some subtlety, but that's what I would expect.  2
> reducer nodes -> 2 outputs.  If you need them in one big file, cat them
> together.
>
> my 2 cents
>
> David
>
> On 07/28/2010 12:16 PM, Deepak Diwakar wrote:
>>
>> I have setup 2 node clusters and ran many jobs including wordcount.  In all
>> the output folders i am getting two mutual exclusive output files as
>> part-00000 and part-00001 instead of single output. A merging should take
>> place to get into one single output file which is not occurring here .
>>
>> Could someone point me out where i am going wrong?
>>
>> Thanks&  regards
>> - Deepak Diwakar,
>>
>>
>

-- 
Harsh J
www.harshj.com

Re: Multiple final reduced outputs

Posted by David Pellegrini <da...@datawebsystems.com>.

Perhaps I'm missing some subtlety, but that's what I would expect.  2 
reducer nodes -> 2 outputs.  If you need them in one big file, cat them 
together.

my 2 cents

David

On 07/28/2010 12:16 PM, Deepak Diwakar wrote:
> I have setup 2 node clusters and ran many jobs including wordcount.  In all
> the output folders i am getting two mutual exclusive output files as
> part-00000 and part-00001 instead of single output. A merging should take
> place to get into one single output file which is not occurring here .
>
> Could someone point me out where i am going wrong?
>
> Thanks&  regards
> - Deepak Diwakar,
>
>