Posted to hdfs-user@hadoop.apache.org by Pavel Hančar <pa...@gmail.com> on 2013/01/03 22:11:24 UTC

more reduce tasks

  Hello,
I'd like to use more than one reduce task with Hadoop Streaming, but I'd
like to end up with only one result. Is that possible, or should I run one
more job to merge the results? And is it the same with non-streaming jobs?
Below you can see that I get 5 result files for mapred.reduce.tasks=5.

$ hadoop jar
/packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
-D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
-file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
.
.
.
13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
job_201301021717_0038
13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
$ hadoop dfs -cat 1gb.wc/part-*
472173052
165736187
201719914
184376668
163872819
$

where /tmp/wcc contains
#!/bin/bash
wc -c

Thanks for any answer,
 Pavel Hančar

Re: more reduce tasks

Posted by Chen He <ai...@gmail.com>.
Hi Bejoy,

Thank you for your idea.

The Hadoop patch I mentioned means this merge would happen during the
output writing process.

Regards!

Chen
On Jan 3, 2013 11:25 PM, <be...@gmail.com> wrote:

>
> Hi Chen,
>
> You do have an option in hadoop to achieve this if you want the merged
> file in LFS.
>
> 1) Run your job with n number of reducers. And you'll have n files in the
> output dir.
>
> 2) Issue a hadoop fs -getmerge command to get the files in output dir
> merged into a single file in LFS
> (In recent versions use 'hdfs dfs -getmerge')
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> From: Chen He <ai...@gmail.com>
> Date: Thu, 3 Jan 2013 22:55:36 -0600
> To: <us...@hadoop.apache.org>
> Reply-To: user@hadoop.apache.org
> Subject: Re: more reduce tasks
>
> Sounds like you want more reducer to reduce the execution time but only
> want a single output file.
>
> Is this waht you want?
>
> You can use as many as your want (may not be optimal) reducers when you
> are running your reducer. Once the program is done, write a small perl,
> python, or shell program connect those part-* files.
>
> if you do not want to write your own script to connect those files and let
> Hadoop automatically generate a single file.
>
> It may need some patched to current Hadoop. I am not sure they are ready
> or not.
>
> On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
> vinodkv@hortonworks.com> wrote:
>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the way
>> to go. If not, second stage won't help. What exactly are your objectives?
>>
>>  Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>>   Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming and I'd
>> like to have only one result. Is it possible? Or should I run one more job
>> to merge the result? And is it the same with non-streaming jobs? Below you
>> see, I have 5 results for mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>>  Pavel Hančar
>>
>>
>>
>

Re: more reduce tasks

Posted by be...@gmail.com.
Hi Chen,

You do have an option in Hadoop to achieve this if you want the merged file in the local file system (LFS).

1) Run your job with n reducers, and you'll have n files in the output dir.

2) Issue a hadoop fs -getmerge command to get the files in the output dir merged into a single file in the LFS.
(In recent versions use 'hdfs dfs -getmerge'.)
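
For example, a minimal sketch of step 2, using the 1gb.wc output dir from the original post (the local file name 1gb.wc.merged is just an illustration):

$ hadoop fs -getmerge 1gb.wc 1gb.wc.merged   # concatenates all part-* files into one local file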


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Chen He <ai...@gmail.com>
Date: Thu, 3 Jan 2013 22:55:36 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: more reduce tasks

Sounds like you want more reducer to reduce the execution time but only
want a single output file.

Is this waht you want?

You can use as many as your want (may not be optimal) reducers when you are
running your reducer. Once the program is done, write a small perl, python,
or shell program connect those part-* files.

if you do not want to write your own script to connect those files and let
Hadoop automatically generate a single file.

It may need some patched to current Hadoop. I am not sure they are ready or
not.

On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>


Re: more reduce tasks

Posted by Chen He <ai...@gmail.com>.
Sounds like you want more reducers to reduce the execution time but only
want a single output file.

Is this what you want?

You can use as many reducers as you want (it may not be optimal) when you
run your job. Once the job is done, write a small Perl, Python, or shell
program to concatenate those part-* files.

If you do not want to write your own script to concatenate those files and
would rather let Hadoop automatically generate a single file, that may need
some patches to current Hadoop. I am not sure whether they are ready or not.
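
For instance, a minimal sketch of such a script for the byte-count job in the original post (the awk sum is only an illustration, assuming the 1gb.wc output dir):

$ hadoop dfs -cat 1gb.wc/part-* | awk '{ sum += $1 } END { print sum }'   # sums the five per-reducer counts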

On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>

Re: more reduce tasks

Posted by Pavel Hančar <pa...@gmail.com>.
  Thank you, and I apologize for my poor understanding of MapReduce. I
forgot that the data for reduce tasks are grouped by key, and therefore
nothing like a second round is needed. My "simple example" is flawed,
because there is no key (which is the same as having only one). Fewer keys
than reduce tasks will never work.
  Thank you for your helpfulness,
   Pavel Hančar

2013/1/5 Harsh J <ha...@cloudera.com>

> What do you mean by a "final reduce"? Not all jobs require that the
> final output result be singular, since the reducer phase is provided
> to work on a per-partition basis (also why the files are named
> part-*). One job consists of only one reduce phase, wherein the
> reducers all work independently and complete.
>
> If you need a result assembled together in order of the partitions
> created, rely on the above provided solutions such as a second step of
> fs -getmerge, or a call of the same in a custom FileOutputCommitter,
> etc.
>
> On Fri, Jan 4, 2013 at 2:05 PM, Pavel Hančar <pa...@gmail.com>
> wrote:
> >   Hello,
> > thank you for the answer. Exactly: I want the parallelism but a single
> final
> > output. What do you mean by "another stage"? I thought I should set
> > mapred.reduce.tasks large enough and hadoop will run the reducers in so
> many
> > rounds it will be optimal. But it isn't the case.
> >   When I tried to run the classical WordCount example, and try to set
> this
> > by JobConf.setNumReduceTasks(int n), it seemed to me I had the final
> output
> > (there were no word duplicates for the normal words -- only some for
> strange
> > words). So why the hadoop doesn't run the final reduce in my simple
> > streaming example?
> >   Thank you,
> >   Pavel Hančar
> >
> > 2013/1/4 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
> >>
> >>
> >> Is it that you want the parallelism but a single final output? Assuming
> >> your first job's reducers generate a small output, another stage is the
> way
> >> to go. If not, second stage won't help. What exactly are your
> objectives?
> >>
> >> Thanks,
> >> +Vinod
> >>
> >> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
> >>
> >>   Hello,
> >> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> >> like to have only one result. Is it possible? Or should I run one more
> job
> >> to merge the result? And is it the same with non-streaming jobs? Below
> you
> >> see, I have 5 results for mapred.reduce.tasks=5.
> >>
> >> $ hadoop jar
> >>
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> >> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file
> /tmp/wcc
> >> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> >> .
> >> .
> >> .
> >> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> >> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> >> job_201301021717_0038
> >> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> >> $ hadoop dfs -cat 1gb.wc/part-*
> >> 472173052
> >> 165736187
> >> 201719914
> >> 184376668
> >> 163872819
> >> $
> >>
> >> where /tmp/wcc contains
> >> #!/bin/bash
> >> wc -c
> >>
> >> Thanks for any answer,
> >>  Pavel Hančar
> >>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: more reduce tasks

Posted by Harsh J <ha...@cloudera.com>.
What do you mean by a "final reduce"? Not all jobs require that the
final output result be singular, since the reducer phase is provided
to work on a per-partition basis (also why the files are named
part-*). One job consists of only one reduce phase, wherein the
reducers all work independently and complete.

If you need a result assembled together in the order of the partitions
created, rely on the solutions provided above, such as a second step of
fs -getmerge, or a call of the same in a custom FileOutputCommitter,
etc.
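
As a concrete illustration of the per-partition outputs (a sketch only, assuming the 1gb.wc dir from the original post):

$ hadoop fs -ls 1gb.wc                       # one part-0000N file per reduce partition
$ hadoop fs -getmerge 1gb.wc 1gb.wc.merged   # assembles the part files, in partition order, into one local file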

On Fri, Jan 4, 2013 at 2:05 PM, Pavel Hančar <pa...@gmail.com> wrote:
>   Hello,
> thank you for the answer. Exactly: I want the parallelism but a single final
> output. What do you mean by "another stage"? I thought I should set
> mapred.reduce.tasks large enough and hadoop will run the reducers in so many
> rounds it will be optimal. But it isn't the case.
>   When I tried to run the classical WordCount example, and try to set this
> by JobConf.setNumReduceTasks(int n), it seemed to me I had the final output
> (there were no word duplicates for the normal words -- only some for strange
> words). So why the hadoop doesn't run the final reduce in my simple
> streaming example?
>   Thank you,
>   Pavel Hančar
>
> 2013/1/4 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
>>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the way
>> to go. If not, second stage won't help. What exactly are your objectives?
>>
>> Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>>   Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming and I'd
>> like to have only one result. Is it possible? Or should I run one more job
>> to merge the result? And is it the same with non-streaming jobs? Below you
>> see, I have 5 results for mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>>  Pavel Hančar
>>
>>
>



-- 
Harsh J

Re: more reduce tasks

Posted by Pavel Hančar <pa...@gmail.com>.
  Hello,
thank you for the answer. Exactly: I want the parallelism but a single
final output. What do you mean by "another stage"? I thought I should set
mapred.reduce.tasks large enough and Hadoop would run the reducers in as
many rounds as is optimal. But that isn't the case.
  When I tried to run the classical WordCount example and set this by
JobConf.setNumReduceTasks(int n), it seemed to me I had the final output
(there were no word duplicates for the normal words -- only some for
strange words). So why doesn't Hadoop run a final reduce in my simple
streaming example?
  Thank you,
  Pavel Hančar
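
One way to read "another stage" is a second streaming job over the first job's output, run with a single reducer. A rough sketch only (the /tmp/sum script and the 1gb.wc.total output dir are hypothetical names; the streaming jar path is the same one used above):

$ cat /tmp/sum
#!/bin/bash
awk '{ s += $1 } END { print s }'
$ hadoop jar /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
    -D mapred.reduce.tasks=1 -mapper /bin/cat -reducer /tmp/sum -file /tmp/sum \
    -input 1gb.wc -output 1gb.wc.total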

2013/1/4 Vinod Kumar Vavilapalli <vi...@hortonworks.com>

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>

Re: more reduce tasks

Posted by Chen He <ai...@gmail.com>.
Sounds like you want more reducer to reduce the execution time but only
want a single output file.

Is this waht you want?

You can use as many as your want (may not be optimal) reducers when you are
running your reducer. Once the program is done, write a small perl, python,
or shell program connect those part-* files.

if you do not want to write your own script to connect those files and let
Hadoop automatically generate a single file.

It may need some patched to current Hadoop. I am not sure they are ready or
not.

On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>

Re: more reduce tasks

Posted by Pavel Hančar <pa...@gmail.com>.
  Hello,
thank you for the answer. Exactly: I want the parallelism but a single
final output. What do you mean by "another stage"? I thought I should
setmapred.reduce.tasks large enough and hadoop will run the reducers
in so
many rounds it will be optimal. But it isn't the case.
  When I tried to run the classical WordCount example, and try to set this
by JobConf.setNumReduceTasks(int n), it seemed to me I had the final output
(there were no word duplicates for the normal words -- only some for
strange words). So why the hadoop doesn't run the final reduce in my simple
streaming example?
  Thank you,
  Pavel Hančar

2013/1/4 Vinod Kumar Vavilapalli <vi...@hortonworks.com>

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>

Re: more reduce tasks

Posted by Chen He <ai...@gmail.com>.
Sounds like you want more reducer to reduce the execution time but only
want a single output file.

Is this waht you want?

You can use as many as your want (may not be optimal) reducers when you are
running your reducer. Once the program is done, write a small perl, python,
or shell program connect those part-* files.

if you do not want to write your own script to connect those files and let
Hadoop automatically generate a single file.

It may need some patched to current Hadoop. I am not sure they are ready or
not.

On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>

Re: more reduce tasks

Posted by Pavel Hančar <pa...@gmail.com>.
  Hello,
thank you for the answer. Exactly: I want the parallelism but a single
final output. What do you mean by "another stage"? I thought I should set
mapred.reduce.tasks large enough and Hadoop would run the reducers in as
many rounds as is optimal. But that isn't the case.
  When I tried to run the classical WordCount example and set this via
JobConf.setNumReduceTasks(int n), it seemed to me that I had the final
output (there were no duplicate counts for the normal words -- only some for
strange words). So why doesn't Hadoop run a final reduce in my simple
streaming example?
  Thank you,
  Pavel Hančar

2013/1/4 Vinod Kumar Vavilapalli <vi...@hortonworks.com>

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>

Re: more reduce tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Is it that you want the parallelism but a single final output? Assuming your first job's reducers generate a small output, another stage is the way to go. If not, a second stage won't help. What exactly are your objectives?
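
As a rough sketch of what such a second stage could look like for the
streaming example quoted below: run a second streaming job over the first
job's output directory with a single reducer. The output path 1gb.wc.total
and the summing reducer script /tmp/sum are made up for illustration (the
original wc -c reducer would count the bytes of the partial results rather
than add them up), so adapt them to your data:

$ hadoop jar /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
    -D mapred.reduce.tasks=1 \
    -mapper /bin/cat \
    -reducer /tmp/sum -file /tmp/sum \
    -input 1gb.wc -output 1gb.wc.total

where /tmp/sum contains
#!/bin/bash
# add up the partial byte counts produced by the first job's reducers
awk '{ s += $1 } END { print s }'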

Thanks,
+Vinod

On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:

>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd like to have only one result. Is it possible? Or should I run one more job to merge the result? And is it the same with non-streaming jobs? Below you see, I have 5 results for mapred.reduce.tasks=5.
> 
> $ hadoop jar /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar  -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete: job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
> 
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
> 
> Thanks for any answer,
>  Pavel Hančar


Re: more reduce tasks

Posted by Robert Dyer <ps...@gmail.com>.
You could create a CustomOutputCommitter and in the commitJob() method
simply read in the part-* files and write them out into a single aggregated
file.

This requires making a CustomOutputFormat class that uses the
CustomOutputCommitter and then setting that
via job.setOutputFormatClass(CustomOutputFormat.class).

See these classes:


http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputCommitter.html

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class)

- Robert

On Thu, Jan 3, 2013 at 3:11 PM, Pavel Hančar <pa...@gmail.com> wrote:

>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the result? And is it the same with non-streaming jobs? Below you
> see, I have 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
