Posted to mapreduce-user@hadoop.apache.org by Vanja Komadinovic <va...@gmail.com> on 2011/08/02 00:21:08 UTC

MultipleOutputs support

Hi all,

I'm trying to create M/R tasks that will output more than one "type" of data. The ideal thing would be the MultipleOutputs feature of MapReduce, but in our current production version, CDH3 (0.20.2), this support is broken.

So I tried to simulate MultipleOutputs. In the Reducer setup I open an HDFS output stream, write to it during the reduce calls, and close it in cleanup. The output file names include the task attempt id. This works great. Speculative execution is disabled, but sometimes a reduce task fails and I end up with two files for the same data from one reducer. Is there any way to find out which task attempts were successful, so I can delete the unneeded data after a successful job? I'm using the new MapReduce API. Or is there a better way to achieve this?
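Roughly, what I do looks like the simplified sketch below (the path and record format here are just placeholders, not our real ones):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SideOutputReducer extends Reducer<Text, Text, Text, Text> {

  private FSDataOutputStream sideOut;

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // File name carries the task attempt id so attempts do not collide.
    Path sidePath = new Path("/output/side/" + context.getTaskAttemptID() + ".dat");
    sideOut = fs.create(sidePath);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value);                     // primary output
      sideOut.writeBytes(key + "\t" + value + "\n"); // second "type" of output
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    sideOut.close();
  }
}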

Best,
Vanja

Komadinovic Vanja
+381 (64) 296 03 43
vanjakom@gmail.com



Re: MultipleOutputs support

Posted by Harsh J <ha...@cloudera.com>.
Vanja,

On Thu, Aug 4, 2011 at 8:45 PM, Vanja Komadinovic <va...@gmail.com> wrote:
> Thanks Harsh,
>
> I solved my problem with the FAQ entry you gave me.

Glad to know things are resolved!

> Regarding MultipleOutputs, I thought it did not work with the new API on 0.20, but later found that this is fixed in the CDH distribution. Until all our production clusters are switched to CDH3 I have to write to multiple files manually.

Porting to the new-API MultipleOutputs later shouldn't be that hard :)

-- 
Harsh J

Re: MultipleOutputs support

Posted by Vanja Komadinovic <va...@gmail.com>.
Thanks Harsh,

I solved my problem with the FAQ entry you gave me.

Regarding MultipleOutputs, I thought it did not work with the new API on 0.20, but later found that this is fixed in the CDH distribution. Until all our production clusters are switched to CDH3 I have to write to multiple files manually.

Thanks once more.

Best,
Vanja

Komadinovic Vanja
+381 (64) 296 03 43
vanjakom@gmail.com


On Aug 2, 2011, at 07:30, Harsh J wrote:

> Hello Vanja,
> 
> The CDH report is best submitted to cdh-user@cloudera.org, where
> action can then be taken. It would help if you could describe your
> new-API MultipleOutputs issue as well!
> 
> Regarding your general multiple outputs in output directory issue,
> check this FAQ to get a full understanding of how task committing can
> help: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
> 
> Basically, you have to write to the task attempt's working directory.
> That way even speculative attempts are fine: only successfully
> completed (and 'committed', as in a DB) tasks get to push their
> final output once the job completes, and the rest are cleaned away
> from the temporary working directory.
> 
> Hope this helps! That said, I consider MultipleOutputs to be painless
> and the right way to do multiple outputs in MR right now, so I would
> like to help see if we can fix those issues up as well.
> 
> On Tue, Aug 2, 2011 at 3:51 AM, Vanja Komadinovic <va...@gmail.com> wrote:
>> Hi all,
>> 
>> I'm trying to create M/R tasks that will output more than one "type" of data. The ideal thing would be the MultipleOutputs feature of MapReduce, but in our current production version, CDH3 (0.20.2), this support is broken.
>> 
>> So I tried to simulate MultipleOutputs. In the Reducer setup I open an HDFS output stream, write to it during the reduce calls, and close it in cleanup. The output file names include the task attempt id. This works great. Speculative execution is disabled, but sometimes a reduce task fails and I end up with two files for the same data from one reducer. Is there any way to find out which task attempts were successful, so I can delete the unneeded data after a successful job? I'm using the new MapReduce API. Or is there a better way to achieve this?
>> 
>> Best,
>> Vanja
>> 
>> Komadinovic Vanja
>> +381 (64) 296 03 43
>> vanjakom@gmail.com
>> 
>> 
>> 
> 
> 
> 
> -- 
> Harsh J


Re: MultipleOutputs support

Posted by Harsh J <ha...@cloudera.com>.
Hello Vanja,

The CDH report is best submitted to cdh-user@cloudera.org, where
action can then be taken. It would help if you could describe your
new-API MultipleOutputs issue as well!

Regarding your general multiple outputs in output directory issue,
check this FAQ to get a full understanding of how task committing can
help: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

Basically, you have to write to the task attempt's working directory.
That way even speculative attempts are fine: only successfully
completed (and 'committed', as in a DB) tasks get to push their
final output once the job completes, and the rest are cleaned away
from the temporary working directory.
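With the new API, that amounts to something like the sketch below
(the file name is just an example):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommittedSideOutputReducer extends Reducer<Text, Text, Text, Text> {

  private FSDataOutputStream sideOut;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // getWorkOutputPath() is the attempt's temporary working directory; the
    // committer promotes its contents only if this attempt succeeds.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    Path sidePath = new Path(workDir, "side-" + context.getTaskAttemptID());
    sideOut = sidePath.getFileSystem(context.getConfiguration()).create(sidePath);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      sideOut.writeBytes(key + "\t" + value + "\n");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    sideOut.close();
  }
}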

Hope this helps! That said, I consider MultipleOutputs to be painless
and the right way to do multiple outputs in MR right now, so I would
like to help see if we can fix those issues up as well.
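For reference, new-API MultipleOutputs usage is roughly the sketch
below (the named output "side" is just an example name):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// In the driver, register the extra output once, e.g.:
//   MultipleOutputs.addNamedOutput(job, "side",
//       TextOutputFormat.class, Text.class, Text.class);
public class MultiOutputReducer extends Reducer<Text, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value);     // default job output
      mos.write("side", key, value); // extra output type
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}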

On Tue, Aug 2, 2011 at 3:51 AM, Vanja Komadinovic <va...@gmail.com> wrote:
> Hi all,
>
> I'm trying to create M/R tasks that will output more than one "type" of data. The ideal thing would be the MultipleOutputs feature of MapReduce, but in our current production version, CDH3 (0.20.2), this support is broken.
>
> So I tried to simulate MultipleOutputs. In the Reducer setup I open an HDFS output stream, write to it during the reduce calls, and close it in cleanup. The output file names include the task attempt id. This works great. Speculative execution is disabled, but sometimes a reduce task fails and I end up with two files for the same data from one reducer. Is there any way to find out which task attempts were successful, so I can delete the unneeded data after a successful job? I'm using the new MapReduce API. Or is there a better way to achieve this?
>
> Best,
> Vanja
>
> Komadinovic Vanja
> +381 (64) 296 03 43
> vanjakom@gmail.com
>
>
>



-- 
Harsh J