Posted to mapreduce-user@hadoop.apache.org by Stanley Xu <we...@gmail.com> on 2011/05/03 08:09:16 UTC

Is there any way I could keep both the Mapper and Reducer output in hdfs?

Dear all,

We have a task that runs a map-reduce job multiple times to do some machine
learning calculations. We first use a mapper to update the data iteratively,
and then use the reducer to process the mapper's output and update a global
matrix. After that, we need to re-use the output of the previous mapper (as a
data source) and of the reducer (as a set of parameters) to run the
map-reduce again for another round of learning.

Is there any setting or API I could use to make Hadoop keep both the mapper
output and the reducer output? Right now it looks like, whenever a job
contains a reducer, the intermediate output generated by the mapper gets
deleted.

Thanks.
Stanley Xu

Re: Is there any way I could keep both the Mapper and Reducer output in hdfs?

Posted by Stanley Xu <we...@gmail.com>.
Thanks Jason, will take a look to try MultipleOutputs for mapper.

Best wishes,
Stanley Xu



On Tue, May 3, 2011 at 11:25 PM, Jason <ur...@gmail.com> wrote:

> It is actually trivial to do using MultipleOutputs. You just need to emit
> your key-values to both the MultipleOutputs instance and the standard output
> context/collector in your mapper.
>
> Two things you should know about MultipleOutputs:
> 1. The early implementation had a serious (a couple of orders of magnitude)
> performance bug.
> 2. Output files are not created for empty output data.
>
>
>
> On May 2, 2011, at 11:09 PM, Stanley Xu <we...@gmail.com> wrote:
>
> > Dear all,
> >
> > We have a task that runs a map-reduce job multiple times to do some machine
> > learning calculations. We first use a mapper to update the data iteratively,
> > and then use the reducer to process the mapper's output and update a global
> > matrix. After that, we need to re-use the output of the previous mapper (as a
> > data source) and of the reducer (as a set of parameters) to run the
> > map-reduce again for another round of learning.
> >
> > Is there any setting or API I could use to make Hadoop keep both the mapper
> > output and the reducer output? Right now it looks like, whenever a job
> > contains a reducer, the intermediate output generated by the mapper gets
> > deleted.
> >
> > Thanks.
> > Stanley Xu
> >
>

Re: Is there any way I could keep both the Mapper and Reducer output in hdfs?

Posted by Jason <ur...@gmail.com>.
It is actually trivial to do using MultipleOutputs. You just need to emit your key-values to both the MultipleOutputs instance and the standard output context/collector in your mapper.

Two things you should know about MultipleOutputs:
1. The early implementation had a serious (a couple of orders of magnitude) performance bug.
2. Output files are not created for empty output data.
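
Roughly something like this in the mapper (untested sketch using the new-API MultipleOutputs; the Text/Text types, class name and the "mapside" name are just placeholders for your own):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class KeepMapOutputMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // placeholder for whatever per-record update your learning step does
    Text outKey = new Text(Long.toString(key.get()));
    Text outValue = value;

    // normal path: goes through the shuffle to the reducer as usual
    context.write(outKey, outValue);

    // side path: the same record is also written under the job output
    // directory, so the map output survives the job
    mos.write("mapside", outKey, outValue, "mapside/part");
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();  // flush and commit the side files
  }
}

The named output has to be registered in the driver, e.g.:

MultipleOutputs.addNamedOutput(job, "mapside",
    TextOutputFormat.class, Text.class, Text.class);

With the baseOutputPath above, the map-side files end up in a mapside/ subdirectory of the job output directory, next to the reducer's part-r-* files, so the next iteration can read both.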



On May 2, 2011, at 11:09 PM, Stanley Xu <we...@gmail.com> wrote:

> Dear all,
> 
> We have a task that runs a map-reduce job multiple times to do some machine learning calculations. We first use a mapper to update the data iteratively, and then use the reducer to process the mapper's output and update a global matrix. After that, we need to re-use the output of the previous mapper (as a data source) and of the reducer (as a set of parameters) to run the map-reduce again for another round of learning.
> 
> Is there any setting or API I could use to make Hadoop keep both the mapper output and the reducer output? Right now it looks like, whenever a job contains a reducer, the intermediate output generated by the mapper gets deleted.
> 
> Thanks.
> Stanley Xu
> 

Re: Is there any way I could keep both the Mapper and Reducer output in hdfs?

Posted by Stanley Xu <we...@gmail.com>.
But that would make us read the same data twice, which would be a waste of
I/O for large data sets.

Thanks.

Best wishes,
Stanley Xu



On Tue, May 3, 2011 at 4:09 PM, Bai, Gang <de...@baigang.net> wrote:

> IMHO, it would be better to separate your mapper and reducer into different jobs.

Re: Is there any way I could keep both the Mapper and Reducer output in hdfs?

Posted by "Bai, Gang" <de...@baigang.net>.
IMHO, it would be better to separate your mapper and reducer into different
jobs.
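
Something along these lines, roughly (untested sketch; the placeholder mapper/reducer classes, the paths and the Text/Text types are just for illustration and should be replaced with your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input   = new Path(args[0]);
    Path mapped  = new Path(args[1], "mapped");   // mapper output, kept for the next round
    Path reduced = new Path(args[1], "reduced");  // reducer output (the global matrix)

    // Stage 1: map-only job, so the map output is written straight to HDFS.
    Job mapJob = new Job(conf, "learning-map");
    mapJob.setJarByClass(TwoStageDriver.class);
    mapJob.setMapperClass(Mapper.class);          // placeholder: your learning mapper,
                                                  // assumed to emit Text keys and Text values
    mapJob.setNumReduceTasks(0);                  // no reduce phase at all
    mapJob.setOutputKeyClass(Text.class);
    mapJob.setOutputValueClass(Text.class);
    mapJob.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(mapJob, input);
    FileOutputFormat.setOutputPath(mapJob, mapped);
    if (!mapJob.waitForCompletion(true)) {
      System.exit(1);
    }

    // Stage 2: identity map over the stage-1 output, then the real reducer.
    Job reduceJob = new Job(conf, "learning-reduce");
    reduceJob.setJarByClass(TwoStageDriver.class);
    reduceJob.setMapperClass(Mapper.class);       // identity mapper, just passes records on
    reduceJob.setReducerClass(Reducer.class);     // placeholder: your matrix-updating reducer
    reduceJob.setOutputKeyClass(Text.class);
    reduceJob.setOutputValueClass(Text.class);
    reduceJob.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(reduceJob, mapped);
    FileOutputFormat.setOutputPath(reduceJob, reduced);
    System.exit(reduceJob.waitForCompletion(true) ? 0 : 1);
  }
}

The next round can then take the "mapped" directory as its data source and read the "reduced" directory for the updated parameters. The trade-off is that the data is read twice per round.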

Regards,
BaiGang

On Tue, May 3, 2011 at 2:09 PM, Stanley Xu <we...@gmail.com> wrote:

> Dear all,
>
> We have a task that runs a map-reduce job multiple times to do some machine
> learning calculations. We first use a mapper to update the data iteratively,
> and then use the reducer to process the mapper's output and update a global
> matrix. After that, we need to re-use the output of the previous mapper (as a
> data source) and of the reducer (as a set of parameters) to run the
> map-reduce again for another round of learning.
>
> Is there any setting or API I could use to make Hadoop keep both the mapper
> output and the reducer output? Right now it looks like, whenever a job
> contains a reducer, the intermediate output generated by the mapper gets
> deleted.
>
> Thanks.
> Stanley Xu
>
>