Posted to common-dev@hadoop.apache.org by Nan Zhu <zh...@gmail.com> on 2010/09/27 07:35:59 UTC

Available of Intermediate data generated by mappers

Hi all,

I'm not sure which mailing list this question belongs on, so I apologize for
any inconvenience.

I'm interested in how Hadoop currently handles the loss of intermediate data
generated by map tasks. Some papers suggest that, when the data needed by
reducers is lost, we should compare the cost of redoing the map task with the
cost of replicating its output. If redoing the task costs more, we can keep
additional replicas of the intermediate data so that reducers can still access
it; otherwise, we simply redo the corresponding map task when we detect the
loss.
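
A minimal sketch of the comparison described above, under a simple cost model.
The class, method, and parameter names are hypothetical and are not part of
Hadoop. Weighting the rerun cost by a loss probability is also an assumption
made here for illustration, on the reasoning that replication is paid for
every map output while re-execution is paid only when an output is lost.

```java
// Hypothetical cost model for illustration; none of these names exist in Hadoop.
final class IntermediateDataPolicy {

    /**
     * Returns true when replication is expected to be cheaper than re-execution.
     * Assumption: replication is paid up front for every map output, while
     * re-execution is paid only when the output is actually lost.
     */
    static boolean shouldReplicate(double lossProbability,
                                   double rerunCostSeconds,
                                   double replicationCostSeconds) {
        if (lossProbability < 0.0 || lossProbability > 1.0) {
            throw new IllegalArgumentException("lossProbability must be in [0, 1]");
        }
        double expectedRerunCost = lossProbability * rerunCostSeconds;
        return expectedRerunCost > replicationCostSeconds;
    }

    public static void main(String[] args) {
        // Long-running map task: expected rerun cost 0.05 * 3600 s exceeds 30 s.
        System.out.println(shouldReplicate(0.05, 3600.0, 30.0)); // true
        // Short map task: expected rerun cost 0.05 * 60 s is below 30 s.
        System.out.println(shouldReplicate(0.05, 60.0, 30.0));   // false
    }
}
```

With these made-up numbers, only the long-running map task would have its
output replicated; the short one would simply be redone on loss.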

I'm not sure which strategy Hadoop currently adopts, and I haven't found the
code for this. Can anyone give me some suggestions?

Thank you

Nan

Re: Available of Intermediate data generated by mappers

Posted by Nan Zhu <zh...@gmail.com>.
Yes, I finally found the corresponding code.

It's in TaskTracker.MapOutputServlet:
doGet()->sendMapFile()->TaskTracker.mapOutputLost()

It's true that Hadoop uses the redo strategy to solve this problem, but some
papers indicate that we could also replicate the intermediate results to make
them fault-tolerant.
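
To illustrate the redo strategy, here is a toy model, not the actual
TaskTracker code. The method name mapOutputLost mirrors the one mentioned
above, but the class and everything inside it are hypothetical: when a fetch
of a map output fails, the output is forgotten and the map is queued to run
again.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of redo-on-loss; all names are hypothetical, not Hadoop's API.
final class RedoOnLossScheduler {
    private final Set<String> availableOutputs = new HashSet<>();
    private final Queue<String> pendingMaps = new ArrayDeque<>();

    /** Records that a map task finished and its output can be fetched. */
    void mapCompleted(String mapId) {
        availableOutputs.add(mapId);
    }

    /** Called when a reducer cannot fetch an output: queue the map again. */
    void mapOutputLost(String mapId) {
        if (availableOutputs.remove(mapId)) {
            pendingMaps.add(mapId);
        }
    }

    /** Re-runs every queued map and returns how many were re-executed. */
    int rerunPending() {
        int rerun = 0;
        String mapId;
        while ((mapId = pendingMaps.poll()) != null) {
            mapCompleted(mapId); // stands in for actually re-running the task
            rerun++;
        }
        return rerun;
    }

    boolean isAvailable(String mapId) {
        return availableOutputs.contains(mapId);
    }

    public static void main(String[] args) {
        RedoOnLossScheduler scheduler = new RedoOnLossScheduler();
        scheduler.mapCompleted("map_0001");
        scheduler.mapOutputLost("map_0001");
        System.out.println(scheduler.isAvailable("map_0001")); // false
        System.out.println(scheduler.rerunPending());          // 1
        System.out.println(scheduler.isAvailable("map_0001")); // true
    }
}
```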

Thank you very much

Nan

On Wed, Oct 13, 2010 at 4:07 PM, newpant <ne...@gmail.com> wrote:

> Hi, according to Hadoop: The Definitive Guide, a map task first stores its
> intermediate output in an in-memory buffer and then spills it to the local
> disk directories configured by mapred.local.dir. So, as far as I know, if
> the intermediate data is lost, only redoing the task can fix it.
>
> If I'm wrong, please correct me.

Re: Available of Intermediate data generated by mappers

Posted by newpant <ne...@gmail.com>.
Hi, according to Hadoop: The Definitive Guide, a map task first stores its
intermediate output in an in-memory buffer and then spills it to the local
disk directories configured by mapred.local.dir. So, as far as I know, if the
intermediate data is lost, only redoing the task can fix it.

If I'm wrong, please correct me.
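
The buffer-then-spill behavior described above can be sketched roughly as
follows. This is a simplified illustration, not Hadoop's implementation: the
class name, the record-count threshold, and the spill file naming are all
hypothetical. It does show why the data lives only on one node's local disk,
so losing that disk leaves re-execution as the only recovery path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified illustration; names and thresholds are hypothetical.
final class SpillingBuffer {
    private final Path localDir;    // stands in for a mapred.local.dir entry
    private final int maxRecords;   // spill once this many records are buffered
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> spills = new ArrayList<>();

    SpillingBuffer(Path localDir, int maxRecords) {
        this.localDir = localDir;
        this.maxRecords = maxRecords;
    }

    /** Buffers one record in memory and spills when the buffer is full. */
    void collect(String record) {
        buffer.add(record);
        if (buffer.size() >= maxRecords) {
            spill();
        }
    }

    /** Sorts the buffered records and writes them to a new local file. */
    void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        Collections.sort(buffer);
        Path file = localDir.resolve("spill" + spills.size() + ".out");
        try {
            Files.write(file, buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spills.add(file);
        buffer.clear();
    }

    List<Path> spillFiles() {
        return spills;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("map-local");
        SpillingBuffer out = new SpillingBuffer(dir, 2);
        out.collect("b");
        out.collect("a"); // buffer is full here, so the first spill is written
        out.collect("c");
        out.spill();      // flush the remainder, as at the end of a map task
        System.out.println(out.spillFiles().size()); // 2
    }
}
```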
