Posted to common-dev@hadoop.apache.org by newpant <ne...@gmail.com> on 2010/10/13 10:07:23 UTC

Re: Available of Intermediate data generated by mappers

Hi, according to Hadoop: The Definitive Guide, a map task first writes its
intermediate output to an in-memory buffer and then spills it to the local
disk directories configured by mapred.local.dir. So, as far as I know, if the
intermediate data is lost, the only fix is to redo the map task.

If I am wrong, please correct me.

2010/9/27 Nan Zhu <zh...@gmail.com>

> Hi all,
>
> I'm not sure which mailing list I should send my question to; sorry for any
> inconvenience.
>
> I'm interested in how Hadoop currently handles the loss of intermediate data
> generated by map tasks. Some papers suggest that when the data needed by
> reducers is lost, we should compare the cost of redoing the task with the
> cost of replicating the data. If redoing the task costs more, we can keep
> more replicas of the intermediate data generated by the map tasks to ensure
> that reducers can access it; otherwise, we just redo the corresponding map
> task when we detect the loss.
>
> I'm not sure which strategy Hadoop currently adopts, and I haven't found the
> code for this. Can anyone give me some suggestions?
>
> Thank you
>
> Nan
>
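
The trade-off described in the question can be sketched as a simple
expected-cost comparison. This is only an illustration, not Hadoop code: the
class IntermediateDataPolicy, its parameters, and the cost model (probability
of loss times re-execution time, compared with the time to write extra
replicas up front) are all assumptions made for the example.

```java
// Hypothetical sketch; none of these names exist in Hadoop.
final class IntermediateDataPolicy {

    enum Strategy { REDO_ON_LOSS, REPLICATE }

    /**
     * Picks a strategy for one map task's intermediate output.
     *
     * @param lossProbability  assumed chance (0..1) that the output is lost
     *                         before every reducer has fetched it
     * @param redoSeconds      assumed time to re-run the map task
     * @param replicateSeconds assumed time to write the extra replicas
     */
    static Strategy choose(double lossProbability,
                           double redoSeconds,
                           double replicateSeconds) {
        if (lossProbability < 0.0 || lossProbability > 1.0) {
            throw new IllegalArgumentException("lossProbability must be in [0, 1]");
        }
        // Replication is paid always; re-execution is paid only on loss.
        double expectedRedoCost = lossProbability * redoSeconds;
        return expectedRedoCost > replicateSeconds
                ? Strategy.REPLICATE
                : Strategy.REDO_ON_LOSS;
    }

    public static void main(String[] args) {
        // Cheap map task, rare loss: 0.01 * 600 = 6 s expected, below 30 s.
        System.out.println(choose(0.01, 600, 30));   // REDO_ON_LOSS
        // Expensive map task, frequent loss: 0.20 * 3600 = 720 s, above 60 s.
        System.out.println(choose(0.20, 3600, 60));  // REPLICATE
    }
}
```

Under this model, replication only pays off when map tasks are expensive to
re-run or node failures are frequent; the numbers above are made up to show
both outcomes.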

Re: Available of Intermediate data generated by mappers

Posted by Nan Zhu <zh...@gmail.com>.

Yes, I finally found the corresponding code.

It's in TaskTracker.MapOutputServlet:
doGet() -> sendMapFile() -> TaskTracker.mapOutputLost()

It's true that Hadoop uses the redo strategy to solve this problem, but some
papers indicate that we can also replicate the intermediate results to make
them fault-tolerant.
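
The redo strategy can be illustrated with a toy model: when a fetch finds
that a map output is missing, the task is reported as lost and queued for
re-execution. This is a simplified sketch and not the TaskTracker code; the
class RedoOnLossSketch and all of its methods are hypothetical names invented
for the example.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;

// Hypothetical toy model of "redo on loss"; not Hadoop's implementation.
final class RedoOnLossSketch {
    // Map task id -> intermediate output held on the (simulated) local disk.
    private final Map<String, String> mapOutputs = new HashMap<>();
    // Map tasks whose output was found missing and must be re-run.
    private final Queue<String> rerunQueue = new ArrayDeque<>();

    void completeMap(String taskId, String output) {
        mapOutputs.put(taskId, output);
    }

    // Simulates a disk or node failure that destroys one output.
    void loseOutput(String taskId) {
        mapOutputs.remove(taskId);
    }

    // A reducer's fetch: a missing output is reported and queued for redo.
    Optional<String> fetch(String taskId) {
        String output = mapOutputs.get(taskId);
        if (output == null) {
            if (!rerunQueue.contains(taskId)) {
                rerunQueue.add(taskId);
            }
            return Optional.empty();
        }
        return Optional.of(output);
    }

    // Re-executes every queued map task and returns how many were re-run.
    int rerunLostMaps() {
        int rerun = 0;
        String taskId;
        while ((taskId = rerunQueue.poll()) != null) {
            completeMap(taskId, "output-of-" + taskId);
            rerun++;
        }
        return rerun;
    }

    public static void main(String[] args) {
        RedoOnLossSketch tracker = new RedoOnLossSketch();
        tracker.completeMap("map_0001", "output-of-map_0001");
        tracker.loseOutput("map_0001");
        System.out.println(tracker.fetch("map_0001").isPresent()); // false
        System.out.println(tracker.rerunLostMaps());               // 1
        System.out.println(tracker.fetch("map_0001").isPresent()); // true
    }
}
```

A replication-based design would instead keep a second copy in completeMap()
and fall back to it in fetch(), trading extra write cost for not having to
re-run the map.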

Thank you very much

Nan
