Posted to hdfs-user@hadoop.apache.org by Rahul Bhattacharjee <ra...@gmail.com> on 2013/05/29 16:40:30 UTC

Reduce side question on MR

Hi,

I have one question related to the reduce phase of MR jobs.

The intermediate outputs of the map tasks are pulled from the nodes that
ran the map tasks to the node where the reducer is going to run, and that
intermediate data is written to the reducer's local fs. My question is
about a job that processes a huge amount of data and has multiple mappers
but only one reducer: it's possible that such a job would never complete
successfully, as the single host's disk might not be sufficient to hold
all the map outputs of the job.

The job essentially would fail after retrying the configured number of
attempts.

Thanks,
Rahul
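
The arithmetic behind the concern above can be sketched in a few lines of
Java. This is only a back-of-the-envelope illustration, not Hadoop code: the
class and method names are hypothetical, and it assumes keys are spread
evenly across reducers while ignoring compression, merge overhead, and the
possibility that several reducers land on the same host.

```java
/** Hypothetical helper, not part of Hadoop: rough per-reducer shuffle sizing. */
final class ReduceInputEstimate {

    /** Bytes each reducer would pull, assuming an even key distribution. */
    static long perReducerBytes(long totalMapOutputBytes, int numReducers) {
        if (totalMapOutputBytes < 0 || numReducers <= 0) {
            throw new IllegalArgumentException("need non-negative bytes and at least one reducer");
        }
        // Ceiling division without risking overflow on very large inputs.
        long share = totalMapOutputBytes / numReducers;
        return (totalMapOutputBytes % numReducers == 0) ? share : share + 1;
    }

    /** Smallest reducer count whose per-reducer share fits in the given free space. */
    static long minReducersFor(long totalMapOutputBytes, long freeBytesPerNode) {
        if (totalMapOutputBytes < 0 || freeBytesPerNode <= 0) {
            throw new IllegalArgumentException("need non-negative bytes and positive free space");
        }
        long whole = totalMapOutputBytes / freeBytesPerNode;
        long needed = (totalMapOutputBytes % freeBytesPerNode == 0) ? whole : whole + 1;
        return Math.max(1, needed);
    }

    public static void main(String[] args) {
        long tib = 1L << 40;          // 1 TiB in bytes
        long mapOutput = 5 * tib;     // made-up total map output
        long freePerNode = tib;       // made-up free local disk per node

        System.out.println("one reducer must hold: " + perReducerBytes(mapOutput, 1) + " bytes");
        System.out.println("reducers needed to fit: " + minReducersFor(mapOutput, freePerNode));
    }
}
```

With a single reducer the per-reducer share equals the whole map output, so
the job can only finish if one host has that much free local disk.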

Re: Reduce side question on MR

Posted by Harsh J <ha...@cloudera.com>.

I don't see a direct question asked, but here's a condition in the
source code you want to take a look at (*):
https://github.com/apache/hadoop-common/blob/branch-1/src/mapred/org/apache/hadoop/mapred/JobInProgress.java#L2316

(*) - Yet to appear in MRv2 - See/help out with MAPREDUCE-2723.

On Wed, May 29, 2013 at 8:10 PM, Rahul Bhattacharjee
<ra...@gmail.com> wrote:
> Hi,
>
> I have one question related to the reduce phase of MR jobs.
>
> The intermediate outputs of the map tasks are pulled from the nodes that ran
> the map tasks to the node where the reducer is going to run, and that
> intermediate data is written to the reducer's local fs. My question is about
> a job that processes a huge amount of data and has multiple mappers but only
> one reducer: it's possible that such a job would never complete successfully,
> as the single host's disk might not be sufficient to hold all the map outputs
> of the job.
>
> The job essentially would fail after retrying the configured number of
> attempts.
>
> Thanks,
> Rahul



--
Harsh J
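
The thread does not quote the condition Harsh links to, so the following is
only a guess at the general shape of a scheduling-time check of this kind:
compare an estimate of the reduce input size against the free space each
node reports, and skip nodes that cannot hold it. Every name here
(ReducePlacementSketch, hostsWithRoom, the node names and sizes) is made up
for illustration and is not taken from JobInProgress.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Hadoop code: filter nodes by free local disk. */
final class ReducePlacementSketch {

    /** Returns the hosts whose reported free space can hold the estimated reduce input. */
    static List<String> hostsWithRoom(Map<String, Long> freeBytesByHost,
                                      long estimatedReduceInputBytes) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Long> entry : freeBytesByHost.entrySet()) {
            // A node qualifies only if its free space covers the whole estimate.
            if (entry.getValue() >= estimatedReduceInputBytes) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        Map<String, Long> free = new LinkedHashMap<>();
        free.put("node-a", 200 * gib);   // made-up free space
        free.put("node-b", 2048 * gib);  // made-up free space

        long estimate = 1024 * gib;      // made-up reduce input estimate (1 TiB)
        System.out.println(hostsWithRoom(free, estimate)); // prints [node-b]
    }
}
```

If the returned list is empty, no node can host the reducer, which is the
situation described in the original post.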

Re: Reduce side question on MR

Posted by Rahul Bhattacharjee <ra...@gmail.com>.

Thanks Harsh for the response. It very much answers what I was looking for.

Regards,
Rahul


On Wed, May 29, 2013 at 8:10 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Hi,
>
> I have one question related to the reduce phase of MR jobs.
>
> The intermediate outputs of the map tasks are pulled from the nodes that
> ran the map tasks to the node where the reducer is going to run, and that
> intermediate data is written to the reducer's local fs. My question is
> about a job that processes a huge amount of data and has multiple mappers
> but only one reducer: it's possible that such a job would never complete
> successfully, as the single host's disk might not be sufficient to hold
> all the map outputs of the job.
>
> The job essentially would fail after retrying the configured number of
> attempts.
>
> Thanks,
> Rahul
>
