Posted to common-user@hadoop.apache.org by Song Liu <la...@gmail.com> on 2010/02/17 00:51:10 UTC

Reducer stuck at pending state

Hi all, I have recently met a problem where, sometimes, the reducer hangs at
the pending state, at 0% complete.

It seems all the mappers are completely done, but just when the reducer is
about to start, it gets stuck and stays in the pending state, without any
warnings or errors.

I have a cluster with 12 nodes, but this situation only appears when the
input data is large (2 GB or more); smaller jobs never hit this problem.

Has anyone met this issue before? I searched JIRA; someone reported this
issue before, but no solution was given. (
https://issues.apache.org/jira/browse/MAPREDUCE-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647230#action_12647230
)

The typical case of this issue is captured in the attachment.

Regards
Song Liu

Re: Reducer stuck at pending state

Posted by Song Liu <la...@gmail.com>.
Hi Todd, I'm using Hadoop 0.20.1, Apache distribution.
I didn't set the property you mentioned, so I think it should remain at the
default (1G?).
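For reference, the property Todd asked about lives in mapred-site.xml; a minimal sketch of how it would be set (the value shown is illustrative, not a recommendation; in the 0.20 line the shipped default is -Xmx200m, so an untouched default is likely smaller than 1G):

```xml
<!-- mapred-site.xml: JVM options passed to each map/reduce child task -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```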

The cluster I'm playing with physically has four master nodes and 96 slave
nodes. Hadoop uses one master node for the namenode and jobtracker, and picks
12 of the slaves for its datanodes and tasktrackers.

Interestingly, I noticed the hardware specification is a little different
between the master and slave machines, so I moved the namenode and jobtracker
to one of the slaves. The problem seems solved (my program runs normally SO
FAR).

However, I cannot find the concrete hardware configuration for each node; I
guess the differences are mainly in the CPUs or RAM.

These are copied from the cluster's specification manual:

Slaves:

"each with two 2.6 GHz dual-core opteron processors, 8 GB RAM, 16 GB swap
space and 50 GB of local scratch space"

Masters:

"each with four 2.6 GHz dual-core opteron processors, 32 GB RAM, 64 GB swap
space, 64 GB of local scratch space"

Can you see what the problem is?

Thanks a lot.
Regards
Song Liu

On Wed, Feb 17, 2010 at 4:18 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Song,
>
> What version are you running? How much memory have you allocated to
> the reducers in mapred.child.java.opts?
>
> -Todd

Re: Reducer stuck at pending state

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Song,

What version are you running? How much memory have you allocated to
the reducers in mapred.child.java.opts?

-Todd

On Tue, Feb 16, 2010 at 4:01 PM, Song Liu <la...@gmail.com> wrote:
> Sorry, it seems no attachment is allowed, so I'll paste it here:
>
> Jobid     Priority  User    Name    Map %    Map Total  Maps Done  Reduce %  Reduce Total  Reduces Done  Scheduling Info
> job_2...  NORMAL    sl9885  TF/IDF   100.00%  26         26         0.00%     1             0             NA
> job_2...  NORMAL    sl9885  Rank     100.00%  22         22         0.00%     1             0             NA
> job_2...  NORMAL    sl9885  TF/IDF   100.00%  20         20         0.00%     1             0             NA
>
> The format is horrible, sorry for that, but it's the best I can do :(
>
> BTW, I guess it should not be my program's problem, since I have tested it
> on some other clusters before.
>
> Regards
> Song Liu

Re: Reducer stuck at pending state

Posted by Song Liu <la...@gmail.com>.
Sorry, it seems no attachment is allowed, so I'll paste it here:

Jobid     Priority  User    Name    Map %    Map Total  Maps Done  Reduce %  Reduce Total  Reduces Done  Scheduling Info
job_2...  NORMAL    sl9885  TF/IDF   100.00%  26         26         0.00%     1             0             NA
job_2...  NORMAL    sl9885  Rank     100.00%  22         22         0.00%     1             0             NA
job_2...  NORMAL    sl9885  TF/IDF   100.00%  20         20         0.00%     1             0             NA

The format is horrible, sorry for that, but it's the best I can do :(
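To make the symptom in those rows explicit, here is a hypothetical Python sketch (the job data is copied from the listing above; the helper name is mine) that flags jobs whose maps are all complete while no reduce has made any progress:

```python
# Hypothetical sketch: the three jobs pasted above, re-checked programmatically.
# Each tuple: (name, maps_total, maps_completed, reduce_pct, reduces_completed)
jobs = [
    ("TF/IDF", 26, 26, 0.00, 0),
    ("Rank",   22, 22, 0.00, 0),
    ("TF/IDF", 20, 20, 0.00, 0),
]

def is_stuck(maps_total, maps_completed, reduce_pct, reduces_completed):
    """A job looks stuck when every map is done but no reduce has made progress."""
    return (maps_completed == maps_total
            and reduce_pct == 0.0
            and reduces_completed == 0)

stuck = [name for (name, mt, mc, rp, rc) in jobs if is_stuck(mt, mc, rp, rc)]
print(stuck)  # all three jobs show the same symptom
```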

BTW, I guess it should not be my program's problem, since I have tested it
on some other clusters before.

Regards
Song Liu
