Posted to common-user@hadoop.apache.org by Raghava Mutharaju <m....@gmail.com> on 2010/04/08 19:30:49 UTC

Reduce gets stuck at 99%

Hello all,

         I got the timeout error mentioned below -- after 600 seconds, that
attempt was killed and deemed a failure. I searched around about this error,
and one of the suggestions was to include "progress" statements in the
reducer -- it might be taking longer than 600 seconds and so is timing out. I
added calls to context.progress() and context.setStatus(str) in the reducer.
Now it works fine -- there are no timeout errors.
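
A minimal sketch of that workaround, assuming the new org.apache.hadoop.mapreduce
API; the class name, key/value types and the reporting interval of 1000 values are
placeholders, not the actual job:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ProgressReportingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    for (Text value : values) {
      // ... the long-running reduce logic goes here ...
      if (++count % 1000 == 0) {
        context.progress();  // tells the framework the task is still alive
        context.setStatus("processed " + count + " values for key " + key);
      }
    }
  }
}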

         But, for a few jobs, it takes an awfully long time to move from "Map
100%, Reduce 99%" to Reduce 100%. For some jobs it is 15 minutes and for some
it was more than an hour. The reduce code is not complex -- a 2-level loop and
a couple of if-else blocks. The input size is also not huge: the job that gets
stuck at reduce 99% for an hour takes in about 130 input files. Some of them
are 1-3 MB in size and a couple of them are 16 MB.

         Has anyone encountered this problem before? Any pointers? I use
Hadoop 0.20.2 on a Linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.

On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaraghava@gmail.com
> wrote:

> Hi all,
>
>        I am running a series of jobs one after another. While executing the
> 4th job, the job fails. It fails in the reducer --- the progress percentage
> would be map 100%, reduce 99%. It gives out the following message
>
> 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> attempt_201003240138_0110_r_000018_1, Status : FAILED
> Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
> seconds. Killing!
>
> It makes several more attempts to execute it but fails with a similar
> message. I couldn't get anything from this error message and wanted to look
> at the logs (located in the default dir, ${HADOOP_HOME}/logs). But I don't
> find any files which match the timestamp of the job. Also I did not find
> history and userlogs in the logs folder. Should I look at some other place
> for the logs? What could be the possible causes for the above error?
>
>        I am using Hadoop 0.20.2 and I am running it on a cluster with 16
> nodes.
>
> Thank you.
>
> Regards,
> Raghava.
>

Re: Reduce gets stuck at 99%

Posted by prashant ullegaddi <pr...@gmail.com>.
Dear Raghava,

I also faced this problem. It mostly happens if the computation on the data
that the reducer received is taking more time and is not able to finish within
the default time-out of 600s. You can increase the time-out to ensure that all
reducers complete by setting the property "mapred.task.timeout".
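
A sketch of raising it per job from the driver, assuming the new API; the property
is specified in milliseconds (600000, i.e. 10 minutes, is the 0.20.x default) and
the 30-minute value below is only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);  // 30 minutes, in milliseconds
    Job job = new Job(conf, "example-job");
    // ... set mapper, reducer, input and output paths here, then:
    // job.waitForCompletion(true);
  }
}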


On Thu, Apr 8, 2010 at 11:57 PM, Eric Arenas <ea...@rocketmail.com> wrote:

> Yes Raghava,
>
> I have experience that issue before, and the solution that you mentioned
> also solved my issue (adding a context.progress or setcontext to tell the JT
> that my jobs are still running)
>
> regards
>  Eric Arenas
>
>
>
>
> ________________________________
> From: Raghava Mutharaju <m....@gmail.com>
> To: common-user@hadoop.apache.org; mapreduce-user@hadoop.apache.org
> Sent: Thu, April 8, 2010 10:30:49 AM
> Subject: Reduce gets struck at 99%
>
> Hello all,
>
>         I got the time out error as mentioned below -- after 600 seconds,
> that attempt was killed and the attempt would be deemed a failure. I
> searched around about this error, and one of the suggestions to include
> "progress" statements in the reducer -- it might be taking longer than 600
> seconds and so is timing out. I added calls to context.progress() and
> context.setStatus(str) in the reducer. Now, it works fine -- there are no
> timeout errors.
>
>         But, for a few jobs, it takes awfully long time to move from "Map
> 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for some it
> was more than an hour. The reduce code is not complex -- 2 level loop and
> couple of if-else blocks. The input size is also not huge, for the job that
> gets struck for an hour at reduce 99%, it would take in 130. Some of them
> are 1-3 MB in size and couple of them are 16MB in size.
>
>         Has anyone encountered this problem before? Any pointers? I use
> Hadoop 0.20.2 on a linux cluster of 16 nodes.
>
> Thank you.
>
> Regards,
> Raghava.
>
>
> On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com> wrote:
>
> Hi all,
> >
> >       I am running a series of jobs one after another. While executing
> the 4th job, the job fails. It fails in the reducer --- the progress
> percentage would be map 100%, reduce 99%. It gives out the following message
> >
> >10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> attempt_201003240138_0110_r_000018_1, Status : FAILED
> >Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
> seconds. Killing!
> >
> >It makes several attempts again to execute it but fails with similar
> message. I couldn't get anything from this error message and wanted to look
> at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't
> find any files which match the timestamp of the job. Also I did not find
> history and userlogs in the logs folder. Should I look at some other place
> for the logs? What could be the possible causes for the above error?
> >
> >       I am using Hadoop 0.20.2 and I am running it on a cluster with 16
> nodes.
> >
> >Thank you.
> >
> >Regards,
> >Raghava.
> >
>



-- 
Thanks and Regards,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, India.

Re: Reduce gets stuck at 99%

Posted by Eric Arenas <ea...@rocketmail.com>.
Yes Raghava,

I have experienced that issue before, and the solution that you mentioned also solved my issue (adding a context.progress() or context.setStatus() call to tell the JobTracker that my jobs are still running).

regards
 Eric Arenas




________________________________
From: Raghava Mutharaju <m....@gmail.com>
To: common-user@hadoop.apache.org; mapreduce-user@hadoop.apache.org
Sent: Thu, April 8, 2010 10:30:49 AM
Subject: Reduce gets struck at 99%

Hello all,

         I got the time out error as mentioned below -- after 600 seconds, that attempt was killed and the attempt would be deemed a failure. I searched around about this error, and one of the suggestions to include "progress" statements in the reducer -- it might be taking longer than 600 seconds and so is timing out. I added calls to context.progress() and context.setStatus(str) in the reducer. Now, it works fine -- there are no timeout errors.

         But, for a few jobs, it takes awfully long time to move from "Map 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for some it was more than an hour. The reduce code is not complex -- 2 level loop and couple of if-else blocks. The input size is also not huge, for the job that gets struck for an hour at reduce 99%, it would take in 130. Some of them are 1-3 MB in size and couple of them are 16MB in size. 

         Has anyone encountered this problem before? Any pointers? I use Hadoop 0.20.2 on a linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.


On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m....@gmail.com> wrote:

Hi all,
>
>       I am running a series of jobs one after another. While executing the 4th job, the job fails. It fails in the reducer --- the progress percentage would be map 100%, reduce 99%. It gives out the following message
>
>10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_000018_1, Status : FAILED 
>Task attempt_201003240138_0110_r_000018_1 failed to report status for 602 seconds. Killing!
>
>It makes several attempts again to execute it but fails with similar message. I couldn't get anything from this error message and wanted to look at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't find any files which match the timestamp of the job. Also I did not find history and userlogs in the logs folder. Should I look at some other place for the logs? What could be the possible causes for the above error?
>
>       I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.
>
>Thank you.
>
>Regards,
>Raghava.
>

Re: Reduce gets stuck at 99%

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi Ted,

        Thank you for all the suggestions. I went through the JobTracker
logs and have attached the exceptions found there. I found two exceptions:

1) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
complete write to file    (DFS Client)

2) org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
File does not exist. Holder DFSClient_attempt_201004060646_0057_r_000014_0
does not have any open files.


The exception occurs at the point of writing out <K,V> pairs in the reducer,
and it occurs only in certain task attempts. I am not using any custom
output format or record writer, but I do use a custom input reader.

What could have gone wrong here?

Thank you.

Regards,
Raghava.


On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yu...@gmail.com> wrote:

> Raghava:
> Are you able to share the last segment of reducer log ?
> You can get them from web UI:
>
> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
>
> Adding more log in your reducer task would help pinpoint where the issue
> is.
> Also look in job tracker log.
>
> Cheers
>
> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com
> > wrote:
>
> > Hi Ted,
> >
> >      Thank you for the suggestion. I enabled it using the Configuration
> > class because I cannot change hadoop-site.xml file (I am not an admin).
> The
> > situation is still the same --- it gets stuck at reduce 99% and does not
> > move further.
> >
> > Regards,
> > Raghava.
> >
> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > You need to turn on yourself (hadoop-site.xml):
> > > <property>
> > >  <name>mapred.reduce.tasks.speculative.execution</name>
> > >  <value>true</value>
> > > </property>
> > >
> > > <property>
> > >  <name>mapred.map.tasks.speculative.execution</name>
> > >  <value>true</value>
> > > </property>
> > >
> > >
> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
> > > m.vijayaraghava@gmail.com
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > >     Thank you Eric, Prashant and Greg. Although the timeout problem
> was
> > > > resolved, reduce is getting stuck at 99%. As of now, it has been
> stuck
> > > > there
> > > > for about 3 hrs. That is too high a wait time for my task. Do you
> guys
> > > see
> > > > any reason for this?
> > > >
> > > >      Speculative execution is "on" by default right? Or should I
> enable
> > > it?
> > > >
> > > > Regards,
> > > > Raghava.
> > > >
> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <
> gregl@yahoo-inc.com
> > > > >wrote:
> > > >
> > > > >  Hi,
> > > > >
> > > > > I have also experienced this problem. Have you tried speculative
> > > > execution?
> > > > > Also, I have had jobs that took a long time for one mapper /
> reducer
> > > > because
> > > > > of a record that was significantly larger than those contained in
> the
> > > > other
> > > > > filesplits. Do you know if it always slows down for the same
> > filesplit?
> > > > >
> > > > > Regards,
> > > > > Greg Lawrence
> > > > >
> > > > >
> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > >          I got the time out error as mentioned below -- after 600
> > > > seconds,
> > > > > that attempt was killed and the attempt would be deemed a failure.
> I
> > > > > searched around about this error, and one of the suggestions to
> > include
> > > > > "progress" statements in the reducer -- it might be taking longer
> > than
> > > > 600
> > > > > seconds and so is timing out. I added calls to context.progress()
> and
> > > > > context.setStatus(str) in the reducer. Now, it works fine -- there
> > are
> > > no
> > > > > timeout errors.
> > > > >
> > > > >          But, for a few jobs, it takes awfully long time to move
> from
> > > > "Map
> > > > > 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for
> > some
> > > > it
> > > > > was more than an hour. The reduce code is not complex -- 2 level
> loop
> > > and
> > > > > couple of if-else blocks. The input size is also not huge, for the
> > job
> > > > that
> > > > > gets struck for an hour at reduce 99%, it would take in 130. Some
> of
> > > them
> > > > > are 1-3 MB in size and couple of them are 16MB in size.
> > > > >
> > > > >          Has anyone encountered this problem before? Any pointers?
> I
> > > use
> > > > > Hadoop 0.20.2 on a linux cluster of 16 nodes.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Raghava.
> > > > >
> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> > > > > m.vijayaraghava@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > >        I am running a series of jobs one after another. While
> > executing
> > > > the
> > > > > 4th job, the job fails. It fails in the reducer --- the progress
> > > > percentage
> > > > > would be map 100%, reduce 99%. It gives out the following message
> > > > >
> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> > > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> > > > > Task attempt_201003240138_0110_r_000018_1 failed to report status
> for
> > > 602
> > > > > seconds. Killing!
> > > > >
> > > > > It makes several attempts again to execute it but fails with
> similar
> > > > > message. I couldn't get anything from this error message and wanted
> > to
> > > > look
> > > > > at logs (located in the default dir of ${HADOOP_HOME/logs}). But I
> > > don't
> > > > > find any files which match the timestamp of the job. Also I did not
> > > find
> > > > > history and userlogs in the logs folder. Should I look at some
> other
> > > > place
> > > > > for the logs? What could be the possible causes for the above
> error?
> > > > >
> > > > >        I am using Hadoop 0.20.2 and I am running it on a cluster
> with
> > > 16
> > > > > nodes.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Raghava.
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Reduce gets stuck at 99%

Posted by Ted Yu <yu...@gmail.com>.
Raghava:
Are you able to share the last segment of the reducer log?
You can get it from the web UI:
http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193

Adding more logging in your reducer task would help pinpoint where the issue is.
Also look in the JobTracker log.
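
The same log segment can also be pulled from the command line; the host and
taskid below are just the ones from the example URL above:

# Fetch the last ~8 KB of that reduce attempt's log from its TaskTracker (port 50060)
curl "http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193"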

Cheers

On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com
> wrote:

> Hi Ted,
>
>      Thank you for the suggestion. I enabled it using the Configuration
> class because I cannot change hadoop-site.xml file (I am not an admin). The
> situation is still the same --- it gets stuck at reduce 99% and does not
> move further.
>
> Regards,
> Raghava.
>
> On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > You need to turn on yourself (hadoop-site.xml):
> > <property>
> >  <name>mapred.reduce.tasks.speculative.execution</name>
> >  <value>true</value>
> > </property>
> >
> > <property>
> >  <name>mapred.map.tasks.speculative.execution</name>
> >  <value>true</value>
> > </property>
> >
> >
> > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
> > m.vijayaraghava@gmail.com
> > > wrote:
> >
> > > Hi,
> > >
> > >     Thank you Eric, Prashant and Greg. Although the timeout problem was
> > > resolved, reduce is getting stuck at 99%. As of now, it has been stuck
> > > there
> > > for about 3 hrs. That is too high a wait time for my task. Do you guys
> > see
> > > any reason for this?
> > >
> > >      Speculative execution is "on" by default right? Or should I enable
> > it?
> > >
> > > Regards,
> > > Raghava.
> > >
> > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com
> > > >wrote:
> > >
> > > >  Hi,
> > > >
> > > > I have also experienced this problem. Have you tried speculative
> > > execution?
> > > > Also, I have had jobs that took a long time for one mapper / reducer
> > > because
> > > > of a record that was significantly larger than those contained in the
> > > other
> > > > filesplits. Do you know if it always slows down for the same
> filesplit?
> > > >
> > > > Regards,
> > > > Greg Lawrence
> > > >
> > > >
> > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m....@gmail.com>
> > > wrote:
> > > >
> > > > Hello all,
> > > >
> > > >          I got the time out error as mentioned below -- after 600
> > > seconds,
> > > > that attempt was killed and the attempt would be deemed a failure. I
> > > > searched around about this error, and one of the suggestions to
> include
> > > > "progress" statements in the reducer -- it might be taking longer
> than
> > > 600
> > > > seconds and so is timing out. I added calls to context.progress() and
> > > > context.setStatus(str) in the reducer. Now, it works fine -- there
> are
> > no
> > > > timeout errors.
> > > >
> > > >          But, for a few jobs, it takes awfully long time to move from
> > > "Map
> > > > 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for
> some
> > > it
> > > > was more than an hour. The reduce code is not complex -- 2 level loop
> > and
> > > > couple of if-else blocks. The input size is also not huge, for the
> job
> > > that
> > > > gets struck for an hour at reduce 99%, it would take in 130. Some of
> > them
> > > > are 1-3 MB in size and couple of them are 16MB in size.
> > > >
> > > >          Has anyone encountered this problem before? Any pointers? I
> > use
> > > > Hadoop 0.20.2 on a linux cluster of 16 nodes.
> > > >
> > > > Thank you.
> > > >
> > > > Regards,
> > > > Raghava.
> > > >
> > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> > > > m.vijayaraghava@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > >        I am running a series of jobs one after another. While
> executing
> > > the
> > > > 4th job, the job fails. It fails in the reducer --- the progress
> > > percentage
> > > > would be map 100%, reduce 99%. It gives out the following message
> > > >
> > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> > > > Task attempt_201003240138_0110_r_000018_1 failed to report status for
> > 602
> > > > seconds. Killing!
> > > >
> > > > It makes several attempts again to execute it but fails with similar
> > > > message. I couldn't get anything from this error message and wanted
> to
> > > look
> > > > at logs (located in the default dir of ${HADOOP_HOME/logs}). But I
> > don't
> > > > find any files which match the timestamp of the job. Also I did not
> > find
> > > > history and userlogs in the logs folder. Should I look at some other
> > > place
> > > > for the logs? What could be the possible causes for the above error?
> > > >
> > > >        I am using Hadoop 0.20.2 and I am running it on a cluster with
> > 16
> > > > nodes.
> > > >
> > > > Thank you.
> > > >
> > > > Regards,
> > > > Raghava.
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Re: Reduce gets stuck at 99%

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi Ted,

      Thank you for the suggestion. I enabled it using the Configuration
class because I cannot change the hadoop-site.xml file (I am not an admin). The
situation is still the same --- it gets stuck at reduce 99% and does not
move further.
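
For reference, a sketch of that per-job setting in the driver, using the property
names from Ted's hadoop-site.xml snippet (the surrounding job setup is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same properties as in hadoop-site.xml, set for this job only:
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    Job job = new Job(conf, "example-job");
    // ... configure mapper, reducer, input and output paths here ...
    // job.waitForCompletion(true);
  }
}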

Regards,
Raghava.

On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yu...@gmail.com> wrote:

> You need to turn on yourself (hadoop-site.xml):
> <property>
>  <name>mapred.reduce.tasks.speculative.execution</name>
>  <value>true</value>
> </property>
>
> <property>
>  <name>mapred.map.tasks.speculative.execution</name>
>  <value>true</value>
> </property>
>
>
> On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com
> > wrote:
>
> > Hi,
> >
> >     Thank you Eric, Prashant and Greg. Although the timeout problem was
> > resolved, reduce is getting stuck at 99%. As of now, it has been stuck
> > there
> > for about 3 hrs. That is too high a wait time for my task. Do you guys
> see
> > any reason for this?
> >
> >      Speculative execution is "on" by default right? Or should I enable
> it?
> >
> > Regards,
> > Raghava.
> >
> > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com
> > >wrote:
> >
> > >  Hi,
> > >
> > > I have also experienced this problem. Have you tried speculative
> > execution?
> > > Also, I have had jobs that took a long time for one mapper / reducer
> > because
> > > of a record that was significantly larger than those contained in the
> > other
> > > filesplits. Do you know if it always slows down for the same filesplit?
> > >
> > > Regards,
> > > Greg Lawrence
> > >
> > >
> > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m....@gmail.com>
> > wrote:
> > >
> > > Hello all,
> > >
> > >          I got the time out error as mentioned below -- after 600
> > seconds,
> > > that attempt was killed and the attempt would be deemed a failure. I
> > > searched around about this error, and one of the suggestions to include
> > > "progress" statements in the reducer -- it might be taking longer than
> > 600
> > > seconds and so is timing out. I added calls to context.progress() and
> > > context.setStatus(str) in the reducer. Now, it works fine -- there are
> no
> > > timeout errors.
> > >
> > >          But, for a few jobs, it takes awfully long time to move from
> > "Map
> > > 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for some
> > it
> > > was more than an hour. The reduce code is not complex -- 2 level loop
> and
> > > couple of if-else blocks. The input size is also not huge, for the job
> > that
> > > gets struck for an hour at reduce 99%, it would take in 130. Some of
> them
> > > are 1-3 MB in size and couple of them are 16MB in size.
> > >
> > >          Has anyone encountered this problem before? Any pointers? I
> use
> > > Hadoop 0.20.2 on a linux cluster of 16 nodes.
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Raghava.
> > >
> > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> > > m.vijayaraghava@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > >        I am running a series of jobs one after another. While executing
> > the
> > > 4th job, the job fails. It fails in the reducer --- the progress
> > percentage
> > > would be map 100%, reduce 99%. It gives out the following message
> > >
> > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> > > Task attempt_201003240138_0110_r_000018_1 failed to report status for
> 602
> > > seconds. Killing!
> > >
> > > It makes several attempts again to execute it but fails with similar
> > > message. I couldn't get anything from this error message and wanted to
> > look
> > > at logs (located in the default dir of ${HADOOP_HOME/logs}). But I
> don't
> > > find any files which match the timestamp of the job. Also I did not
> find
> > > history and userlogs in the logs folder. Should I look at some other
> > place
> > > for the logs? What could be the possible causes for the above error?
> > >
> > >        I am using Hadoop 0.20.2 and I am running it on a cluster with
> 16
> > > nodes.
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Raghava.
> > >
> > >
> > >
> > >
> >
>

Re: Reduce gets stuck at 99%

Posted by Ted Yu <yu...@gmail.com>.
You need to turn it on yourself (hadoop-site.xml):
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>


On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com
> wrote:

> Hi,
>
>     Thank you Eric, Prashant and Greg. Although the timeout problem was
> resolved, reduce is getting stuck at 99%. As of now, it has been stuck
> there
> for about 3 hrs. That is too high a wait time for my task. Do you guys see
> any reason for this?
>
>      Speculative execution is "on" by default right? Or should I enable it?
>
> Regards,
> Raghava.
>
> On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com
> >wrote:
>
> >  Hi,
> >
> > I have also experienced this problem. Have you tried speculative
> execution?
> > Also, I have had jobs that took a long time for one mapper / reducer
> because
> > of a record that was significantly larger than those contained in the
> other
> > filesplits. Do you know if it always slows down for the same filesplit?
> >
> > Regards,
> > Greg Lawrence
> >
> >
> > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m....@gmail.com>
> wrote:
> >
> > Hello all,
> >
> >          I got the time out error as mentioned below -- after 600
> seconds,
> > that attempt was killed and the attempt would be deemed a failure. I
> > searched around about this error, and one of the suggestions to include
> > "progress" statements in the reducer -- it might be taking longer than
> 600
> > seconds and so is timing out. I added calls to context.progress() and
> > context.setStatus(str) in the reducer. Now, it works fine -- there are no
> > timeout errors.
> >
> >          But, for a few jobs, it takes awfully long time to move from
> "Map
> > 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for some
> it
> > was more than an hour. The reduce code is not complex -- 2 level loop and
> > couple of if-else blocks. The input size is also not huge, for the job
> that
> > gets struck for an hour at reduce 99%, it would take in 130. Some of them
> > are 1-3 MB in size and couple of them are 16MB in size.
> >
> >          Has anyone encountered this problem before? Any pointers? I use
> > Hadoop 0.20.2 on a linux cluster of 16 nodes.
> >
> > Thank you.
> >
> > Regards,
> > Raghava.
> >
> > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> > m.vijayaraghava@gmail.com> wrote:
> >
> > Hi all,
> >
> >        I am running a series of jobs one after another. While executing
> the
> > 4th job, the job fails. It fails in the reducer --- the progress
> percentage
> > would be map 100%, reduce 99%. It gives out the following message
> >
> > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> > attempt_201003240138_0110_r_000018_1, Status : FAILED
> > Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
> > seconds. Killing!
> >
> > It makes several attempts again to execute it but fails with similar
> > message. I couldn't get anything from this error message and wanted to
> look
> > at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't
> > find any files which match the timestamp of the job. Also I did not find
> > history and userlogs in the logs folder. Should I look at some other
> place
> > for the logs? What could be the possible causes for the above error?
> >
> >        I am using Hadoop 0.20.2 and I am running it on a cluster with 16
> > nodes.
> >
> > Thank you.
> >
> > Regards,
> > Raghava.
> >
> >
> >
> >
>

Re: Reduce gets stuck at 99%

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi,

     Thank you Eric, Prashant and Greg. Although the timeout problem was
resolved, the reduce is getting stuck at 99%. As of now, it has been stuck there
for about 3 hours. That is too long a wait for my task. Do you guys see
any reason for this?

      Speculative execution is "on" by default, right? Or should I enable it?

Regards,
Raghava.

On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gr...@yahoo-inc.com>wrote:

>  Hi,
>
> I have also experienced this problem. Have you tried speculative execution?
> Also, I have had jobs that took a long time for one mapper / reducer because
> of a record that was significantly larger than those contained in the other
> filesplits. Do you know if it always slows down for the same filesplit?
>
> Regards,
> Greg Lawrence
>
>
> On 4/8/10 10:30 AM, "Raghava Mutharaju" <m....@gmail.com> wrote:
>
> Hello all,
>
>          I got the time out error as mentioned below -- after 600 seconds,
> that attempt was killed and the attempt would be deemed a failure. I
> searched around about this error, and one of the suggestions to include
> "progress" statements in the reducer -- it might be taking longer than 600
> seconds and so is timing out. I added calls to context.progress() and
> context.setStatus(str) in the reducer. Now, it works fine -- there are no
> timeout errors.
>
>          But, for a few jobs, it takes awfully long time to move from "Map
> 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for some it
> was more than an hour. The reduce code is not complex -- 2 level loop and
> couple of if-else blocks. The input size is also not huge, for the job that
> gets struck for an hour at reduce 99%, it would take in 130. Some of them
> are 1-3 MB in size and couple of them are 16MB in size.
>
>          Has anyone encountered this problem before? Any pointers? I use
> Hadoop 0.20.2 on a linux cluster of 16 nodes.
>
> Thank you.
>
> Regards,
> Raghava.
>
> On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com> wrote:
>
> Hi all,
>
>        I am running a series of jobs one after another. While executing the
> 4th job, the job fails. It fails in the reducer --- the progress percentage
> would be map 100%, reduce 99%. It gives out the following message
>
> 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> attempt_201003240138_0110_r_000018_1, Status : FAILED
> Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
> seconds. Killing!
>
> It makes several attempts again to execute it but fails with similar
> message. I couldn't get anything from this error message and wanted to look
> at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't
> find any files which match the timestamp of the job. Also I did not find
> history and userlogs in the logs folder. Should I look at some other place
> for the logs? What could be the possible causes for the above error?
>
>        I am using Hadoop 0.20.2 and I am running it on a cluster with 16
> nodes.
>
> Thank you.
>
> Regards,
> Raghava.
>
>
>
>

Re: Reduce gets stuck at 99%

Posted by Gregory Lawrence <gr...@yahoo-inc.com>.
Hi,

I have also experienced this problem. Have you tried speculative execution? Also, I have had jobs that took a long time for one mapper / reducer because of a record that was significantly larger than those contained in the other filesplits. Do you know if it always slows down for the same filesplit?
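
One way to check the "same filesplit" question is to log each map task's split when
it starts; a minimal sketch assuming FileInputFormat (so the split is a FileSplit)
and placeholder key/value types:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitLoggingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
    // Appears in the task's logs, so a consistently slow attempt can be
    // matched to the file and offset it was reading.
    System.err.println("Processing split: " + split.getPath()
        + " offset " + split.getStart() + " length " + split.getLength());
  }
  // map() is inherited (identity), so the types above keep it type-correct.
}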

Regards,
Greg Lawrence

On 4/8/10 10:30 AM, "Raghava Mutharaju" <m....@gmail.com> wrote:

Hello all,

         I got the time out error as mentioned below -- after 600 seconds, that attempt was killed and the attempt would be deemed a failure. I searched around about this error, and one of the suggestions to include "progress" statements in the reducer -- it might be taking longer than 600 seconds and so is timing out. I added calls to context.progress() and context.setStatus(str) in the reducer. Now, it works fine -- there are no timeout errors.

         But, for a few jobs, it takes awfully long time to move from "Map 100%, Reduce 99%" to Reduce 100%. For some jobs its 15mins and for some it was more than an hour. The reduce code is not complex -- 2 level loop and couple of if-else blocks. The input size is also not huge, for the job that gets struck for an hour at reduce 99%, it would take in 130. Some of them are 1-3 MB in size and couple of them are 16MB in size.

         Has anyone encountered this problem before? Any pointers? I use Hadoop 0.20.2 on a linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.

On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m....@gmail.com> wrote:
Hi all,

       I am running a series of jobs one after another. While executing the 4th job, the job fails. It fails in the reducer --- the progress percentage would be map 100%, reduce 99%. It gives out the following message

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_000018_1, Status : FAILED
Task attempt_201003240138_0110_r_000018_1 failed to report status for 602 seconds. Killing!

It makes several attempts again to execute it but fails with similar message. I couldn't get anything from this error message and wanted to look at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't find any files which match the timestamp of the job. Also I did not find history and userlogs in the logs folder. Should I look at some other place for the logs? What could be the possible causes for the above error?

       I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.

Thank you.

Regards,
Raghava.



Re: Reduce gets stuck at 99%

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi,

        Thank you Ted. I would just describe the problem again, so that it
is easier for anyone reading this email chain.

I run a series of jobs one after another. Starting from the 4th job, the reducer
gets stuck at 99% (Map 100% and Reduce 99%). It gets stuck at 99% for many
hours and then the job fails. Earlier there were 2 exceptions in the logs
--- a DFSClient exception (could not complete write to file <file name>)
and a LeaseExpiredException. Then I increased ulimit -n (the max number of open
files) from 1024 to 32768 on the advice of Ted. After this, there are no
exceptions in the logs, but the reduce still gets stuck at 99%.
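
For reference, the per-node checks that come up later in the quoted thread look
like this (the "hadoop" user name is a placeholder, and 32768 is the limit Ted
suggested, not a prescription):

ulimit -n      # open-file limit for the current user (1024 was the old value here)
lsof | wc -l   # rough count of files currently open on the node
# Raising the limit usually means an admin adding lines like these to
# /etc/security/limits.conf on every node and restarting the Hadoop daemons:
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  32768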

Do you have any suggestions?
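
One way to act on the jstack suggestion in the quoted exchange below, on a node
running a stuck reduce attempt (the pid 12345 is a placeholder):

jps -lm | grep attempt_                   # the child JVM lists its task attempt id in its arguments
jstack 12345 > reduce_attempt.jstack.txt  # thread dump of the suspect child process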

Thank you.

Regards,
Raghava.


On Sat, Apr 17, 2010 at 9:36 PM, Ted Yu <yu...@gmail.com> wrote:

> Hi,
> Putting this thread back in pool to leverage collective intelligence.
>
> If you get the full command line of the java processes, it wouldn't be
> difficult to correlate reduce task(s) with a particular job.
>
> Cheers
>
> On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com> wrote:
>
> > Hello Ted,
> >
> >        Thank you for the suggestions :). I haven't come across any other
> > serious issue before this one. Infact, the same MR job runs for a smaller
> > input size, although, lot slower than what we expected.
> >
> > I will use jstack to get stack trace. I had a question in this regard.
> How
> > would I know which MR job (job id) is related to which java process
> (pid)? I
> > can get a list of hadoop jobs with "hadoop job -list" and list of java
> > processes with "jps" but how I couldn't determine how to get the
> connection
> > between these 2 lists.
> >
> >
> > Thank you again.
> >
> > Regards,
> > Raghava.
> >
> > On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> If you look at
> >>
> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776,
> >> you can see that hdfs-127-branch20-redone-v2.txt<
> https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt>was
> the latest.
> >>
> >> You need to download the source code corresponding to your version of
> >> hadoop, apply the patch and rebuild.
> >>
> >> If you haven't experienced serious issue with hadoop for other
> scenarios,
> >> we should try to find out the root cause for the current problem without
> the
> >> 127 patch.
> >>
> >> My advice is to use jstack to find what each thread was waiting for
> after
> >> reducers get stuck.
> >> I would expect a deadlock in either your code or hdfs, I would think it
> >> should the former.
> >>
> >> You can replace sensitive names in the stack traces and paste it if you
> >> cannot determine the deadlock.
> >>
> >> Cheers
> >>
> >>
> >> On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <
> >> m.vijayaraghava@gmail.com> wrote:
> >>
> >>> Hello Ted,
> >>>
> >>>       Thank you for the reply. Will this change fix my issue? I asked
> >>> this because I again need to convince my admin to make this change.
> >>>
> >>>       We have a gateway to the cluster-head. We generally run our MR
> jobs
> >>> on the gateway. Should this change be made to the hadoop installation
> on the
> >>> gateway?
> >>>
> >>> 1) I am confused on which patch to be applied? There are 4 patches
> listed
> >>> at https://issues.apache.org/jira/browse/HDFS-127
> >>>
> >>> 2) How to apply the patch? Should we change the lines of code specified
> >>> and rebuild hadoop? Or is there any other way?
> >>>
> >>> Thank you again.
> >>>
> >>> Regards,
> >>> Raghava.
> >>>
> >>>
> >>> On Fri, Apr 16, 2010 at 6:42 PM, <yu...@gmail.com> wrote:
> >>>
> >>>> That patch is very important.
> >>>>
> >>>> please apply it.
> >>>>
> >>>> Sent from my Verizon Wireless BlackBerry
> >>>> ------------------------------
> >>>> *From: * Raghava Mutharaju <m....@gmail.com>
> >>>> *Date: *Fri, 16 Apr 2010 17:27:11 -0400
> >>>> *To: *Ted Yu<yu...@gmail.com>
> >>>> *Subject: *Re: Reduce gets struck at 99%
> >>>>
> >>>> Hi Ted,
> >>>>
> >>>>         It took sometime to contact my department's admin (he was on
> >>>> leave) and ask him to make ulimit changes effective in the cluster
> (just
> >>>> adding entry in /etc/security/limits.conf was not sufficient, so took
> >>>> sometime to figure out). Now the ulimit is 32768. I ran the set of MR
> jobs,
> >>>> the result is the same --- it gets stuck at Reduce 99%. But this time,
> there
> >>>> are no exceptions in the logs. I view JobTracker logs through the Web
> UI. I
> >>>> checked "Running Jobs" as well as "Failed Jobs".
> >>>>
> >>>> I haven't asked the admin to apply the patch
> >>>> https://issues.apache.org/jira/browse/HDFS-127 that you mentioned
> >>>> earlier. Is this important?
> >>>>
> >>>> Do you any suggestions?
> >>>>
> >>>> Thank you.
> >>>>
> >>>> Regards,
> >>>> Raghava.
> >>>>
> >>>> On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>>
> >>>>> For the user under whom you launch MR jobs.
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <
> >>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Ted,
> >>>>>>
> >>>>>>        Sorry to bug you again :) but I do not have an account on all
> >>>>>> the datanodes, I just have it on the machine on which I start the MR
> jobs.
> >>>>>> So is it required to increase the ulimit on all the nodes (in this
> case the
> >>>>>> admin may have to increase it for all the users?)
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raghava.
> >>>>>>
> >>>>>> On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> ulimit should be increased on all nodes.
> >>>>>>>
> >>>>>>> The link I gave you lists several actions to take. I think they're
> >>>>>>> not specifically for hbase.
> >>>>>>> Also make sure the following is applied:
> >>>>>>> https://issues.apache.org/jira/browse/HDFS-127
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <
> >>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hello Ted,
> >>>>>>>>
> >>>>>>>>        Should the increase in ulimit to 32768 be applied on all
> the
> >>>>>>>> datanodes (its a 16 node cluster)? Is this related to HBase,
> because I am
> >>>>>>>> not using HBase.
> >>>>>>>>        Are the exceptions & delay (at Reduce 99%) due to this?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Raghava.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>>> Your ulimit is low.
> >>>>>>>>> Ask your admin to increase it to 32768
> >>>>>>>>>
> >>>>>>>>> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <
> >>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ted,
> >>>>>>>>>>
> >>>>>>>>>> I am pasting below the timestamps from the log.
> >>>>>>>>>>
> >>>>>>>>>>        Lease-exception:
> >>>>>>>>>>
> >>>>>>>>>> Task Attempts Machine Status Progress Start Time Shuffle
> Finished
> >>>>>>>>>> Sort Finished Finish Time Errors Task Logs
> >>>>>>>>>>    Counters Actions
> >>>>>>>>>>    attempt_201004060646_0057_r_000014_0 /default-rack/nimbus15
> >>>>>>>>>> FAILED 0.00%
> >>>>>>>>>>    8-Apr-2010 07:38:53 8-Apr-2010 07:39:21 (27sec) 8-Apr-2010
> >>>>>>>>>> 07:39:21 (0sec) 8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)
> >>>>>>>>>>
> >>>>>>>>>> -------------------------------------
> >>>>>>>>>>
> >>>>>>>>>>         DFS Client Exception:
> >>>>>>>>>>
> >>>>>>>>>> Task Attempts Machine Status Progress Start Time Shuffle
> Finished
> >>>>>>>>>> Sort Finished Finish Time Errors Task Logs
> >>>>>>>>>>    Counters Actions
> >>>>>>>>>>    attempt_201004060646_0057_r_000006_0 /default-rack/
> >>>>>>>>>> nimbus3.cs.wright.edu FAILED 0.00%
> >>>>>>>>>>    8-Apr-2010 07:38:47 8-Apr-2010 07:39:10 (23sec) 8-Apr-2010
> >>>>>>>>>> 07:39:11 (0sec) 8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)
> >>>>>>>>>> ------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> The file limit is set to 1024. I checked couple of datanodes. I
> >>>>>>>>>> haven't checked the headnode though.
> >>>>>>>>>>
> >>>>>>>>>> The no of currently open files under my username, on the system
> on
> >>>>>>>>>> which I started the MR jobs are 346
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thank you for you help :)
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Raghava.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu <yuzhihong@gmail.com
> >wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Can you give me the timestamps of the two exceptions ?
> >>>>>>>>>>> I want to see if they're related.
> >>>>>>>>>>>
> >>>>>>>>>>> I saw DFSClient$DFSOutputStream.close() in the first stack
> trace.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu <yuzhihong@gmail.com
> >wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> just to double check it's not a file
> >>>>>>>>>>>> limits issue could you run the following on each of the hosts:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ulimit -a
> >>>>>>>>>>>> $ lsof | wc -l
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first command will show you (among other things) the file
> >>>>>>>>>>>> limits, it
> >>>>>>>>>>>> should be above the default 1024.  The second will tell you
> have
> >>>>>>>>>>>> many files
> >>>>>>>>>>>> are currently open...
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 8, 2010 at 7:40 PM, Raghava Mutharaju <
> >>>>>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ted,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         Thank you for all the suggestions. I went through the
> >>>>>>>>>>>>> job tracker logs and I have attached the exceptions found in
> the logs. I
> >>>>>>>>>>>>> found two exceptions
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) org.apache.hadoop.ipc.RemoteException:
> java.io.IOException:
> >>>>>>>>>>>>> Could not complete write to file    (DFS Client)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) org.apache.hadoop.ipc.RemoteException:
> >>>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
> No lease on
> >>>>>>>>>>>>>
> /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
> >>>>>>>>>>>>> File does not exist. Holder
> DFSClient_attempt_201004060646_0057_r_000014_0
> >>>>>>>>>>>>> does not have any open files.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The exception occurs at the point of writing out <K,V> pairs
> in
> >>>>>>>>>>>>> the reducer and it occurs only in certain task attempts. I am
> not using any
> >>>>>>>>>>>>> custom output format or record writers but I do use custom
> input reader.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What could have gone wrong here?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Raghava.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yuzhihong@gmail.com
> >wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Raghava:
> >>>>>>>>>>>>>> Are you able to share the last segment of reducer log ?
> >>>>>>>>>>>>>> You can get them from web UI:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Adding more log in your reducer task would help pinpoint
> where
> >>>>>>>>>>>>>> the issue is.
> >>>>>>>>>>>>>> Also look in job tracker log.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <
> >>>>>>>>>>>>>> m.vijayaraghava@gmail.com
> >>>>>>>>>>>>>> > wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> > Hi Ted,
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> >      Thank you for the suggestion. I enabled it using the
> >>>>>>>>>>>>>> Configuration
> >>>>>>>>>>>>>> > class because I cannot change hadoop-site.xml file (I am
> not
> >>>>>>>>>>>>>> an admin). The
> >>>>>>>>>>>>>> > situation is still the same --- it gets stuck at reduce
> 99%
> >>>>>>>>>>>>>> and does not
> >>>>>>>>>>>>>> > move further.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > Regards,
> >>>>>>>>>>>>>> > Raghava.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <
> yuzhihong@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > > You need to turn on yourself (hadoop-site.xml):
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >  <name>mapred.reduce.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >  <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >  <name>mapred.map.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >  <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
> >>>>>>>>>>>>>> > > m.vijayaraghava@gmail.com
> >>>>>>>>>>>>>> > > > wrote:
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > > Hi,
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >     Thank you Eric, Prashant and Greg. Although the
> >>>>>>>>>>>>>> timeout problem was
> >>>>>>>>>>>>>> > > > resolved, reduce is getting stuck at 99%. As of now,
> it
> >>>>>>>>>>>>>> has been stuck
> >>>>>>>>>>>>>> > > > there
> >>>>>>>>>>>>>> > > > for about 3 hrs. That is too high a wait time for my
> >>>>>>>>>>>>>> task. Do you guys
> >>>>>>>>>>>>>> > > see
> >>>>>>>>>>>>>> > > > any reason for this?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >      Speculative execution is "on" by default right?
> Or
> >>>>>>>>>>>>>> should I enable
> >>>>>>>>>>>>>> > > it?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > Regards,
> >>>>>>>>>>>>>> > > > Raghava.
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <
> >>>>>>>>>>>>>> gregl@yahoo-inc.com
> >>>>>>>>>>>>>> > > > >wrote:
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > >  Hi,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > I have also experienced this problem. Have you tried
> >>>>>>>>>>>>>> speculative
> >>>>>>>>>>>>>> > > > execution?
> >>>>>>>>>>>>>> > > > > Also, I have had jobs that took a long time for one
> >>>>>>>>>>>>>> mapper / reducer
> >>>>>>>>>>>>>> > > > because
> >>>>>>>>>>>>>> > > > > of a record that was significantly larger than those
> >>>>>>>>>>>>>> contained in the
> >>>>>>>>>>>>>> > > > other
> >>>>>>>>>>>>>> > > > > filesplits. Do you know if it always slows down for
> >>>>>>>>>>>>>> the same
> >>>>>>>>>>>>>> > filesplit?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Greg Lawrence
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <
> >>>>>>>>>>>>>> m.vijayaraghava@gmail.com>
> >>>>>>>>>>>>>> > > > wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hello all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >          I got the time out error as mentioned below
> >>>>>>>>>>>>>> -- after 600
> >>>>>>>>>>>>>> > > > seconds,
> >>>>>>>>>>>>>> > > > > that attempt was killed and the attempt would be
> >>>>>>>>>>>>>> deemed a failure. I
> >>>>>>>>>>>>>> > > > > searched around about this error, and one of the
> >>>>>>>>>>>>>> suggestions to
> >>>>>>>>>>>>>> > include
> >>>>>>>>>>>>>> > > > > "progress" statements in the reducer -- it might be
> >>>>>>>>>>>>>> taking longer
> >>>>>>>>>>>>>> > than
> >>>>>>>>>>>>>> > > > 600
> >>>>>>>>>>>>>> > > > > seconds and so is timing out. I added calls to
> >>>>>>>>>>>>>> context.progress() and
> >>>>>>>>>>>>>> > > > > context.setStatus(str) in the reducer. Now, it works
> >>>>>>>>>>>>>> fine -- there
> >>>>>>>>>>>>>> > are
> >>>>>>>>>>>>>> > > no
> >>>>>>>>>>>>>> > > > > timeout errors.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >          But, for a few jobs, it takes awfully long
> >>>>>>>>>>>>>> time to move from
> >>>>>>>>>>>>>> > > > "Map
> >>>>>>>>>>>>>> > > > > 100%, Reduce 99%" to Reduce 100%. For some jobs its
> >>>>>>>>>>>>>> 15mins and for
> >>>>>>>>>>>>>> > some
> >>>>>>>>>>>>>> > > > it
> >>>>>>>>>>>>>> > > > > was more than an hour. The reduce code is not
> complex
> >>>>>>>>>>>>>> -- 2 level loop
> >>>>>>>>>>>>>> > > and
> >>>>>>>>>>>>>> > > > > couple of if-else blocks. The input size is also not
> >>>>>>>>>>>>>> huge, for the
> >>>>>>>>>>>>>> > job
> >>>>>>>>>>>>>> > > > that
> >>>>>>>>>>>>>> > > > > gets struck for an hour at reduce 99%, it would take
> >>>>>>>>>>>>>> in 130. Some of
> >>>>>>>>>>>>>> > > them
> >>>>>>>>>>>>>> > > > > are 1-3 MB in size and couple of them are 16MB in
> >>>>>>>>>>>>>> size.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >          Has anyone encountered this problem before?
> >>>>>>>>>>>>>> Any pointers? I
> >>>>>>>>>>>>>> > > use
> >>>>>>>>>>>>>> > > > > Hadoop 0.20.2 on a linux cluster of 16 nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> >>>>>>>>>>>>>> > > > > m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hi all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >        I am running a series of jobs one after
> >>>>>>>>>>>>>> another. While
> >>>>>>>>>>>>>> > executing
> >>>>>>>>>>>>>> > > > the
> >>>>>>>>>>>>>> > > > > 4th job, the job fails. It fails in the reducer ---
> >>>>>>>>>>>>>> the progress
> >>>>>>>>>>>>>> > > > percentage
> >>>>>>>>>>>>>> > > > > would be map 100%, reduce 99%. It gives out the
> >>>>>>>>>>>>>> following message
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> >>>>>>>>>>>>>> > > > > attempt_201003240138_0110_r_000018_1, Status :
> FAILED
> >>>>>>>>>>>>>> > > > > Task attempt_201003240138_0110_r_000018_1 failed to
> >>>>>>>>>>>>>> report status for
> >>>>>>>>>>>>>> > > 602
> >>>>>>>>>>>>>> > > > > seconds. Killing!
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > It makes several attempts again to execute it but
> >>>>>>>>>>>>>> fails with similar
> >>>>>>>>>>>>>> > > > > message. I couldn't get anything from this error
> >>>>>>>>>>>>>> message and wanted
> >>>>>>>>>>>>>> > to
> >>>>>>>>>>>>>> > > > look
> >>>>>>>>>>>>>> > > > > at logs (located in the default dir of
> >>>>>>>>>>>>>> ${HADOOP_HOME/logs}). But I
> >>>>>>>>>>>>>> > > don't
> >>>>>>>>>>>>>> > > > > find any files which match the timestamp of the job.
> >>>>>>>>>>>>>> Also I did not
> >>>>>>>>>>>>>> > > find
> >>>>>>>>>>>>>> > > > > history and userlogs in the logs folder. Should I
> look
> >>>>>>>>>>>>>> at some other
> >>>>>>>>>>>>>> > > > place
> >>>>>>>>>>>>>> > > > > for the logs? What could be the possible causes for
> >>>>>>>>>>>>>> the above error?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >        I am using Hadoop 0.20.2 and I am running it
> on
> >>>>>>>>>>>>>> a cluster with
> >>>>>>>>>>>>>> > > 16
> >>>>>>>>>>>>>> > > > > nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: Reduce gets struck at 99%

Posted by Raghava Mutharaju <m....@gmail.com>.
Hi,

        Thank you Ted. I will describe the problem again so that it is
easier for anyone reading this email chain.

I run a series of jobs one after another. Starting from the 4th job, the
reducer gets stuck at 99% (Map 100%, Reduce 99%). It stays at 99% for many
hours and then the job fails. Earlier there were two exceptions in the logs
--- a DFSClient exception ("Could not complete write to file <file name>")
and a LeaseExpiredException. Then, on Ted's advice, I increased ulimit -n
(the maximum number of open files) from 1024 to 32768. After this change,
there are no exceptions in the logs, but the reduce still gets stuck at 99%.
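
For reference, a quick way to double-check that the new limit is actually in
effect for the Hadoop daemons, since changes to /etc/security/limits.conf
only apply to processes started after the change. This is only a rough
sketch -- it assumes a Linux /proc filesystem and that the daemons run under
a "hadoop" user; the pid and user name are placeholders:

$ ulimit -n                              # limit for the current shell
$ pgrep -f TaskTracker                   # pid of the TaskTracker on a slave node
$ grep "open files" /proc/<pid>/limits   # limit the running daemon actually has
$ lsof -u hadoop | wc -l                 # files currently open under that user

If /proc/<pid>/limits still shows 1024, the TaskTracker (and the child task
JVMs it forks) predate the new limit and would need to be restarted.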

Do you have any suggestions?

Thank you.

Regards,
Raghava.


On Sat, Apr 17, 2010 at 9:36 PM, Ted Yu <yu...@gmail.com> wrote:

> Hi,
> Putting this thread back in pool to leverage collective intelligence.
>
> If you get the full command line of the java processes, it wouldn't be
> difficult to correlate reduce task(s) with a particular job.
>
> Cheers
>
> On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com> wrote:
>
> > Hello Ted,
> >
> >        Thank you for the suggestions :). I haven't come across any other
> > serious issue before this one. Infact, the same MR job runs for a smaller
> > input size, although, lot slower than what we expected.
> >
> > I will use jstack to get stack trace. I had a question in this regard.
> How
> > would I know which MR job (job id) is related to which java process
> (pid)? I
> > can get a list of hadoop jobs with "hadoop job -list" and list of java
> > processes with "jps" but how I couldn't determine how to get the
> connection
> > between these 2 lists.
> >
> >
> > Thank you again.
> >
> > Regards,
> > Raghava.
> >
> > On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> If you look at
> >>
> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776,
> >> you can see that hdfs-127-branch20-redone-v2.txt<
> https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt>was
> the latest.
> >>
> >> You need to download the source code corresponding to your version of
> >> hadoop, apply the patch and rebuild.
> >>
> >> If you haven't experienced serious issue with hadoop for other
> scenarios,
> >> we should try to find out the root cause for the current problem without
> the
> >> 127 patch.
> >>
> >> My advice is to use jstack to find what each thread was waiting for
> after
> >> reducers get stuck.
> >> I would expect a deadlock in either your code or hdfs, I would think it
> >> should the former.
> >>
> >> You can replace sensitive names in the stack traces and paste it if you
> >> cannot determine the deadlock.
> >>
> >> Cheers
> >>
> >>
> >> On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <
> >> m.vijayaraghava@gmail.com> wrote:
> >>
> >>> Hello Ted,
> >>>
> >>>       Thank you for the reply. Will this change fix my issue? I asked
> >>> this because I again need to convince my admin to make this change.
> >>>
> >>>       We have a gateway to the cluster-head. We generally run our MR
> jobs
> >>> on the gateway. Should this change be made to the hadoop installation
> on the
> >>> gateway?
> >>>
> >>> 1) I am confused on which patch to be applied? There are 4 patches
> listed
> >>> at https://issues.apache.org/jira/browse/HDFS-127
> >>>
> >>> 2) How to apply the patch? Should we change the lines of code specified
> >>> and rebuild hadoop? Or is there any other way?
> >>>
> >>> Thank you again.
> >>>
> >>> Regards,
> >>> Raghava.
> >>>
> >>>
> >>> On Fri, Apr 16, 2010 at 6:42 PM, <yu...@gmail.com> wrote:
> >>>
> >>>> That patch is very important.
> >>>>
> >>>> please apply it.
> >>>>
> >>>> Sent from my Verizon Wireless BlackBerry
> >>>> ------------------------------
> >>>> *From: * Raghava Mutharaju <m....@gmail.com>
> >>>> *Date: *Fri, 16 Apr 2010 17:27:11 -0400
> >>>> *To: *Ted Yu<yu...@gmail.com>
> >>>> *Subject: *Re: Reduce gets struck at 99%
> >>>>
> >>>> Hi Ted,
> >>>>
> >>>>         It took sometime to contact my department's admin (he was on
> >>>> leave) and ask him to make ulimit changes effective in the cluster
> (just
> >>>> adding entry in /etc/security/limits.conf was not sufficient, so took
> >>>> sometime to figure out). Now the ulimit is 32768. I ran the set of MR
> jobs,
> >>>> the result is the same --- it gets stuck at Reduce 99%. But this time,
> there
> >>>> are no exceptions in the logs. I view JobTracker logs through the Web
> UI. I
> >>>> checked "Running Jobs" as well as "Failed Jobs".
> >>>>
> >>>> I haven't asked the admin to apply the patch
> >>>> https://issues.apache.org/jira/browse/HDFS-127 that you mentioned
> >>>> earlier. Is this important?
> >>>>
> >>>> Do you any suggestions?
> >>>>
> >>>> Thank you.
> >>>>
> >>>> Regards,
> >>>> Raghava.
> >>>>
> >>>> On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>>
> >>>>> For the user under whom you launch MR jobs.
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <
> >>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Ted,
> >>>>>>
> >>>>>>        Sorry to bug you again :) but I do not have an account on all
> >>>>>> the datanodes, I just have it on the machine on which I start the MR
> jobs.
> >>>>>> So is it required to increase the ulimit on all the nodes (in this
> case the
> >>>>>> admin may have to increase it for all the users?)
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raghava.
> >>>>>>
> >>>>>> On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> ulimit should be increased on all nodes.
> >>>>>>>
> >>>>>>> The link I gave you lists several actions to take. I think they're
> >>>>>>> not specifically for hbase.
> >>>>>>> Also make sure the following is applied:
> >>>>>>> https://issues.apache.org/jira/browse/HDFS-127
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <
> >>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hello Ted,
> >>>>>>>>
> >>>>>>>>        Should the increase in ulimit to 32768 be applied on all
> the
> >>>>>>>> datanodes (its a 16 node cluster)? Is this related to HBase,
> because I am
> >>>>>>>> not using HBase.
> >>>>>>>>        Are the exceptions & delay (at Reduce 99%) due to this?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Raghava.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu <yu...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>>> Your ulimit is low.
> >>>>>>>>> Ask your admin to increase it to 32768
> >>>>>>>>>
> >>>>>>>>> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <
> >>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ted,
> >>>>>>>>>>
> >>>>>>>>>> I am pasting below the timestamps from the log.
> >>>>>>>>>>
> >>>>>>>>>>        Lease-exception:
> >>>>>>>>>>
> >>>>>>>>>> Task Attempts Machine Status Progress Start Time Shuffle
> Finished
> >>>>>>>>>> Sort Finished Finish Time Errors Task Logs
> >>>>>>>>>>    Counters Actions
> >>>>>>>>>>    attempt_201004060646_0057_r_000014_0 /default-rack/nimbus15
> >>>>>>>>>> FAILED 0.00%
> >>>>>>>>>>    8-Apr-2010 07:38:53 8-Apr-2010 07:39:21 (27sec) 8-Apr-2010
> >>>>>>>>>> 07:39:21 (0sec) 8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)
> >>>>>>>>>>
> >>>>>>>>>> -------------------------------------
> >>>>>>>>>>
> >>>>>>>>>>         DFS Client Exception:
> >>>>>>>>>>
> >>>>>>>>>> Task Attempts Machine Status Progress Start Time Shuffle
> Finished
> >>>>>>>>>> Sort Finished Finish Time Errors Task Logs
> >>>>>>>>>>    Counters Actions
> >>>>>>>>>>    attempt_201004060646_0057_r_000006_0 /default-rack/
> >>>>>>>>>> nimbus3.cs.wright.edu FAILED 0.00%
> >>>>>>>>>>    8-Apr-2010 07:38:47 8-Apr-2010 07:39:10 (23sec) 8-Apr-2010
> >>>>>>>>>> 07:39:11 (0sec) 8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)
> >>>>>>>>>> ------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> The file limit is set to 1024. I checked couple of datanodes. I
> >>>>>>>>>> haven't checked the headnode though.
> >>>>>>>>>>
> >>>>>>>>>> The no of currently open files under my username, on the system
> on
> >>>>>>>>>> which I started the MR jobs are 346
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thank you for you help :)
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Raghava.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu <yuzhihong@gmail.com
> >wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Can you give me the timestamps of the two exceptions ?
> >>>>>>>>>>> I want to see if they're related.
> >>>>>>>>>>>
> >>>>>>>>>>> I saw DFSClient$DFSOutputStream.close() in the first stack
> trace.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu <yuzhihong@gmail.com
> >wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> just to double check it's not a file
> >>>>>>>>>>>> limits issue could you run the following on each of the hosts:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ulimit -a
> >>>>>>>>>>>> $ lsof | wc -l
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first command will show you (among other things) the file
> >>>>>>>>>>>> limits, it
> >>>>>>>>>>>> should be above the default 1024.  The second will tell you
> have
> >>>>>>>>>>>> many files
> >>>>>>>>>>>> are currently open...
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 8, 2010 at 7:40 PM, Raghava Mutharaju <
> >>>>>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ted,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         Thank you for all the suggestions. I went through the
> >>>>>>>>>>>>> job tracker logs and I have attached the exceptions found in
> the logs. I
> >>>>>>>>>>>>> found two exceptions
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) org.apache.hadoop.ipc.RemoteException:
> java.io.IOException:
> >>>>>>>>>>>>> Could not complete write to file    (DFS Client)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) org.apache.hadoop.ipc.RemoteException:
> >>>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
> No lease on
> >>>>>>>>>>>>>
> /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
> >>>>>>>>>>>>> File does not exist. Holder
> DFSClient_attempt_201004060646_0057_r_000014_0
> >>>>>>>>>>>>> does not have any open files.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The exception occurs at the point of writing out <K,V> pairs
> in
> >>>>>>>>>>>>> the reducer and it occurs only in certain task attempts. I am
> not using any
> >>>>>>>>>>>>> custom output format or record writers but I do use custom
> input reader.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What could have gone wrong here?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Raghava.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yuzhihong@gmail.com
> >wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Raghava:
> >>>>>>>>>>>>>> Are you able to share the last segment of reducer log ?
> >>>>>>>>>>>>>> You can get them from web UI:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Adding more log in your reducer task would help pinpoint
> where
> >>>>>>>>>>>>>> the issue is.
> >>>>>>>>>>>>>> Also look in job tracker log.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <
> >>>>>>>>>>>>>> m.vijayaraghava@gmail.com
> >>>>>>>>>>>>>> > wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> > Hi Ted,
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> >      Thank you for the suggestion. I enabled it using the
> >>>>>>>>>>>>>> Configuration
> >>>>>>>>>>>>>> > class because I cannot change hadoop-site.xml file (I am
> not
> >>>>>>>>>>>>>> an admin). The
> >>>>>>>>>>>>>> > situation is still the same --- it gets stuck at reduce
> 99%
> >>>>>>>>>>>>>> and does not
> >>>>>>>>>>>>>> > move further.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > Regards,
> >>>>>>>>>>>>>> > Raghava.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <
> yuzhihong@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > > You need to turn on yourself (hadoop-site.xml):
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >  <name>mapred.reduce.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >  <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >  <name>mapred.map.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >  <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
> >>>>>>>>>>>>>> > > m.vijayaraghava@gmail.com
> >>>>>>>>>>>>>> > > > wrote:
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > > Hi,
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >     Thank you Eric, Prashant and Greg. Although the
> >>>>>>>>>>>>>> timeout problem was
> >>>>>>>>>>>>>> > > > resolved, reduce is getting stuck at 99%. As of now,
> it
> >>>>>>>>>>>>>> has been stuck
> >>>>>>>>>>>>>> > > > there
> >>>>>>>>>>>>>> > > > for about 3 hrs. That is too high a wait time for my
> >>>>>>>>>>>>>> task. Do you guys
> >>>>>>>>>>>>>> > > see
> >>>>>>>>>>>>>> > > > any reason for this?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >      Speculative execution is "on" by default right?
> Or
> >>>>>>>>>>>>>> should I enable
> >>>>>>>>>>>>>> > > it?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > Regards,
> >>>>>>>>>>>>>> > > > Raghava.
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <
> >>>>>>>>>>>>>> gregl@yahoo-inc.com
> >>>>>>>>>>>>>> > > > >wrote:
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > >  Hi,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > I have also experienced this problem. Have you tried
> >>>>>>>>>>>>>> speculative
> >>>>>>>>>>>>>> > > > execution?
> >>>>>>>>>>>>>> > > > > Also, I have had jobs that took a long time for one
> >>>>>>>>>>>>>> mapper / reducer
> >>>>>>>>>>>>>> > > > because
> >>>>>>>>>>>>>> > > > > of a record that was significantly larger than those
> >>>>>>>>>>>>>> contained in the
> >>>>>>>>>>>>>> > > > other
> >>>>>>>>>>>>>> > > > > filesplits. Do you know if it always slows down for
> >>>>>>>>>>>>>> the same
> >>>>>>>>>>>>>> > filesplit?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Greg Lawrence
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <
> >>>>>>>>>>>>>> m.vijayaraghava@gmail.com>
> >>>>>>>>>>>>>> > > > wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hello all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >          I got the time out error as mentioned below
> >>>>>>>>>>>>>> -- after 600
> >>>>>>>>>>>>>> > > > seconds,
> >>>>>>>>>>>>>> > > > > that attempt was killed and the attempt would be
> >>>>>>>>>>>>>> deemed a failure. I
> >>>>>>>>>>>>>> > > > > searched around about this error, and one of the
> >>>>>>>>>>>>>> suggestions to
> >>>>>>>>>>>>>> > include
> >>>>>>>>>>>>>> > > > > "progress" statements in the reducer -- it might be
> >>>>>>>>>>>>>> taking longer
> >>>>>>>>>>>>>> > than
> >>>>>>>>>>>>>> > > > 600
> >>>>>>>>>>>>>> > > > > seconds and so is timing out. I added calls to
> >>>>>>>>>>>>>> context.progress() and
> >>>>>>>>>>>>>> > > > > context.setStatus(str) in the reducer. Now, it works
> >>>>>>>>>>>>>> fine -- there
> >>>>>>>>>>>>>> > are
> >>>>>>>>>>>>>> > > no
> >>>>>>>>>>>>>> > > > > timeout errors.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >          But, for a few jobs, it takes awfully long
> >>>>>>>>>>>>>> time to move from
> >>>>>>>>>>>>>> > > > "Map
> >>>>>>>>>>>>>> > > > > 100%, Reduce 99%" to Reduce 100%. For some jobs its
> >>>>>>>>>>>>>> 15mins and for
> >>>>>>>>>>>>>> > some
> >>>>>>>>>>>>>> > > > it
> >>>>>>>>>>>>>> > > > > was more than an hour. The reduce code is not
> complex
> >>>>>>>>>>>>>> -- 2 level loop
> >>>>>>>>>>>>>> > > and
> >>>>>>>>>>>>>> > > > > couple of if-else blocks. The input size is also not
> >>>>>>>>>>>>>> huge, for the
> >>>>>>>>>>>>>> > job
> >>>>>>>>>>>>>> > > > that
> >>>>>>>>>>>>>> > > > > gets struck for an hour at reduce 99%, it would take
> >>>>>>>>>>>>>> in 130. Some of
> >>>>>>>>>>>>>> > > them
> >>>>>>>>>>>>>> > > > > are 1-3 MB in size and couple of them are 16MB in
> >>>>>>>>>>>>>> size.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >          Has anyone encountered this problem before?
> >>>>>>>>>>>>>> Any pointers? I
> >>>>>>>>>>>>>> > > use
> >>>>>>>>>>>>>> > > > > Hadoop 0.20.2 on a linux cluster of 16 nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> >>>>>>>>>>>>>> > > > > m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hi all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >        I am running a series of jobs one after
> >>>>>>>>>>>>>> another. While
> >>>>>>>>>>>>>> > executing
> >>>>>>>>>>>>>> > > > the
> >>>>>>>>>>>>>> > > > > 4th job, the job fails. It fails in the reducer ---
> >>>>>>>>>>>>>> the progress
> >>>>>>>>>>>>>> > > > percentage
> >>>>>>>>>>>>>> > > > > would be map 100%, reduce 99%. It gives out the
> >>>>>>>>>>>>>> following message
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> >>>>>>>>>>>>>> > > > > attempt_201003240138_0110_r_000018_1, Status :
> FAILED
> >>>>>>>>>>>>>> > > > > Task attempt_201003240138_0110_r_000018_1 failed to
> >>>>>>>>>>>>>> report status for
> >>>>>>>>>>>>>> > > 602
> >>>>>>>>>>>>>> > > > > seconds. Killing!
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > It makes several attempts again to execute it but
> >>>>>>>>>>>>>> fails with similar
> >>>>>>>>>>>>>> > > > > message. I couldn't get anything from this error
> >>>>>>>>>>>>>> message and wanted
> >>>>>>>>>>>>>> > to
> >>>>>>>>>>>>>> > > > look
> >>>>>>>>>>>>>> > > > > at logs (located in the default dir of
> >>>>>>>>>>>>>> ${HADOOP_HOME/logs}). But I
> >>>>>>>>>>>>>> > > don't
> >>>>>>>>>>>>>> > > > > find any files which match the timestamp of the job.
> >>>>>>>>>>>>>> Also I did not
> >>>>>>>>>>>>>> > > find
> >>>>>>>>>>>>>> > > > > history and userlogs in the logs folder. Should I
> look
> >>>>>>>>>>>>>> at some other
> >>>>>>>>>>>>>> > > > place
> >>>>>>>>>>>>>> > > > > for the logs? What could be the possible causes for
> >>>>>>>>>>>>>> the above error?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >        I am using Hadoop 0.20.2 and I am running it
> on
> >>>>>>>>>>>>>> a cluster with
> >>>>>>>>>>>>>> > > 16
> >>>>>>>>>>>>>> > > > > nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: Reduce gets struck at 99%

Posted by Ted Yu <yu...@gmail.com>.
Hi,
Putting this thread back in the pool to leverage collective intelligence.

If you get the full command line of the java processes, it wouldn't be
difficult to correlate reduce task(s) with a particular job.
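
For example (a rough sketch -- the job id below is just the one from your
earlier logs, and it assumes the child task JVMs carry the attempt id on
their command line, which the stock 0.20 Child processes should):

$ hadoop job -list                            # running jobs, e.g. job_201004060646_0057
$ jps -ml                                     # java pids with main class and arguments
$ ps -ef | grep attempt_201004060646_0057_r   # reduce attempts of that job -> pids
$ jstack <pid> > reducer.jstack               # stack dump of a stuck reducer

The attempt id embeds the job id (attempt_<jobtracker-start>_<job#>_r_...),
so grepping the full command lines for it ties each reduce task back to its
job.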

Cheers

On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju <
m.vijayaraghava@gmail.com> wrote:

> Hello Ted,
>
>        Thank you for the suggestions :). I haven't come across any other
> serious issue before this one. Infact, the same MR job runs for a smaller
> input size, although, lot slower than what we expected.
>
> I will use jstack to get stack trace. I had a question in this regard. How
> would I know which MR job (job id) is related to which java process (pid)? I
> can get a list of hadoop jobs with "hadoop job -list" and list of java
> processes with "jps" but how I couldn't determine how to get the connection
> between these 2 lists.
>
>
> Thank you again.
>
> Regards,
> Raghava.
>
> On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> If you look at
>> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776,
>> you can see that hdfs-127-branch20-redone-v2.txt<https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt>was the latest.
>>
>> You need to download the source code corresponding to your version of
>> hadoop, apply the patch and rebuild.
>>
>> If you haven't experienced serious issue with hadoop for other scenarios,
>> we should try to find out the root cause for the current problem without the
>> 127 patch.
>>
>> My advice is to use jstack to find what each thread was waiting for after
>> reducers get stuck.
>> I would expect a deadlock in either your code or hdfs, I would think it
>> should the former.
>>
>> You can replace sensitive names in the stack traces and paste it if you
>> cannot determine the deadlock.
>>
>> Cheers
>>
>>
>> On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <
>> m.vijayaraghava@gmail.com> wrote:
>>
>>> Hello Ted,
>>>
>>>       Thank you for the reply. Will this change fix my issue? I asked
>>> this because I again need to convince my admin to make this change.
>>>
>>>       We have a gateway to the cluster-head. We generally run our MR jobs
>>> on the gateway. Should this change be made to the hadoop installation on the
>>> gateway?
>>>
>>> 1) I am confused on which patch to be applied? There are 4 patches listed
>>> at https://issues.apache.org/jira/browse/HDFS-127
>>>
>>> 2) How to apply the patch? Should we change the lines of code specified
>>> and rebuild hadoop? Or is there any other way?
>>>
>>> Thank you again.
>>>
>>> Regards,
>>> Raghava.
>>>
>>>
>>> On Fri, Apr 16, 2010 at 6:42 PM, <yu...@gmail.com> wrote:
>>>
>>>> That patch is very important.
>>>>
>>>> please apply it.
>>>>
>>>> Sent from my Verizon Wireless BlackBerry
>>>> ------------------------------
>>>> *From: * Raghava Mutharaju <m....@gmail.com>
>>>> *Date: *Fri, 16 Apr 2010 17:27:11 -0400
>>>> *To: *Ted Yu<yu...@gmail.com>
>>>> *Subject: *Re: Reduce gets struck at 99%
>>>>
>>>> Hi Ted,
>>>>
>>>>         It took sometime to contact my department's admin (he was on
>>>> leave) and ask him to make ulimit changes effective in the cluster (just
>>>> adding entry in /etc/security/limits.conf was not sufficient, so took
>>>> sometime to figure out). Now the ulimit is 32768. I ran the set of MR jobs,
>>>> the result is the same --- it gets stuck at Reduce 99%. But this time, there
>>>> are no exceptions in the logs. I view JobTracker logs through the Web UI. I
>>>> checked "Running Jobs" as well as "Failed Jobs".
>>>>
>>>> I haven't asked the admin to apply the patch
>>>> https://issues.apache.org/jira/browse/HDFS-127 that you mentioned
>>>> earlier. Is this important?
>>>>
>>>> Do you any suggestions?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Raghava.
>>>>
>>>> On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> For the user under whom you launch MR jobs.
>>>>>
>>>>>
>>>>> On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <
>>>>> m.vijayaraghava@gmail.com> wrote:
>>>>>
>>>>>> Hi Ted,
>>>>>>
>>>>>>        Sorry to bug you again :) but I do not have an account on all
>>>>>> the datanodes, I just have it on the machine on which I start the MR jobs.
>>>>>> So is it required to increase the ulimit on all the nodes (in this case the
>>>>>> admin may have to increase it for all the users?)
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Raghava.
>>>>>>
>>>>>> On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>> ulimit should be increased on all nodes.
>>>>>>>
>>>>>>> The link I gave you lists several actions to take. I think they're
>>>>>>> not specifically for hbase.
>>>>>>> Also make sure the following is applied:
>>>>>>> https://issues.apache.org/jira/browse/HDFS-127
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <
>>>>>>> m.vijayaraghava@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello Ted,
>>>>>>>>
>>>>>>>>        Should the increase in ulimit to 32768 be applied on all the
>>>>>>>> datanodes (its a 16 node cluster)? Is this related to HBase, because I am
>>>>>>>> not using HBase.
>>>>>>>>        Are the exceptions & delay (at Reduce 99%) due to this?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Raghava.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Your ulimit is low.
>>>>>>>>> Ask your admin to increase it to 32768
>>>>>>>>>
>>>>>>>>> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <
>>>>>>>>> m.vijayaraghava@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ted,
>>>>>>>>>>
>>>>>>>>>> I am pasting below the timestamps from the log.
>>>>>>>>>>
>>>>>>>>>>        Lease-exception:
>>>>>>>>>>
>>>>>>>>>> Task Attempts Machine Status Progress Start Time Shuffle Finished
>>>>>>>>>> Sort Finished Finish Time Errors Task Logs
>>>>>>>>>>    Counters Actions
>>>>>>>>>>    attempt_201004060646_0057_r_000014_0 /default-rack/nimbus15
>>>>>>>>>> FAILED 0.00%
>>>>>>>>>>    8-Apr-2010 07:38:53 8-Apr-2010 07:39:21 (27sec) 8-Apr-2010
>>>>>>>>>> 07:39:21 (0sec) 8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)
>>>>>>>>>>
>>>>>>>>>> -------------------------------------
>>>>>>>>>>
>>>>>>>>>>         DFS Client Exception:
>>>>>>>>>>
>>>>>>>>>> Task Attempts Machine Status Progress Start Time Shuffle Finished
>>>>>>>>>> Sort Finished Finish Time Errors Task Logs
>>>>>>>>>>    Counters Actions
>>>>>>>>>>    attempt_201004060646_0057_r_000006_0 /default-rack/
>>>>>>>>>> nimbus3.cs.wright.edu FAILED 0.00%
>>>>>>>>>>    8-Apr-2010 07:38:47 8-Apr-2010 07:39:10 (23sec) 8-Apr-2010
>>>>>>>>>> 07:39:11 (0sec) 8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)
>>>>>>>>>> ------------------------------------------
>>>>>>>>>>
>>>>>>>>>> The file limit is set to 1024. I checked couple of datanodes. I
>>>>>>>>>> haven't checked the headnode though.
>>>>>>>>>>
>>>>>>>>>> The no of currently open files under my username, on the system on
>>>>>>>>>> which I started the MR jobs are 346
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you for you help :)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Raghava.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu <yu...@gmail.com>wrote:
>>>>>>>>>>
>>>>>>>>>>> Can you give me the timestamps of the two exceptions ?
>>>>>>>>>>> I want to see if they're related.
>>>>>>>>>>>
>>>>>>>>>>> I saw DFSClient$DFSOutputStream.close() in the first stack trace.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu <yu...@gmail.com>wrote:
>>>>>>>>>>>
>>>>>>>>>>>> just to double check it's not a file
>>>>>>>>>>>> limits issue could you run the following on each of the hosts:
>>>>>>>>>>>>
>>>>>>>>>>>> $ ulimit -a
>>>>>>>>>>>> $ lsof | wc -l
>>>>>>>>>>>>
>>>>>>>>>>>> The first command will show you (among other things) the file
>>>>>>>>>>>> limits, it
>>>>>>>>>>>> should be above the default 1024.  The second will tell you have
>>>>>>>>>>>> many files
>>>>>>>>>>>> are currently open...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 8, 2010 at 7:40 PM, Raghava Mutharaju <
>>>>>>>>>>>> m.vijayaraghava@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>>>
>>>>>>>>>>>>>         Thank you for all the suggestions. I went through the
>>>>>>>>>>>>> job tracker logs and I have attached the exceptions found in the logs. I
>>>>>>>>>>>>> found two exceptions
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) org.apache.hadoop.ipc.RemoteException: java.io.IOException:
>>>>>>>>>>>>> Could not complete write to file    (DFS Client)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) org.apache.hadoop.ipc.RemoteException:
>>>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>>>>>>>>>>>>> /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
>>>>>>>>>>>>> File does not exist. Holder DFSClient_attempt_201004060646_0057_r_000014_0
>>>>>>>>>>>>> does not have any open files.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The exception occurs at the point of writing out <K,V> pairs in
>>>>>>>>>>>>> the reducer and it occurs only in certain task attempts. I am not using any
>>>>>>>>>>>>> custom output format or record writers but I do use custom input reader.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What could have gone wrong here?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Raghava.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yu...@gmail.com>wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Raghava:
>>>>>>>>>>>>>> Are you able to share the last segment of reducer log ?
>>>>>>>>>>>>>> You can get them from web UI:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Adding more log in your reducer task would help pinpoint where
>>>>>>>>>>>>>> the issue is.
>>>>>>>>>>>>>> Also look in job tracker log.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <
>>>>>>>>>>>>>> m.vijayaraghava@gmail.com
>>>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Hi Ted,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >      Thank you for the suggestion. I enabled it using the
>>>>>>>>>>>>>> Configuration
>>>>>>>>>>>>>> > class because I cannot change hadoop-site.xml file (I am not
>>>>>>>>>>>>>> an admin). The
>>>>>>>>>>>>>> > situation is still the same --- it gets stuck at reduce 99%
>>>>>>>>>>>>>> and does not
>>>>>>>>>>>>>> > move further.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Regards,
>>>>>>>>>>>>>> > Raghava.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yu...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > > You need to turn on yourself (hadoop-site.xml):
>>>>>>>>>>>>>> > > <property>
>>>>>>>>>>>>>> > >  <name>mapred.reduce.tasks.speculative.execution</name>
>>>>>>>>>>>>>> > >  <value>true</value>
>>>>>>>>>>>>>> > > </property>
>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>> > > <property>
>>>>>>>>>>>>>> > >  <name>mapred.map.tasks.speculative.execution</name>
>>>>>>>>>>>>>> > >  <value>true</value>
>>>>>>>>>>>>>> > > </property>
>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
>>>>>>>>>>>>>> > > m.vijayaraghava@gmail.com
>>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>> > > > Hi,
>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>> > > >     Thank you Eric, Prashant and Greg. Although the
>>>>>>>>>>>>>> timeout problem was
>>>>>>>>>>>>>> > > > resolved, reduce is getting stuck at 99%. As of now, it
>>>>>>>>>>>>>> has been stuck
>>>>>>>>>>>>>> > > > there
>>>>>>>>>>>>>> > > > for about 3 hrs. That is too high a wait time for my
>>>>>>>>>>>>>> task. Do you guys
>>>>>>>>>>>>>> > > see
>>>>>>>>>>>>>> > > > any reason for this?
>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>> > > >      Speculative execution is "on" by default right? Or
>>>>>>>>>>>>>> should I enable
>>>>>>>>>>>>>> > > it?
>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>> > > > Regards,
>>>>>>>>>>>>>> > > > Raghava.
>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <
>>>>>>>>>>>>>> gregl@yahoo-inc.com
>>>>>>>>>>>>>> > > > >wrote:
>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>> > > > >  Hi,
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > I have also experienced this problem. Have you tried
>>>>>>>>>>>>>> speculative
>>>>>>>>>>>>>> > > > execution?
>>>>>>>>>>>>>> > > > > Also, I have had jobs that took a long time for one
>>>>>>>>>>>>>> mapper / reducer
>>>>>>>>>>>>>> > > > because
>>>>>>>>>>>>>> > > > > of a record that was significantly larger than those
>>>>>>>>>>>>>> contained in the
>>>>>>>>>>>>>> > > > other
>>>>>>>>>>>>>> > > > > filesplits. Do you know if it always slows down for
>>>>>>>>>>>>>> the same
>>>>>>>>>>>>>> > filesplit?
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Regards,
>>>>>>>>>>>>>> > > > > Greg Lawrence
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <
>>>>>>>>>>>>>> m.vijayaraghava@gmail.com>
>>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Hello all,
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >          I got the time out error as mentioned below
>>>>>>>>>>>>>> -- after 600
>>>>>>>>>>>>>> > > > seconds,
>>>>>>>>>>>>>> > > > > that attempt was killed and the attempt would be
>>>>>>>>>>>>>> deemed a failure. I
>>>>>>>>>>>>>> > > > > searched around about this error, and one of the
>>>>>>>>>>>>>> suggestions to
>>>>>>>>>>>>>> > include
>>>>>>>>>>>>>> > > > > "progress" statements in the reducer -- it might be
>>>>>>>>>>>>>> taking longer
>>>>>>>>>>>>>> > than
>>>>>>>>>>>>>> > > > 600
>>>>>>>>>>>>>> > > > > seconds and so is timing out. I added calls to
>>>>>>>>>>>>>> context.progress() and
>>>>>>>>>>>>>> > > > > context.setStatus(str) in the reducer. Now, it works
>>>>>>>>>>>>>> fine -- there
>>>>>>>>>>>>>> > are
>>>>>>>>>>>>>> > > no
>>>>>>>>>>>>>> > > > > timeout errors.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >          But, for a few jobs, it takes awfully long
>>>>>>>>>>>>>> time to move from
>>>>>>>>>>>>>> > > > "Map
>>>>>>>>>>>>>> > > > > 100%, Reduce 99%" to Reduce 100%. For some jobs its
>>>>>>>>>>>>>> 15mins and for
>>>>>>>>>>>>>> > some
>>>>>>>>>>>>>> > > > it
>>>>>>>>>>>>>> > > > > was more than an hour. The reduce code is not complex
>>>>>>>>>>>>>> -- 2 level loop
>>>>>>>>>>>>>> > > and
>>>>>>>>>>>>>> > > > > couple of if-else blocks. The input size is also not
>>>>>>>>>>>>>> huge, for the
>>>>>>>>>>>>>> > job
>>>>>>>>>>>>>> > > > that
>>>>>>>>>>>>>> > > > > gets struck for an hour at reduce 99%, it would take
>>>>>>>>>>>>>> in 130. Some of
>>>>>>>>>>>>>> > > them
>>>>>>>>>>>>>> > > > > are 1-3 MB in size and couple of them are 16MB in
>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >          Has anyone encountered this problem before?
>>>>>>>>>>>>>> Any pointers? I
>>>>>>>>>>>>>> > > use
>>>>>>>>>>>>>> > > > > Hadoop 0.20.2 on a linux cluster of 16 nodes.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Thank you.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Regards,
>>>>>>>>>>>>>> > > > > Raghava.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
>>>>>>>>>>>>>> > > > > m.vijayaraghava@gmail.com> wrote:
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Hi all,
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >        I am running a series of jobs one after
>>>>>>>>>>>>>> another. While
>>>>>>>>>>>>>> > executing
>>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>>> > > > > 4th job, the job fails. It fails in the reducer ---
>>>>>>>>>>>>>> the progress
>>>>>>>>>>>>>> > > > percentage
>>>>>>>>>>>>>> > > > > would be map 100%, reduce 99%. It gives out the
>>>>>>>>>>>>>> following message
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
>>>>>>>>>>>>>> > > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
>>>>>>>>>>>>>> > > > > Task attempt_201003240138_0110_r_000018_1 failed to
>>>>>>>>>>>>>> report status for
>>>>>>>>>>>>>> > > 602
>>>>>>>>>>>>>> > > > > seconds. Killing!
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > It makes several attempts again to execute it but
>>>>>>>>>>>>>> fails with similar
>>>>>>>>>>>>>> > > > > message. I couldn't get anything from this error
>>>>>>>>>>>>>> message and wanted
>>>>>>>>>>>>>> > to
>>>>>>>>>>>>>> > > > look
>>>>>>>>>>>>>> > > > > at logs (located in the default dir of
>>>>>>>>>>>>>> ${HADOOP_HOME/logs}). But I
>>>>>>>>>>>>>> > > don't
>>>>>>>>>>>>>> > > > > find any files which match the timestamp of the job.
>>>>>>>>>>>>>> Also I did not
>>>>>>>>>>>>>> > > find
>>>>>>>>>>>>>> > > > > history and userlogs in the logs folder. Should I look
>>>>>>>>>>>>>> at some other
>>>>>>>>>>>>>> > > > place
>>>>>>>>>>>>>> > > > > for the logs? What could be the possible causes for
>>>>>>>>>>>>>> the above error?
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >        I am using Hadoop 0.20.2 and I am running it on
>>>>>>>>>>>>>> a cluster with
>>>>>>>>>>>>>> > > 16
>>>>>>>>>>>>>> > > > > nodes.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Thank you.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > > Regards,
>>>>>>>>>>>>>> > > > > Raghava.
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>