Posted to user@hadoop.apache.org by Alexandru Pacurar <Al...@PropertyShark.com> on 2015/02/11 11:03:41 UTC

Time out after 600 for YARN mapreduce application

Hello,

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM, and the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

yarn.am.liveness-monitor.expiry-interval-ms
yarn.nm.liveness-monitor.expiry-interval-ms
yarn.resourcemanager.nm.liveness-monitor.interval-ms

but I would like to understand why.
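
For what it's worth, the two expiry intervals above default to 600000 ms (10 minutes) in Hadoop 2.x as far as I know, which lines up with the "600 secs" in the attempt message, while the RM-side check interval defaults to 1000 ms. A minimal sketch that just reads the effective values back from the loaded configuration; the class name and the fallback defaults are my own assumptions, not something taken from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PrintLivenessIntervals {
    public static void main(String[] args) {
        // Loads yarn-default.xml and yarn-site.xml from the classpath.
        Configuration conf = new YarnConfiguration();
        // The fallback values below are the 2.x defaults as I understand them.
        System.out.println("yarn.am.liveness-monitor.expiry-interval-ms = "
                + conf.getLong("yarn.am.liveness-monitor.expiry-interval-ms", 600000L));
        System.out.println("yarn.nm.liveness-monitor.expiry-interval-ms = "
                + conf.getLong("yarn.nm.liveness-monitor.expiry-interval-ms", 600000L));
        System.out.println("yarn.resourcemanager.nm.liveness-monitor.interval-ms = "
                + conf.getLong("yarn.resourcemanager.nm.liveness-monitor.interval-ms", 1000L));
    }
}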

The issue usually appears under heavier load, and most of the time the next attempt succeeds. Also, if I restart the Hadoop cluster the error goes away for some time.

Thanks,
Alex

RE: Time out after 600 for YARN mapreduce application

Posted by Alexandru Pacurar <Al...@PropertyShark.com>.
Hello,

I've observed that the error only occurs on one datanode, and, to make it more interesting, when studying the CPU graphs on that particular node we noticed that there was no more CPU activity on it.
The nodemanager process is still running, but I don't register any CPU spikes, so I'm guessing it isn't doing anything. Looking through the nodemanager logs, all it does is start containers:

2015-02-11 13:03:56,111 INFO  org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1423062241884_12668_01_000028
2015-02-11 13:03:56,112 INFO  org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1423062241884_12668_01_000028 to application application_1423062241884_12668

Then it tries to localize the container:

2015-02-11 13:03:56,171 INFO  org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1423062241884_12668_01_000028 transitioned from NEW to LOCALIZING
2015-02-11 13:03:56,173 INFO  org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1423062241884_12668_01_000028
2015-02-11 13:03:56,191 INFO  org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /hadoop/tmp/nmPrivate/container_1423062241884_12668_01_000028.tokens. Credentials list

And then after 10 minutes it kills it:

2015-02-11 13:14:21,941 INFO  org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1423062241884_12668_01_000028
2015-02-11 13:14:21,941 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1423062241884_12668_01_000028 transitioned from LOCALIZING to KILLING

And then the usual stuff like removing it from HDFS, and aggregating the logs.

The CPU inactivity and the localizing problem both seem to start after I get this error in the logs:

2015-02-11 13:03:28,399 WARN org.apache.hadoop.util.Shell: Interrupted while reading the error stream
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1281)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:497)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
        at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1061)
        at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
        at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
        at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
        at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppLogDirs(DefaultContainerExecutor.java:547)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:95)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995)
2015-02-11 13:03:28,706 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1423062241884_12666_01_000045

I have found an issue regarding this: HADOOP-10622 (https://issues.apache.org/jira/browse/HADOOP-10622).
It affects 2.3 and I'm using 2.4.1, but seeing that it's the same error I'll try the patch.

Also, as strange behavior, after the error I get a lot of these messages in the log:

2015-02-11 13:03:38,272 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...
2015-02-11 13:03:38,277 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...
2015-02-11 13:03:38,368 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...
2015-02-11 13:03:38,396 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...
2015-02-11 13:03:38,427 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...

Thank you,
Alex

From: Alexandru Pacurar
Sent: Thursday, February 12, 2015 9:42 AM
To: user@hadoop.apache.org
Subject: RE: Time out after 600 for YARN mapreduce application

Hello,

Regarding the AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs error, I managed to get an extended status for it. The other message that I get is java.lang.Exception: Container is not yet running. Current state is LOCALIZING. So the container spends 10 minutes in the LOCALIZING state, then it fails.

Thank you,
Alex

From: Alexandru Pacurar
Sent: Wednesday, February 11, 2015 1:35 PM
To: user@hadoop.apache.org
Subject: RE: Time out after 600 for YARN mapreduce application

Thank you for the quick reply.

I will modify the value to check if this is the threshold I'm hitting, but I was actually thinking of decreasing it, because my jobs take too long if they hit this timeout. I would rather fail fast than keep the cluster busy with jobs stuck in timeouts. Ideally I would like to troubleshoot the issue and not fail at all :) .
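
If decreasing it turns out to be the right call, the per-task threshold mentioned below (mapreduce.task.timeout) can be lowered on the job's configuration before submission. A minimal sketch, assuming a driver you control; the class name and the 120000 ms value are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailFastJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Kill an attempt after 2 minutes without progress instead of the
        // (assumed) 600000 ms default; a value of 0 disables the timeout entirely.
        conf.setLong("mapreduce.task.timeout", 120000L);

        Job job = Job.getInstance(conf, "fail-fast example");
        // ... the usual mapper/reducer/input/output setup would go here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For jobs launched from an existing tool, the same property can usually be supplied as a -D option if the tool goes through ToolRunner; I have not checked whether the Nutch 1.8 jobs do.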

My MR job is not a custom one; it is a job from Nutch 1.8. Actually there are several jobs from Nutch that fail (e.g. Generator, Indexer).

Also, because this is related to Nutch 1.8, should I move the question to the Nutch mailing list?

Thanks,
Alex

From: Rohith Sharma K S [mailto:rohithsharmaks@huawei.com]
Sent: Wednesday, February 11, 2015 12:32 PM
To: user@hadoop.apache.org
Subject: RE: Time out after 600 for YARN mapreduce application

Looking at the attempt ID, this is a mapper task getting timed out in a MapReduce job.  The configuration that can be used to increase the value is 'mapreduce.task.timeout'.

The task is timed out because there is no heartbeat from the MapperTask (YarnChild) to the MRAppMaster for 10 minutes.  Is your MR job a custom job?  If so, are you doing any operations in cleanup() of the Mapper?  It is possible that if cleanup() of the Mapper takes longer than the configured timeout, the task will time out.
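
A minimal sketch of the scenario described above, with a hypothetical mapper (the class name and the placeholder workload are made up): any long-running work in cleanup() or map() should call context.progress() so the MRAppMaster keeps receiving a heartbeat within mapreduce.task.timeout.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SlowCleanupMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, key);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        for (int step = 0; step < 100; step++) {
            doExpensiveTeardownWork(step);   // placeholder for slow teardown work
            context.progress();              // heartbeat to the MRAppMaster
            context.setStatus("cleanup step " + step);
        }
    }

    private void doExpensiveTeardownWork(int step) {
        // placeholder; a real job might flush caches or close connections here
    }
}

(In this particular thread the attempts never seem to leave LOCALIZING, so the missing heartbeat comes from a container that never starts rather than from a slow cleanup(), as noted earlier in the thread.)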


Thanks & Regards
Rohith Sharma K S
From: Alexandru Pacurar [mailto:Alexandru.Pacurar@PropertyShark.com]
Sent: 11 February 2015 15:34
To: user@hadoop.apache.org
Subject: Time out after 600 for YARN mapreduce application

Hello,

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM, and the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

yarn.am.liveness-monitor.expiry-interval-ms
yarn.nm.liveness-monitor.expiry-interval-ms
yarn.resourcemanager.nm.liveness-monitor.interval-ms

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempt succeeds. Also, if I restart the Hadoop cluster the error goes away for some time.

Thanks,
Alex

RE: Time out after 600 for YARN mapreduce application

Posted by Alexandru Pacurar <Al...@PropertyShark.com>.
Hello,

Regarding the AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs error, I managed to get an extended status for it. The other message that I get is java.lang.Exception: Container is not yet running. Current state is LOCALIZING. So the container spends 10 minutes in the LOCALIZING state, then it fails.

Thank you,
Alex

From: Alexandru Pacurar
Sent: Wednesday, February 11, 2015 1:35 PM
To: user@hadoop.apache.org
Subject: RE: Time out after 600 for YARN mapreduce application

Thank you for the quick reply.

I will modify the value to check if this is the threshold I'm hitting, but I was actually thinking of decreasing it, because my jobs take too long if they hit this timeout. I would rather fail fast than keep the cluster busy with jobs stuck in timeouts. Ideally I would like to troubleshoot the issue and not fail at all :) .

My MR job is not a custom one; it is a job from Nutch 1.8. Actually there are several jobs from Nutch that fail (e.g. Generator, Indexer).

Also, because this is related to Nutch 1.8, should I move the question to the Nutch mailing list?

Thanks,
Alex

From: Rohith Sharma K S [mailto:rohithsharmaks@huawei.com]
Sent: Wednesday, February 11, 2015 12:32 PM
To: user@hadoop.apache.org
Subject: RE: Time out after 600 for YARN mapreduce application

Looking at the attempt ID, this is a mapper task getting timed out in a MapReduce job.  The configuration that can be used to increase the value is 'mapreduce.task.timeout'.

The task is timed out because there is no heartbeat from the MapperTask (YarnChild) to the MRAppMaster for 10 minutes.  Is your MR job a custom job?  If so, are you doing any operations in cleanup() of the Mapper?  It is possible that if cleanup() of the Mapper takes longer than the configured timeout, the task will time out.


Thanks & Regards
Rohith Sharma K S
From: Alexandru Pacurar [mailto:Alexandru.Pacurar@PropertyShark.com]
Sent: 11 February 2015 15:34
To: user@hadoop.apache.org
Subject: Time out after 600 for YARN mapreduce application

Hello,

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM, and the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

yarn.am.liveness-monitor.expiry-interval-ms
yarn.nm.liveness-monitor.expiry-interval-ms
yarn.resourcemanager.nm.liveness-monitor.interval-ms

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempt succeeds. Also, if I restart the Hadoop cluster the error goes away for some time.

Thanks,
Alex

RE: Time out after 600 for YARN mapreduce application

Posted by Alexandru Pacurar <Al...@PropertyShark.com>.
Thank you for the quick reply.

I will modify the value to check if this is the threshold I'm hitting, but I was actually thinking of decreasing it, because my jobs take too long if they hit this timeout. I would rather fail fast than keep the cluster busy with jobs stuck in timeouts. Ideally I would like to troubleshoot the issue and not fail at all :) .

My MR job is not a custom one; it is a job from Nutch 1.8. Actually there are several jobs from Nutch that fail (e.g. Generator, Indexer).

Also, because this is related to Nutch 1.8, should I move the question to the Nutch mailing list?

Thanks,
Alex

From: Rohith Sharma K S [mailto:rohithsharmaks@huawei.com]
Sent: Wednesday, February 11, 2015 12:32 PM
To: user@hadoop.apache.org
Subject: RE: Time out after 600 for YARN mapreduce application

Looking at the attempt ID, this is a mapper task getting timed out in a MapReduce job.  The configuration that can be used to increase the value is 'mapreduce.task.timeout'.

The task is timed out because there is no heartbeat from the MapperTask (YarnChild) to the MRAppMaster for 10 minutes.  Is your MR job a custom job?  If so, are you doing any operations in cleanup() of the Mapper?  It is possible that if cleanup() of the Mapper takes longer than the configured timeout, the task will time out.


Thanks & Regards
Rohith Sharma K S
From: Alexandru Pacurar [mailto:Alexandru.Pacurar@PropertyShark.com]
Sent: 11 February 2015 15:34
To: user@hadoop.apache.org
Subject: Time out after 600 for YARN mapreduce application

Hello,

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM, and the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

yarn.am.liveness-monitor.expiry-interval-ms
yarn.nm.liveness-monitor.expiry-interval-ms
yarn.resourcemanager.nm.liveness-monitor.interval-ms

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempt succeeds. Also, if I restart the Hadoop cluster the error goes away for some time.

Thanks,
Alex

RE: Time out after 600 for YARN mapreduce application

Posted by Rohith Sharma K S <ro...@huawei.com>.
Looking at the attempt ID, this is a mapper task getting timed out in a MapReduce job.  The configuration that can be used to increase the value is 'mapreduce.task.timeout'.

The task is timed out because there is no heartbeat from the MapperTask (YarnChild) to the MRAppMaster for 10 minutes.  Is your MR job a custom job?  If so, are you doing any operations in cleanup() of the Mapper?  It is possible that if cleanup() of the Mapper takes longer than the configured timeout, the task will time out.


Thanks & Regards
Rohith Sharma K S
From: Alexandru Pacurar [mailto:Alexandru.Pacurar@PropertyShark.com]
Sent: 11 February 2015 15:34
To: user@hadoop.apache.org
Subject: Time out after 600 for YARN mapreduce application

Hello,

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM, and the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

yarn.am.liveness-monitor.expiry-interval-ms
yarn.nm.liveness-monitor.expiry-interval-ms
yarn.resourcemanager.nm.liveness-monitor.interval-ms

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempt succeeds. Also, if I restart the Hadoop cluster the error goes away for some time.

Thanks,
Alex
