Posted to hdfs-user@hadoop.apache.org by ed <ed...@gmail.com> on 2015/10/16 21:41:36 UTC

YARN: "Unauthorized request to start container, Expired Token" causes job failure

Hello,

We just kicked off a large MR job that uses all the containers on our
cluster.  The job ran for 24 hours and then failed with the following error
in the map phase (no reducers had started yet):

2015-10-16 12:38:17,781 ERROR [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1444916180373_0003_01_089692 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.

This token is expired. current time is 1445013467749 found 1445013416633

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
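
(For reference, the two values in the error message appear to be epoch
milliseconds; if so, 1445013467749 - 1445013416633 = 51,116 ms, meaning the
token presented for this container had expired only about 51 seconds before
the NodeManager checked it, which seems too small a gap to be explained by
clock skew.)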


The job has not had issues in the past, although this time it was running on
a particularly large dataset.  I checked all of our nodes and the times on
the nodes are all properly synced with NTP.  I found the JIRA issue
YARN-1417, which seems to describe the problem we're having
(https://issues.apache.org/jira/browse/YARN-1417), but that issue is marked
resolved and the patch was included in CDH 5.0.0 (we are running 5.0.2), so
we should not be hitting that particular problem.


Could this be another bug in YARN related to expired tokens being
assigned?  I searched through JIRA but did not see any open issues that
might relate to the error we're seeing.  Are there any workarounds for this,
or has anyone seen this happen before?  Please let me know if there is any
other information I can provide.


Best Regards,


Ed Dorsey

Re: YARN: "Unauthorized request to start container, Expired Token" causes job failure

Posted by ed <ed...@gmail.com>.
I'm still working on this issue where the expired token error kills a
long-running job.  I noticed that the job failed soon after the 24-hour mark,
and that there is a setting,
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs",
which by default is set to 24 hours.  I could not find more information on it
beyond the description, which states that this is the interval at which the
master key used to generate container tokens is rolled over.  Is it possible
that this master key rolled over at 24 hours and thus caused the expired
token issue?

Unfortunately I could not find the
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
setting in Cloudera Manager (I know that is Cloudera-specific), but I think
I can set it manually if anyone thinks that is worth trying.
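
In case it helps, this is roughly what I was planning to try by hand. I am
assuming the property goes into yarn-site.xml on the ResourceManager (in
Cloudera Manager that would presumably mean the yarn-site.xml safety valve),
and the 172800 below is just an example of doubling the stated 86400-second
(24-hour) default to see whether the failure point moves:

  <property>
    <name>yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs</name>
    <value>172800</value>
  </property>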

Best Regards,

Ed Dorsey

On Fri, Oct 16, 2015 at 3:41 PM, ed <ed...@gmail.com> wrote:

> Hello,
>
> We just kicked off a large MR job that uses all the containers on our
> cluster.  The job ran for 24 hours and then failed with the following error
> in the map phase (no reducers had started yet):
>
> 2015-10-16 12:38:17,781 ERROR [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1444916180373_0003_01_089692 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
>
> This token is expired. current time is 1445013467749 found 1445013416633
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>         at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> The job has not had issues in the past, although this time it was running
> on a particularly large dataset.  I checked all of our nodes and the times
> on the nodes are all properly synced with NTP.  I found the JIRA issue
> YARN-1417, which seems to describe the problem we're having
> (https://issues.apache.org/jira/browse/YARN-1417), but that issue is marked
> resolved and the patch was included in CDH 5.0.0 (we are running 5.0.2), so
> we should not be hitting that particular problem.
>
>
> Could this be another bug in YARN related to expired tokens being
> assigned?  I searched through JIRA but did not see any open issues that
> might relate to the error we're seeing.  Are there any workarounds for this,
> or has anyone seen this happen before?  Please let me know if there is any
> other information I can provide.
>
>
> Best Regards,
>
>
> Ed Dorsey
>

Re: YARN: "Unauthorized request to start container, Expired Token" causes job failure

Posted by ed <ed...@gmail.com>.
I'm still working on this issue with the expired token error killing a long
running job.  I noticed that the job failed soon after 24 hours and that
there is a setting
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
which by default is set to 24 hours.  I could not find more information on
this other than the description which states that this is the interval at
which the master key rollovers to generate container tokens.  Is it
possible that this master key rolled over at 24 hours and thus caused the
expired token issue?

Unfortunately I could not find the the
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
setting in Cloudera Manager (I know that is cloudera specific) but I think
I can set it manually if anyone thinks that is worth trying.

Best Regards,

Ed Dorsey

On Fri, Oct 16, 2015 at 3:41 PM, ed <ed...@gmail.com> wrote:

> Hello,
>
> We just kicked off a large MR job that uses all the containers on our
> cluster.  The job ran for 24 hours and then failed with the following error
> in the map phase (no reducers had started yet):
>
> 2015-10-16 12:38:17,781 ERROR [ContainerLauncher #2]
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> Container launch failed for container_1444916180373_0003_01_089692 :
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to
> start container.
>
> This token is expired. current time is 1445013467749 found 1445013416633
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>
>         at
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>
>         at
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>
>         at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>
>         at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> The job has not had issues in the past although this time it was running
> on a particularly large dataset.  I checked all of our nodes and the times
> on the nodes are all properly synced with NTP.  I found the JIRA issue
> YARN-1417 which seems to describe the problem we're having (
> https://issues.apache.org/jira/browse/YARN-1417) but this issue is mark
> resolved and the patch was included in CDH5.0.0 (we are running 5.0.2) so
> we should not be having that particular problem.
>
>
> Could this be another bug in YARN related to expired tokens being
> assigned?  I searched through JIRA but did not see any open issues that
> might relate to the error we're seeing.  Are there any work around to this
> or has anyone seen this happen before?  Please let me know if there is any
> other information I can provide.
>
>
> Best Regards,
>
>
> Ed Dorsey
>

Re: YARN: "Unauthorized request to start container, Expired Token" causes job failure

Posted by ed <ed...@gmail.com>.
I'm still working on this issue with the expired token error killing a long
running job.  I noticed that the job failed soon after 24 hours and that
there is a setting
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
which by default is set to 24 hours.  I could not find more information on
this other than the description which states that this is the interval at
which the master key rollovers to generate container tokens.  Is it
possible that this master key rolled over at 24 hours and thus caused the
expired token issue?

Unfortunately I could not find the the
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
setting in Cloudera Manager (I know that is cloudera specific) but I think
I can set it manually if anyone thinks that is worth trying.

Best Regards,

Ed Dorsey

On Fri, Oct 16, 2015 at 3:41 PM, ed <ed...@gmail.com> wrote:

> Hello,
>
> We just kicked off a large MR job that uses all the containers on our
> cluster.  The job ran for 24 hours and then failed with the following error
> in the map phase (no reducers had started yet):
>
> 2015-10-16 12:38:17,781 ERROR [ContainerLauncher #2]
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> Container launch failed for container_1444916180373_0003_01_089692 :
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to
> start container.
>
> This token is expired. current time is 1445013467749 found 1445013416633
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>
>         at
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>
>         at
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>
>         at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>
>         at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> The job has not had issues in the past although this time it was running
> on a particularly large dataset.  I checked all of our nodes and the times
> on the nodes are all properly synced with NTP.  I found the JIRA issue
> YARN-1417 which seems to describe the problem we're having (
> https://issues.apache.org/jira/browse/YARN-1417) but this issue is mark
> resolved and the patch was included in CDH5.0.0 (we are running 5.0.2) so
> we should not be having that particular problem.
>
>
> Could this be another bug in YARN related to expired tokens being
> assigned?  I searched through JIRA but did not see any open issues that
> might relate to the error we're seeing.  Are there any work around to this
> or has anyone seen this happen before?  Please let me know if there is any
> other information I can provide.
>
>
> Best Regards,
>
>
> Ed Dorsey
>

Re: YARN: "Unauthorized request to start container, Expired Token" causes job failure

Posted by ed <ed...@gmail.com>.
I'm still working on this issue with the expired token error killing a long
running job.  I noticed that the job failed soon after 24 hours and that
there is a setting
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
which by default is set to 24 hours.  I could not find more information on
this other than the description which states that this is the interval at
which the master key rollovers to generate container tokens.  Is it
possible that this master key rolled over at 24 hours and thus caused the
expired token issue?

Unfortunately I could not find the the
"yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs"
setting in Cloudera Manager (I know that is cloudera specific) but I think
I can set it manually if anyone thinks that is worth trying.

Best Regards,

Ed Dorsey

On Fri, Oct 16, 2015 at 3:41 PM, ed <ed...@gmail.com> wrote:

> Hello,
>
> We just kicked off a large MR job that uses all the containers on our
> cluster.  The job ran for 24 hours and then failed with the following error
> in the map phase (no reducers had started yet):
>
> 2015-10-16 12:38:17,781 ERROR [ContainerLauncher #2]
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> Container launch failed for container_1444916180373_0003_01_089692 :
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to
> start container.
>
> This token is expired. current time is 1445013467749 found 1445013416633
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>
>         at
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>
>         at
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>
>         at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>
>         at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> The job has not had issues in the past although this time it was running
> on a particularly large dataset.  I checked all of our nodes and the times
> on the nodes are all properly synced with NTP.  I found the JIRA issue
> YARN-1417 which seems to describe the problem we're having (
> https://issues.apache.org/jira/browse/YARN-1417) but this issue is mark
> resolved and the patch was included in CDH5.0.0 (we are running 5.0.2) so
> we should not be having that particular problem.
>
>
> Could this be another bug in YARN related to expired tokens being
> assigned?  I searched through JIRA but did not see any open issues that
> might relate to the error we're seeing.  Are there any work around to this
> or has anyone seen this happen before?  Please let me know if there is any
> other information I can provide.
>
>
> Best Regards,
>
>
> Ed Dorsey
>