
Posted to user@hive.apache.org by Gopal Vijayaraghavan <go...@apache.org> on 2016/07/06 06:55:15 UTC

Re: Tez jobs on YARN failing sporadically..


> when the executor is overwhelmed with tasks or execute() is called while
> shutting down. I'm confounded as to why this would be an issue suddenly.

> Container container_e23_1466828114374_53316_01_000018 finished with
> diagnostics set to Container failed, exitCode=-1000. Task
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@6c5f576
> rejected from java.util.concurrent.ThreadPoolExecutor@9bf8295
> Terminated, pool size = 0, active threads = 0, queued tasks = 0,
> completed tasks = 111

As always, this needs more info, mostly from the output of
yarn logs -applicationId <application>.
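If it helps, the <application> value can be read straight off the container
ID in the diagnostics. A minimal sketch, assuming the usual
container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
layout (ContainerIdToAppId and applicationIdOf are hypothetical names, not
Hadoop API):

```java
public class ContainerIdToAppId {

    // Hypothetical helper. Assumes the usual YARN container ID layout:
    //   container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
    // and returns application_<clusterTimestamp>_<appSeq>.
    static String applicationIdOf(String containerId) {
        String[] parts = containerId.split("_");
        if (parts.length < 5 || !parts[0].equals("container")) {
            throw new IllegalArgumentException("Unexpected container ID: " + containerId);
        }
        // Skip the optional epoch field (e.g. "e23") if present.
        int ts = parts[1].startsWith("e") ? 2 : 1;
        if (parts.length != ts + 4) {
            throw new IllegalArgumentException("Unexpected container ID: " + containerId);
        }
        return "application_" + parts[ts] + "_" + parts[ts + 1];
    }

    public static void main(String[] args) {
        String containerId = args.length > 0
                ? args[0]
                : "container_e23_1466828114374_53316_01_000018";
        // Print the command to run on a cluster gateway host.
        System.out.println("yarn logs -applicationId " + applicationIdOf(containerId));
    }
}
```

Run against the container from the diagnostics above, it prints
yarn logs -applicationId application_1466828114374_53316.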

It's not entirely clear whether this is happening in the NM or the task
itself.

The "active threads = 0" suggests this might be related to the pam_limits
nproc limit, causing threads to exit without running.

Did you reboot the system recently?

Cheers,
Gopal

Re: Tez jobs on YARN failing sporadically..

Posted by Gautam <ga...@gmail.com>.

We found out what happened here. As suspected, this wasn't an issue with
Tez. The job localizer thread on some NMs was crashing with:

2016-07-02 10:20:17,881 ERROR
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Failed to submit rsrc { {
hdfs://master-nn-host:8020/parquet_loader/0052919-160630152347927-oozie-oozi-W/script.q,
1467450680162, FILE, null
},pending,[(container_e25_1467304052008_27086_01_000077)],36144839749319326,FAILED}
for download. Either queue is full or threadpool is shutdown.
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.ExecutorCompletionService$QueueingFuture@921a73e
rejected from java.util.concurrent.ThreadPoolExecutor@3283d190[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
109]
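Both traces have the shape of the message that ThreadPoolExecutor's default
AbortPolicy builds when it rejects a task: "Task <task> rejected from
<executor>[<state>, pool size = ..., queued tasks = ..., ...]". The bracketed
state helps tell apart the two causes the log mentions: a saturated pool
reports Running with a non-empty queue, while a shut-down pool reports
Shutting down or Terminated. A minimal sketch in plain JDK code (not
NodeManager code; RejectionDemo and its methods are hypothetical names) that
reproduces both cases through an ExecutorCompletionService, as in the traces:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {

    // Submits a no-op Callable via an ExecutorCompletionService and returns
    // the rejection message, or null if the task was accepted.
    static String rejectionMessage(ThreadPoolExecutor pool) {
        ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
        try {
            ecs.submit(() -> null);
            return null;
        } catch (RejectedExecutionException e) {
            return e.getMessage();
        }
    }

    // Case 1: a pool that has been shut down and has fully terminated.
    static String terminatedCase() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return rejectionMessage(pool);
    }

    // Case 2: a running pool whose only worker is busy and whose
    // single-slot queue is already full.
    static String saturatedCase() {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        pool.execute(() -> {
            try {
                release.await(); // keep the only worker busy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pool.execute(() -> { }); // occupies the only queue slot
        try {
            return rejectionMessage(pool);
        } finally {
            release.countDown(); // let the queued work drain
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Expect "...[Terminated, pool size = 0, ..., queued tasks = 0, ...]"
        System.out.println(terminatedCase());
        // Expect "...[Running, pool size = 1, ..., queued tasks = 1, ...]"
        System.out.println(saturatedCase());
    }
}
```

In the NM trace, Terminated with queued tasks = 0 points at a shut-down pool
rather than a full queue. A terminated ThreadPoolExecutor never accepts work
again, so a service holding one keeps rejecting submissions until the pool is
recreated, which is consistent with an NM restart clearing the problem.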



I think we ran into one of the many localization issues reported here:
https://issues.apache.org/jira/browse/YARN-543

In particular, the symptom is that the NM fails to spawn the task container
due to initialization issues. This affected MR and Tez jobs alike, sometimes
even crashing the AM initialization itself.

*Restarting the affected NMs fixed the issue.*


-Gautam.



-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."