Posted to user@hive.apache.org by Gautam <ga...@gmail.com> on 2016/06/29 00:58:40 UTC

Tez jobs on YARN failing sporadically..

Hello,

We use Tez for one of our main ETL workflows and have been
running it for a couple of months now. We recently started seeing the following
error for a query that runs regularly and hasn't been changed in any way.
It's a job that counts an hour's worth of data in a M-R-R flow. The error
occurs in the Map phase. I could send more details about the job, but I
don't think this is specific to this query.

I believe this error is raised by java.util.concurrent.ThreadPoolExecutor
when the executor is overwhelmed with tasks or execute() is called while it is
shutting down. I'm confounded as to why this would suddenly be an issue. I
also believe this isn't Tez's fault in particular; it could be YARN hitting
some limit, which means this is probably happening to MR jobs as well.
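
To make the two rejection causes concrete, here is a minimal standalone sketch (not YARN or Tez code; the class and method names are made up for illustration). It provokes a RejectedExecutionException once from a pool that has already terminated and once from a pool that is merely saturated, and returns the exception message in each case. The run state embedded in the message ("Terminated" versus "Running") is what tells the two cases apart; the diagnostics above report "Terminated" with pool size = 0.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {

    // Hypothetical helper: a tiny pool with one worker and a one-slot queue.
    private static ThreadPoolExecutor tinyPool() {
        return new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
    }

    /** Case 1: submit through an ExecutorCompletionService after the pool has terminated. */
    static String rejectAfterShutdown() throws InterruptedException {
        ThreadPoolExecutor pool = tinyPool();
        ExecutorCompletionService<String> ecs = new ExecutorCompletionService<>(pool);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        try {
            ecs.submit(() -> "never runs");
            return "accepted";
        } catch (RejectedExecutionException e) {
            return e.getMessage(); // state reported as "Terminated"
        }
    }

    /** Case 2: the pool is alive, but its only worker is busy and its queue is full. */
    static String rejectWhenSaturated() throws InterruptedException {
        ThreadPoolExecutor pool = tinyPool();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            pool.execute(() -> {            // occupies the single worker
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            pool.execute(() -> { });        // fills the one-slot queue
            pool.execute(() -> { });        // no worker, no queue slot: rejected
            return "accepted";
        } catch (RejectedExecutionException e) {
            return e.getMessage();          // state reported as "Running"
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(rejectAfterShutdown());
        System.out.println(rejectWhenSaturated());
    }
}
```

If the message says "Running" with a non-zero queue, the pool is overloaded; if it says "Terminated", something shut the pool down (or it died) before the task was submitted, and sending it less work will not help.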

Have others faced this issue? If not, what should I be looking at to get
more data on it?

*The Error:*

Task failed, taskId=task_1466828114374_53316_1_00_000029, diagnostics=
 TaskAttempt 0 failed, info=
 Container container_e23_1466828114374_53316_01_000009 finished with
diagnostics set to
 Container failed, exitCode=-1000. Task
java.util.concurrent.ExecutorCompletionService$QueueingFuture@732af2f3
rejected from java.util.concurrent.ThreadPoolExecutor@9bf8295
 Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed
tasks = 111
...
...
TaskAttempt 3 failed, info=
 Container container_e23_1466828114374_53316_01_000018 finished with
diagnostics set to
 Container failed, exitCode=-1000. Task
java.util.concurrent.ExecutorCompletionService$QueueingFuture@6c5f576
rejected from java.util.concurrent.ThreadPoolExecutor@9bf8295
 Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed
tasks = 111



Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1
killedTasks:115
Vertex vertex_1466828114374_53316_1_00
 Map 1
killed/failed due to:OWN_TASK_FAILURE



Thanks,
-Gautam.

Re: Tez jobs on YARN failing sporadically..

Posted by saquib khan <sk...@gmail.com>.

Unsubscribe


Re: Tez jobs on YARN failing sporadically..

Posted by Gautam <ga...@gmail.com>.

*Software Versions*

- Hive : 1.1.0
- Tez : 0.7.1
- Hadoop : 2.6.0




-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."

Re: Tez jobs on YARN failing sporadically..

Posted by Gautam <ga...@gmail.com>.

We found out what happened here. As suspected, this wasn't an issue with
Tez. The job localizer thread on some NMs was crashing with:

2016-07-02 10:20:17,881 ERROR
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Failed to submit rsrc { {
hdfs://master-nn-host:8020/parquet_loader/0052919-160630152347927-oozie-oozi-W/script.q,
1467450680162, FILE, null
},pending,[(container_e25_1467304052008_27086_01_000077)],36144839749319326,FAILED}
for download. Either queue is full or threadpool is shutdown.
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.ExecutorCompletionService$QueueingFuture@921a73e
rejected from java.util.concurrent.ThreadPoolExecutor@3283d190[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
109]



I think we ran into one of the many localization issues reported here:
https://issues.apache.org/jira/browse/YARN-543

In particular, the symptom is that the NM fails to spawn the task container due
to init issues. This affected MR and Tez jobs alike, sometimes even
crashing the AM initialization itself.

*Restarting the affected NMs fixed the issue.*
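
For anyone who wants to check their own NMs for the same symptom, below is a rough sketch of a log scanner. It is only an illustration: the class and method names are hypothetical, and the pattern is derived solely from the messages quoted in this thread, so it may need adjusting for other Hadoop versions. It looks for a "rejected from java.util.concurrent.ThreadPoolExecutor@..." message and reports whether the pool was in the Terminated state with pool size 0, which is what the affected NMs logged.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PoolStateScan {

    // Pattern based on the messages quoted in this thread. The "[" is optional
    // because the task diagnostics show the pool state without brackets, and
    // \s* tolerates the line wrapping seen in both excerpts.
    private static final Pattern REJECTED = Pattern.compile(
            "rejected from\\s+java\\.util\\.concurrent\\.ThreadPoolExecutor@\\p{XDigit}+"
            + "\\s*\\[?\\s*(Running|Shutting down|Terminated),\\s*pool size = (\\d+),"
            + "\\s*active threads = (\\d+)");

    /** Returns the run state of the first rejecting pool found in the text, if any. */
    static Optional<String> firstRejectedPoolState(String logText) {
        Matcher m = REJECTED.matcher(logText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True when the first rejection came from a terminated pool with no workers. */
    static boolean looksLikeDeadPool(String logText) {
        Matcher m = REJECTED.matcher(logText);
        return m.find()
                && m.group(1).equals("Terminated")
                && m.group(2).equals("0");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PoolStateScan <path-to-log-file>");
            return;
        }
        // The log location is site-specific; pass it in explicitly.
        String text = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println("pool state: "
                + firstRejectedPoolState(text).orElse("no rejection found"));
        System.out.println("dead pool:  " + looksLikeDeadPool(text));
    }
}
```

A "dead pool" result on an NM log would only indicate the same symptom we saw; it does not by itself identify the root cause.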


-Gautam.


On Tue, Jul 5, 2016 at 11:55 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
>
> > when the executor is overwhelmed with tasks or execute() is called while
> >shutting down. I'm confounded as to why this would be an issue suddenly.
>
> > Container container_e23_1466828114374_53316_01_000018 finished with
> >diagnostics set to Container failed, exitCode=-1000. Task
> >java.util.concurrent.ExecutorCompletionService$QueueingFuture@6c5f576
>  rejected from java.util.concurrent.ThreadPoolExecutor@9bf8295
>  Terminated, pool size = 0, active threads = 0, queued tasks = 0,
> completed tasks = 111
>
> As always, this needs more info mostly from the yarn logs -applicationId
> <application>.
>
> It's not entirely clear whether this is happening in the NM or the task
> itself.
>
> The active threads = 0, suggests this might be related to pam_limits
> nproc, causing threads to exit without running.
>
> Did you reboot the system recently?
>
> Cheers,
> Gopal
>
>
>



Re: Tez jobs on YARN failing sporadically..

Posted by Gopal Vijayaraghavan <go...@apache.org>.


> when the executor is overwhelmed with tasks or execute() is called while
> shutting down. I'm confounded as to why this would be an issue suddenly.

> Container container_e23_1466828114374_53316_01_000018 finished with
> diagnostics set to Container failed, exitCode=-1000. Task
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@6c5f576
> rejected from java.util.concurrent.ThreadPoolExecutor@9bf8295
> Terminated, pool size = 0, active threads = 0, queued tasks = 0,
> completed tasks = 111

As always, this needs more info, mostly from yarn logs -applicationId
<application>.

It's not entirely clear whether this is happening in the NM or the task
itself.

The active threads = 0 suggests this might be related to the pam_limits
nproc setting, causing threads to exit without running.

Did you reboot the system recently?

Cheers,
Gopal