You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2015/04/29 17:48:06 UTC

[jira] [Resolved] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion

     [ https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth resolved HIVE-10480.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: llap

> LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-10480
>                 URL: https://issues.apache.org/jira/browse/HIVE-10480
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>             Fix For: llap
>
>         Attachments: HIVE-10480.1.txt
>
>
> No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the cluster), or all 3.
> So for now I will just dump all I have here.
> TPCH Q1 started running for a long time for me on large number of runs today (didn't happen yesterday). It would always be one Map task timing out.
>  Example attempt (logs from am):
> {noformat}
> 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] tezplugins.LlapTaskCommunicator: Successfully launched task: attempt_1429683757595_0321_9_00_000928_0
> 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out after 300 secs, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
> {noformat}
> No other lines for this attempt in between.
> However there's this:
> {noformat}
> 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: Unable to read call parameters for client 172.19.128.56on connection protocol org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind RPC_WRITABLE
> java.lang.ArrayIndexOutOfBoundsException
> 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable to read call parameters: null]
> {noformat}
> On LLAP, the following is logged 
> {noformat}
> 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: null
>         at org.apache.hadoop.ipc.Client.call(Client.java:1492)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1423)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)
>         at com.sun.proxy.$Proxy19.heartbeat(Unknown Source)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The attempt starts but is then interrupted (not clear by whom)
> {noformat}
> 2015-04-24 11:11:01,144 [Initializer 0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Initialized Input with src edge: lineitem
> 2015-04-24 11:11:01,145 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0321_9_00_000928_0
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>         at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>         at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:218)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:177)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
> {noformat}
> There's a heartbeat error that may or may not be spurious:
> {noformat}
> 2015-04-24 11:11:01,146 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Ignoring the following exception since a previous exception is already registered
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.access$300(LlapTaskReporter.java:129)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter.taskFailed(LlapTaskReporter.java:394)
>         at org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:261)
>         at org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:52)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:228)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
> {noformat}
> And nothing more for this task.
> Task succeeds on rerun. Other tasks also succeed on this node meanwhile.
> 1) Is it Hadoop IPC bug due to new patch?
> 2) Even if so (assuming IPC is not totally broken) I wonder if the Heartbeat NPE (which we have seen before, without the IPC patch too IIRC) is actually a real issue that prevents the task from being sent to AM?
> 3) Who interrupts the task and why? AM doesn't have any logs about that and it happens after RPC error, not before.
> Btw, there's another flavor or RPC error:
> {noformat}
> 2015-04-24 10:36:30,183 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable to read call parameters: 1382376565]
> {noformat}
> application_1429683757595_0320, application_1429683757595_0321



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)