Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2021/12/26 21:58:00 UTC

[jira] [Comment Edited] (TEZ-4364) TestFaultTolerance timeout on master

    [ https://issues.apache.org/jira/browse/TEZ-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465470#comment-17465470 ] 

László Bodor edited comment on TEZ-4364 at 12/26/21, 9:57 PM:
--------------------------------------------------------------

looks like this is caused by TEZ-4338, based on what I found in a task attempt log:  [^syslog_attempt_1640554229092_0001_1_01_000002_0] 
{code}
2021-12-26 22:30:39,354 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: TaskReporter reporter error which will cause the task to fail
java.lang.NullPointerException
	at org.apache.tez.runtime.api.events.EventProtos$InputReadErrorEventProto$Builder.setDestinationLocalhostName(EventProtos.java:2508)
	at org.apache.tez.runtime.api.impl.TezEvent.serializeEvent(TezEvent.java:196)
	at org.apache.tez.runtime.api.impl.TezEvent.write(TezEvent.java:349)
	at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.write(TezHeartbeatRequest.java:98)
	at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.write(WritableRpcEngine.java:176)
	at org.apache.hadoop.ipc.RpcWritable$WritableWrapper.writeTo(RpcWritable.java:75)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1133)
	at org.apache.hadoop.ipc.Client.call(Client.java:1458)
	at org.apache.hadoop.ipc.Client.call(Client.java:1405)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
	at com.sun.proxy.$Proxy8.heartbeat(Unknown Source)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:278)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:202)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:136)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

after fixing TestInput to properly fill the hostname in InputReadErrorEvent, the issue can no longer be reproduced
it looks like an unexpected error while generating the InputReadErrorEvent can cause the DAG to fail, but I would not consider this a blocker product bug
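
To illustrate the failure mode in the trace above: setters for string fields on protobuf-generated Java builders throw NullPointerException when passed null, so an event with an unset hostname fails at serialization time, on the heartbeat thread. Below is a minimal sketch of that behavior. ErrorEventBuilder is a hypothetical stand-in written for this example, not the real EventProtos class, and the guarded variant is only one possible defensive pattern, not the fix applied here (the fix here was to fill the hostname in TestInput).

```java
import java.util.Objects;

// Hypothetical sketch; none of these classes are part of Tez.
final class NullHostnameSketch {

    /** Mimics a protobuf-generated builder, whose string setters reject null. */
    static final class ErrorEventBuilder {
        private String destinationLocalhostName = "";

        ErrorEventBuilder setDestinationLocalhostName(String value) {
            // Generated protobuf setters throw NullPointerException on null.
            this.destinationLocalhostName = Objects.requireNonNull(value);
            return this;
        }

        String getDestinationLocalhostName() {
            return destinationLocalhostName;
        }
    }

    /** Unguarded path: throws NullPointerException when hostname is null. */
    static ErrorEventBuilder serializeUnguarded(String hostname) {
        return new ErrorEventBuilder().setDestinationLocalhostName(hostname);
    }

    /** Guarded path: leaves the field at its default when hostname is null. */
    static ErrorEventBuilder serializeGuarded(String hostname) {
        ErrorEventBuilder builder = new ErrorEventBuilder();
        if (hostname != null) {
            builder.setDestinationLocalhostName(hostname);
        }
        return builder;
    }

    public static void main(String[] args) {
        // The guarded path tolerates a missing hostname.
        System.out.println(serializeGuarded(null).getDestinationLocalhostName().isEmpty());
        try {
            serializeUnguarded(null);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from the setter");
        }
    }
}
```

Whether a missing hostname should be tolerated at serialization or rejected earlier, when the event is created, is a design choice; the sketch only shows why the null surfaces so late.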


was (Author: abstractdog):
looks like this is caused by TEZ-4338, as I found in a task attempt log:  [^syslog_attempt_1640554229092_0001_1_01_000002_0] 
{code}
2021-12-26 22:30:39,354 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: TaskReporter reporter error which will cause the task to fail
java.lang.NullPointerException
	at org.apache.tez.runtime.api.events.EventProtos$InputReadErrorEventProto$Builder.setDestinationLocalhostName(EventProtos.java:2508)
	at org.apache.tez.runtime.api.impl.TezEvent.serializeEvent(TezEvent.java:196)
	at org.apache.tez.runtime.api.impl.TezEvent.write(TezEvent.java:349)
	at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.write(TezHeartbeatRequest.java:98)
	at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.write(WritableRpcEngine.java:176)
	at org.apache.hadoop.ipc.RpcWritable$WritableWrapper.writeTo(RpcWritable.java:75)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1133)
	at org.apache.hadoop.ipc.Client.call(Client.java:1458)
	at org.apache.hadoop.ipc.Client.call(Client.java:1405)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
	at com.sun.proxy.$Proxy8.heartbeat(Unknown Source)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:278)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:202)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:136)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

after fixing TestInput to properly fill hostname in InputReadErrorEvent, issue cannot be reproduced

> TestFaultTolerance timeout on master
> ------------------------------------
>
>                 Key: TEZ-4364
>                 URL: https://issues.apache.org/jira/browse/TEZ-4364
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: surefire_jstack.log, syslog_attempt_1640554229092_0001_1_01_000002_0
>
>
> The TestFaultTolerance test has become flakier recently. It is important to investigate this, because a unit test failure could also imply a product bug in the handling of failure scenarios.
> According to the surefire process's jstack output, it can be reproduced only with TestFaultTolerance.testBasicInputFailureWithoutExitDeadline [^surefire_jstack.log]
> {code}
> "Thread-1355" #1569 prio=5 os_prio=31 tid=0x00007fe76660c800 nid=0x43d07 waiting on condition [0x000070002ab38000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:155)
> 	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:142)
> 	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:138)
> 	at org.apache.tez.test.TestFaultTolerance.testBasicInputFailureWithoutExitDeadline(TestFaultTolerance.java:351)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> this is the point where the test waits for the DAG to finish
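
The Thread.sleep frame in the jstack above suggests a polling wait: the test checks the DAG status, sleeps, and checks again until the DAG reaches a final state. A minimal sketch of such a loop with its own deadline follows. This is not the actual TestFaultTolerance code; DagWaitSketch, Status, and waitForCompletion are hypothetical names made up for this example.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a polling wait with a deadline; not Tez code.
final class DagWaitSketch {

    enum Status { RUNNING, SUCCEEDED, FAILED }

    /**
     * Polls statusSource until it reports a non-RUNNING status.
     * Throws IllegalStateException if the deadline passes first.
     */
    static Status waitForCompletion(Supplier<Status> statusSource,
                                    long pollMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Status status = statusSource.get();
        while (status == Status.RUNNING) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                    "DAG still RUNNING after " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            status = statusSource.get();
        }
        return status;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Reports RUNNING for the first two polls, then SUCCEEDED.
        Status result = waitForCompletion(
            () -> ++calls[0] < 3 ? Status.RUNNING : Status.SUCCEEDED, 5, 1000);
        System.out.println(result);
    }
}
```

With an explicit deadline inside the loop, a DAG that never finishes produces an error naming the stuck state, rather than only the JUnit timeout seen in the FailOnTimeout frame.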



--
This message was sent by Atlassian Jira
(v8.20.1#820001)