Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2015/04/25 01:54:38 UTC

[jira] [Comment Edited] (TEZ-2314) Tez task attempt failures due to bad event serialization

    [ https://issues.apache.org/jira/browse/TEZ-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512053#comment-14512053 ] 

Bikas Saha edited comment on TEZ-2314 at 4/24/15 11:53 PM:
-----------------------------------------------------------

Looking at this more, the real issue is that the heartbeat sends all of these objects regardless of whether initialization is in progress. Synchronization alone will not result in correct data, e.g. sending 2 out of 3 members is still possible. Besides, accessing these (and other future members) while initialization is in progress is error-prone. I am changing the heartbeat code to check for initialization before sending such data. The heartbeat itself will still occur (otherwise a long-running initialization would cause the task to time out on the AM liveliness monitor); only sending the data is guarded. The patch also sends stats at the same frequency as counters. This should have been done earlier, since frequent updates for these could overload the AM (similar to the issue with counters).
[~rohini] Since your large jobs frequently repro this error, could you please check with this patch? Thanks!
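
The guard described above can be sketched roughly as follows. This is only an illustration and not the code in the attached patch: the class `GuardedHeartbeat`, its `Heartbeat` payload, `markInitialized()` and the `counterInterval` parameter are all hypothetical names, and strings stand in for the real counter and statistics objects. It shows the two points from the comment: the heartbeat is always produced, but data is attached only once initialization has finished, and stats are sent on the same schedule as counters.

```java
/** Hypothetical sketch of the guard; none of these names come from Tez. */
class GuardedHeartbeat {

    /** One heartbeat payload; a null field means "not sent this time". */
    static final class Heartbeat {
        final int requestId;
        final String counters; // stand-in for serialized counters
        final String stats;    // stand-in for serialized statistics

        Heartbeat(int requestId, String counters, String stats) {
            this.requestId = requestId;
            this.counters = counters;
            this.stats = stats;
        }
    }

    // Set once by the task thread after initialization completes.
    private volatile boolean initialized = false;
    // Assumed knob: attach counters and stats every N-th heartbeat.
    private final int counterInterval;
    private int requestId = 0;

    GuardedHeartbeat(int counterInterval) {
        if (counterInterval <= 0) {
            throw new IllegalArgumentException("counterInterval must be positive");
        }
        this.counterInterval = counterInterval;
    }

    void markInitialized() {
        initialized = true;
    }

    /**
     * Always returns a heartbeat so the liveliness check keeps passing,
     * but attaches data only after initialization, and attaches stats
     * only when counters are attached.
     */
    synchronized Heartbeat nextHeartbeat() {
        int id = ++requestId;
        boolean sendData = initialized && (id % counterInterval == 0);
        String counters = sendData ? "counters@" + id : null;
        String stats = sendData ? "stats@" + id : null;
        return new Heartbeat(id, counters, stats);
    }
}
```

With `counterInterval` set to 2, heartbeats sent before `markInitialized()` carry no data at all, and afterwards only every second heartbeat carries counters and stats together.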


was (Author: bikassaha):
Looking at this a more, the real issue is that heartbeating and sending all these objects happen regardless of whether initialization is in progress or not. Synchronization will not result in correct data. E.g. sending 2 out of 3 members is still possible. Besides, accessing these (and other future members) while initialization is in progress is fraught with errors. Changing the heartbeat code to check for initialization before sending such data. The heartbeat will still occur (or else long running initialization will result in the task timing out on the am liveliness monitor) but only sending the data is guarded. Also, sending stats at the same frequency as counters. Should have done this earlier since frequent updates for these could overload the AM (similar to the issue with counters).
[~rohini] Since your large jobs frequently repro this error, could you please check with this patch? Thanks!

> Tez task attempt failures due to bad event serialization
> --------------------------------------------------------
>
>                 Key: TEZ-2314
>                 URL: https://issues.apache.org/jira/browse/TEZ-2314
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Bikas Saha
>            Priority: Blocker
>         Attachments: TEZ-2314.1.patch, TEZ-2314.log.patch
>
>
> {code}
> 2015-04-13 19:21:48,516 WARN [Socket Reader #3 for port 53530] ipc.Server: Unable to read call parameters for client 10.216.13.112on connection protocol org.apache.tez.common.TezTaskUmbilicalProtocol for rpcKind RPC_WRITABLE
> java.lang.ArrayIndexOutOfBoundsException: 1935896432
>         at org.apache.tez.runtime.api.impl.EventMetaData.readFields(EventMetaData.java:120)
>         at org.apache.tez.runtime.api.impl.TezEvent.readFields(TezEvent.java:271)
>         at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.readFields(TezHeartbeatRequest.java:110)
>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.readFields(WritableRpcEngine.java:160)
>         at org.apache.hadoop.ipc.Server$Connection.processRpcRequest(Server.java:1884)
>         at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1816)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1574)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:806)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:673)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:644)
> {code}
> cc/ [~hitesh] and [~bikassaha]
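
As a side note on the stack trace above: the out-of-range index 1935896432 is 0x73636F70, which is what the four ASCII bytes "scop" look like when read as one big-endian int (the byte order used by `DataInput.readInt`). That is consistent with the reader being out of step with the writer and treating string bytes as an array index, although this interpretation is an assumption and is not confirmed anywhere in the issue. The helper below is a hypothetical illustration (the class `OrdinalDecode` is not part of Tez or Hadoop) that makes the decoding reproducible.

```java
/** Hypothetical helper for inspecting a suspicious index from a stack trace. */
class OrdinalDecode {

    /** Renders the four big-endian bytes of an int as ASCII characters. */
    static String asAscii(int value) {
        StringBuilder sb = new StringBuilder(4);
        for (int shift = 24; shift >= 0; shift -= 8) {
            // Take one byte at a time, most significant byte first.
            sb.append((char) ((value >>> shift) & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Index reported by the ArrayIndexOutOfBoundsException above.
        System.out.println(asAscii(1935896432)); // prints "scop"
    }
}
```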



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)