You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Hitesh Sharma (JIRA)" <ji...@apache.org> on 2018/09/28 02:14:00 UTC

[jira] [Commented] (TEZ-3996) Reorder input failed events before data movement events

    [ https://issues.apache.org/jira/browse/TEZ-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631261#comment-16631261 ] 

Hitesh Sharma commented on TEZ-3996:
------------------------------------

A potential solution is to reorder the events in TaskReporter.

https://github.com/apache/tez/blob/64c04f1121ef1d04118e36b0e4fc3808205a50a8/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskReporter.java#L305

 
{code:java}
Collection<TezEvent> tezEvents = response.getEvents();

List<TezEvent> dataMovementEvents = new ArrayList<>();
List<TezEvent> inputFailedEvents = new ArrayList<>();
List<TezEvent> otherEvents = new ArrayList<>();
for (final TezEvent te : tezEvents) {
if (te.getEvent() instanceof DataMovementEvent) {
dataMovementEvents.add(te);
} else if (te.getEvent() instanceof InputFailedEvent) {
inputFailedEvents.add(te);
} else {
otherEvents.add(te);
}
}
// Reorder
tezEvents = new ArrayList<>();
tezEvents.addAll(inputFailedEvents);
tezEvents.addAll(dataMovementEvents);
tezEvents.addAll(otherEvents);
// Now raise events
task.handleEvents(tezEvents);
{code}
Thoughts on this approach?

> Reorder input failed events before data movement events
> -------------------------------------------------------
>
>                 Key: TEZ-3996
>                 URL: https://issues.apache.org/jira/browse/TEZ-3996
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Sharma
>            Priority: Minor
>
> We have a custom processor (AbstractLogicalIOProcessor) that waits for DataMovementEvent to arrive and then starts an external process to do some work. When a revocation happens then the processor recieves an InputFailedEvent, which tells it about the failed input, and we fail the processor as it is working on old inputs. When the new inputs are available then Tez restarts the processor and sends the InputFailedEvent along with all the DataMovementEvent which includes the older versions and the new version that was revocated.
> The issue we are seeing is that the events arrive out of order i.e. many times we see the older DataMovementEvent first at which our processor thinks it is good to start. We then receive the InputFailedEvent and the new version of DataMovementEvent, but that's late and the processor fails. This keeps repeating on every subsequent task attempt and the task fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)