Posted to issues@tez.apache.org by "Ahmed Hussein (Jira)" <ji...@apache.org> on 2020/01/22 14:39:00 UTC

[jira] [Comment Edited] (TEZ-3391) MR split file validation should be done in the AM

    [ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021119#comment-17021119 ] 

Ahmed Hussein edited comment on TEZ-3391 at 1/22/20 2:38 PM:
-------------------------------------------------------------

I agree with [~rohini] that the implementation is not efficient.
The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to the task initializer. This may imply significant code changes.
The existing code also has significant space overhead because each task creates its own array of split meta entries, so across {{n}} tasks the total space is on the order of {{n^2}}. The patch reduces the space complexity, but each task still needs to go through the entire meta file.
Finally, the code was not closing the {{InputStream}} properly, so an exception would leak the handle.
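
To make the idea concrete, here is a minimal sketch of the "read once, validate up front, always close the stream" approach. It is not the patch and not the Tez API: {{SplitMetaSketch}}, {{SplitMeta}}, {{encode}}, {{readAllOnce}} and the toy file layout (an int count followed by one long per split) are all hypothetical stand-ins, and the 10000000 limit is simply the value mentioned in the issue description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for the split meta handling; not the Tez API. */
final class SplitMetaSketch {

    /** Hypothetical per-split record; the real TaskSplitMetaInfo carries more fields. */
    static final class SplitMeta {
        final long startOffset;

        SplitMeta(long startOffset) {
            this.startOffset = startOffset;
        }
    }

    /** Writes a toy meta file: an int count followed by one long per split. */
    static byte[] encode(long... offsets) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(offsets.length);
            for (long offset : offsets) {
                out.writeLong(offset);
            }
        }
        return bytes.toByteArray();
    }

    /**
     * Validates the file size before reading, then reads the whole array once.
     * try-with-resources closes the stream even when the size check or a read throws.
     */
    static SplitMeta[] readAllOnce(InputStream raw, long fileSize, long maxSize)
            throws IOException {
        try (DataInputStream in = new DataInputStream(raw)) {
            if (maxSize > 0 && fileSize > maxSize) {
                throw new IOException(
                    "Split metadata size exceeded " + maxSize + ": " + fileSize);
            }
            int count = in.readInt();
            SplitMeta[] all = new SplitMeta[count];
            for (int i = 0; i < count; i++) {
                all[i] = new SplitMeta(in.readLong());
            }
            return all;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = encode(0L, 128L, 256L);
        // One read and one validation, done by a single caller (the AM in this sketch).
        SplitMeta[] all =
            readAllOnce(new ByteArrayInputStream(file), file.length, 10_000_000L);
        // Each task would then be handed only its own entry, all[index].
        for (int task = 0; task < all.length; task++) {
            System.out.println("task " + task + " -> offset " + all[task].startOffset);
        }
    }
}
```

With this shape the array exists once rather than once per task, and an oversized meta file fails a single validation instead of failing every task separately.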

[~jeagles], can you please take a look at the patch and merge it at your convenience?


was (Author: ahussein):
I agree with [~rohini] that the implementation is not efficient.
The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to the task initializer. This may imply significant code changes.
The existing code also has significant space overhead because each task creates its own array of split meta entries, so across {{n}} tasks the total space is on the order of {{n^2}}. The patch reduces the space complexity, but each task still needs to go through the entire meta file.

[~jeagles], can you please take a look at the patch and merge it at your convenience?

> MR split file validation should be done in the AM
> -------------------------------------------------
>
>                 Key: TEZ-3391
>                 URL: https://issues.apache.org/jira/browse/TEZ-3391
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Ahmed Hussein
>            Priority: Major
>         Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
>   We had a case where the split metadata size exceeded 10000000. Instead of the job failing validation during initialization in the AM, as in MapReduce, each of the tasks failed that validation during its own initialization.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)