Posted to issues@tez.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2014/07/08 01:31:35 UTC

[jira] [Commented] (TEZ-1265) Custom input to fetch source task inputs in order

    [ https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054287#comment-14054287 ] 

Rohini Palaniswamy commented on TEZ-1265:
-----------------------------------------

[~sseth] had a better suggestion: make it even more generic and controllable than fetching in order. See https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14054266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054266

A custom version of the Input which, instead of providing a unified view of the data, gives access to individual chunks along with meta-information (taskId etc). This could, additionally, be fully controlled by the user in terms of which chunks need to be fetched.
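
As a rough illustration of how such a chunk-level input could serve the LIMIT-after-ORDER-BY case from the issue description, here is a minimal sketch in plain Java. It does not use any Tez API: ChunkSource, numChunks, fetch and readInOrderWithLimit are hypothetical names invented for this example, and a real input would also expose meta-information such as the taskId. The only point is that the consumer decides which chunks get fetched, reads them in task order, and stops once m records have been read.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class OrderedLimitSketch {

    /**
     * Hypothetical stand-in for a chunk-level input (not a Tez interface).
     * Chunk i holds the already-sorted output of source task i.
     */
    public interface ChunkSource<T> {
        /** Number of source tasks, i.e. number of chunks available. */
        int numChunks();

        /** Fetches the chunk of one source task; only called when the chunk is needed. */
        Iterator<T> fetch(int taskIndex);
    }

    /**
     * Reads chunks in task order (task0, task1, ...) and stops after
     * `limit` records. Chunks past the point where the limit is reached
     * are never fetched, and no merge or sort is performed.
     */
    public static <T> List<T> readInOrderWithLimit(ChunkSource<T> source, int limit) {
        List<T> out = new ArrayList<>();
        for (int task = 0; task < source.numChunks() && out.size() < limit; task++) {
            Iterator<T> records = source.fetch(task);
            while (records.hasNext() && out.size() < limit) {
                out.add(records.next());
            }
        }
        return out;
    }
}
```

This relies on the property stated in the issue: the order by vertex produces globally sorted data across task0,task1...taskn, so concatenating chunks in task order preserves the sort order.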

> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
>                 Key: TEZ-1265
>                 URL: https://issues.apache.org/jira/browse/TEZ-1265
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an ORDER BY. A distributed order by vertex produces data in sorted order from task0,task1...taskn. Each task limits its output to m records (a task may also output fewer than m). The limit vertex (parallelism 1) following the order by vertex currently has to fetch the output of all n tasks, shuffle-merge its inputs (to maintain the order) and then limit to m records again. So we need an input that fetches from source tasks in order and reads them in order. Since the data produced is ordered from task0,task1...taskn, it can be consumed without shuffle and sort. If the limit is hit early, it can skip fetching the remaining task inputs.
> More details in https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217



--
This message was sent by Atlassian JIRA
(v6.2#6252)