Posted to issues@tez.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2014/07/08 01:29:34 UTC

[jira] [Created] (TEZ-1265) Custom input to fetch source task inputs in order

Rohini Palaniswamy created TEZ-1265:
---------------------------------------

             Summary: Custom input to fetch source task inputs in order
                 Key: TEZ-1265
                 URL: https://issues.apache.org/jira/browse/TEZ-1265
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Rohini Palaniswamy


Consider the case of having to LIMIT m records after an ORDER BY. A distributed order-by vertex produces data in sorted order across task0, task1, ..., taskn. Each task limits its output to m records (a task may also produce fewer than m). The limit vertex (parallelism 1) following the order-by vertex currently has to fetch the output of all n tasks, shuffle-merge its inputs (to maintain the order), and then limit to m records again. So we need an input that fetches from the source tasks in task order and reads their outputs in that same order. Since the data produced is already ordered across task0, task1, ..., taskn, it can be consumed without a shuffle and sort. If the limit is hit early, the input can skip fetching the remaining task outputs.
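
The consumption pattern described above can be sketched roughly as follows. This is only an illustration of the idea in Python, not Tez code: `ordered_limit`, `fetchers`, and `make_fetcher` are hypothetical names, and each fetcher is a stand-in for "fetch the output of source task i". The sketch assumes the task outputs are already range-partitioned and sorted, so that concatenating them in task order yields globally sorted data.

```python
from typing import Callable, Iterator, List, Sequence


def ordered_limit(
    fetchers: Sequence[Callable[[], Iterator[str]]],
    limit: int,
) -> List[str]:
    """Read at most `limit` records, visiting source tasks in index order.

    fetchers[i] is a hypothetical stand-in for fetching task i's output.
    A fetcher is only called if the limit has not been reached yet.
    """
    result: List[str] = []
    if limit <= 0:
        return result
    for fetch in fetchers:
        for record in fetch():
            result.append(record)
            if len(result) == limit:
                # Limit hit: the remaining task outputs are never fetched.
                return result
    return result


# Usage: three task outputs, each already sorted and capped at m = 3.
fetched: List[int] = []  # records which tasks were actually fetched


def make_fetcher(task_index: int, records: List[str]) -> Callable[[], Iterator[str]]:
    def fetch() -> Iterator[str]:
        fetched.append(task_index)
        return iter(records)
    return fetch


sources = [
    make_fetcher(0, ["a", "b"]),        # fewer than m records
    make_fetcher(1, ["c", "d", "e"]),
    make_fetcher(2, ["f", "g", "h"]),
]
top = ordered_limit(sources, 3)
```

In this example the limit is reached while reading the second task's output, so the third task is never fetched and no merge step is involved.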

More details in https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217



--
This message was sent by Atlassian JIRA
(v6.2#6252)