You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Syed Shameerur Rahman (Jira)" <ji...@apache.org> on 2020/05/18 11:22:00 UTC

[jira] [Commented] (TEZ-2103) Implement a Partial completion VertexManagerPlugin

    [ https://issues.apache.org/jira/browse/TEZ-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110149#comment-17110149 ] 

Syed Shameerur Rahman commented on TEZ-2103:
--------------------------------------------

 [~gopalv] [~jeagles] Could you please review the design decisions in *TEZ-2103.WIP.patch*. I am not sure if this is the right approach to solve the problem but here is the summary.

*Tez Side:*

1) On *V_TASK_COMPLETED* event, check if the task is succeeded if yes, then calculate the no. of records returned from the task counters (*groupName*: HIVE, *counterName*: RECORDS_OUT_0)

2) Check if the maxRow limit have been exceeded, if yes for all the remaining tasks send TaskEventTermination event. (marking the state of tasks/task attempts as killed due to short circuit makes more sense than marking it as SUCCEEDED).

3) Finally mark the vertex as SUCCEEDED


*Hive Side:*

1) We do have a class in hive *GlobalLimitOptimizer* which can detect simple select queries with where clause and limit and extract the defined limit from such queries.

2) Pass the extracted limit from hive as dagConf, which can be used to set the limit's value in tez.

I am not sure about the Pig's use case.

> Implement a Partial completion VertexManagerPlugin
> --------------------------------------------------
>
>                 Key: TEZ-2103
>                 URL: https://issues.apache.org/jira/browse/TEZ-2103
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Gopal Vijayaraghavan
>            Priority: Major
>              Labels: gsoc, gsoc2015, hadoop, java, tez
>         Attachments: TEZ-2103.WIP.patch
>
>
> Currently, there is no sibling communication between tasks - this implies that a task can be completed by the first vertex in a wave of tasks, but the entire wave of tasks has to complete before success can be reported.
> This occurs in limit + filter query patterns common between the data access engines.
> {code}
> select * from data where x > 1 limit 10;
> {code}
> will run through a full-table scan worth of tasks to generate 10 rows per task, to aggregate it to produce the final 10 row result.
> The VertexManager receives counters/events early enough to short-circuit the rest of the vertex tasks, to prevent the remainder of tasks from getting scheduled when the limit condition has been satisfied by an initial sub-set of the tasks.
> This is a specialization of the VertexManagerPlugin for this common case scheduling pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)