You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2011/06/25 07:22:47 UTC
[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054816#comment-13054816 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

The current optimizer can identify correlations with query plan tree structures like TPC-H Q17 (in attached file Queries). Using Q17 as an example, sub-query (denoted as sub-Q1 and originally executed by MapReduce job J1) "SELECT l_partkey as t_partkey, 0.2 * avg(l_quantity) AS t_avg_quantity FROM lineitem GROUP BY l_partkey" has correlation with sub-query (denoted as sub-Q2 and originally executed by MapReduce job J2) "SELECT l_quantity, l_partkey, l_extendedprice FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey AND p.p_brand = 'Brand#52' AND p.p_container = 'JUMBO CAN'", because (1)sub-Q1 and sub-Q2 share the same input 'lineitem'; (2) ReduceSinkOperators in J1 and J2 share the same 'key', which is l_partkey (p_partkey). Also, because intermediate tables generated by sub-Q1 and sub-Q2 will be joined by a MapReduce job J3, of which the 'key' of ReduceSinkOperator is 'l_partkey', J3 has correlation with J1 and J2. Thus, J1, J2 and J3 can be merged into one MapReduce job J'. In the map function of J', a composite operator will be used to execute FilterOperators (if any) for sub-Q1 and sub-Q2. Then, in the reduce function of J', a dispatch operator is used to dispatch reduce-input records to JoinOperator in J1 and GroupByOperator in J2. Then, the results of JoinOperator and GroupByOperator will be fed to the JoinOperator in J3. 

For this optimizer, there are several issues. 

1: Because for the MapReduce job executing correlated MapReduce jobs, intermediate key/value pairs will be consumed by multiple operators, Map-side Aggregation is disabled. 
2: For the MapReduce job executing correlated MapReduce jobs, if the depth of execution path in the reduce function is not the same (for example "SELECT * FROM lineitem l1 JOIN (SELECT l_partkey FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey) tmp ON l1.partkey = tmp.partkey"), one or multiple YSmartForwardOperator should be used. I have not completely solved this issue.
3: For two independent MapReduce jobs J1 and J2, the current correlation identifier only searches ReduceSinkOperators with the same 'key(s)' for correlation, actually the set of 'key(s)' of the ReduceSinkOperator in J1 is a subset of that in J2, these two MapReduce jobs are correlated. (Also, sub-queries with distinct keyword associated with Group By clause is under this issue, since distinct keyword is handled by using all columns as 'keys' in its corresponding ReduceSinkOperator)
4: The current correlation identifier can not identify correlations represented by columns involving "max(<column name>)" or "min(<column name>)".

I will start working on this optimizer in August and will firstly solve issues 2-4 mentioned above.  

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira