Posted to dev@hive.apache.org by "Prajakta Kalmegh (JIRA)" <ji...@apache.org> on 2010/12/08 06:22:08 UTC
[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

    [ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969186#action_12969186 ] 

Prajakta Kalmegh commented on HIVE-1694:
----------------------------------------

Hi,

I am Prajakta from Persistent Systems Ltd. Along with Nikhil and Prafulla, I am working on the changes that John and Namit suggested above.
This is a design note on how we are implementing those review comments.

We're implementing the following changes as a single transformation in the optimizer:
    (1) Table replacement: involves modifying some internal members of TableScanOperator.
    (2) Group-by removal: involves removing some operators (GBY-RS-GBY), where the group-by is done on both the mapper and the reducer side, and resetting the correct parent and child operators within the DAG.
    (3) Sub-query insertion: involves creating a new DAG for the sub-query and attaching it to the original DAG at an appropriate place.
    (4) Projection modification: involves steps similar to (3).
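
As an illustration of the re-linking needed in (2), here is a minimal sketch. It assumes a simplified, hypothetical Op node class rather than Hive's real Operator class, and the plan shape TS-SEL-GBY-RS-GBY-SEL-FS is only an example. It cuts the GBY-RS-GBY chain out and connects the operator above it directly to the operator below it.

```java
import java.util.ArrayList;
import java.util.List;

final class GroupByRemovalSketch {

    /** Hypothetical stand-in for an operator node; this is not Hive's Operator class. */
    static final class Op {
        final String name;
        final List<Op> parents = new ArrayList<>();
        final List<Op> children = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }

        /** Links this operator to a child and returns the child, so calls can be chained. */
        Op then(Op child) {
            children.add(child);
            child.parents.add(this);
            return child;
        }
    }

    /**
     * Cuts the chain first..last out of the plan and connects first's parent
     * directly to last's child. Assumes first has exactly one parent and last
     * has exactly one child.
     */
    static void removeChain(Op first, Op last) {
        Op parent = first.parents.get(0);
        Op child = last.children.get(0);
        // Replace in place so the position among siblings is preserved.
        parent.children.set(parent.children.indexOf(first), child);
        child.parents.set(child.parents.indexOf(last), parent);
        first.parents.clear();
        last.children.clear();
    }

    /** Renders a linear plan as "A -> B -> C" by following the first child of each node. */
    static String render(Op root) {
        StringBuilder sb = new StringBuilder(root.name);
        Op op = root;
        while (!op.children.isEmpty()) {
            op = op.children.get(0);
            sb.append(" -> ").append(op.name);
        }
        return sb.toString();
    }

    /** Builds an illustrative group-by plan, removes GBY-RS-GBY, and returns the result. */
    static String demo() {
        Op scan = new Op("TS");
        Op mapGby = new Op("GBY");
        Op reduceGby = new Op("GBY");
        scan.then(new Op("SEL")).then(mapGby).then(new Op("RS")).then(reduceGby)
            .then(new Op("SEL")).then(new Op("FS"));
        removeChain(mapGby, reduceGby);
        return render(scan);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the real operator tree, the row schema of the operators around the removed chain would also have to be checked; the sketch only covers the parent/child links.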
    
We have implemented the above changes as a proof of concept. In this implementation, we invoke the rule as the first transformation in the optimizer, so that lineage information is computed afterwards as part of the Generator transformation. Another reason for applying it first is that, as of now, the implementation uses the query block (QB) information to decide whether the transformation can be applied to the input query (similar to the canApplyThisRule() method in the original rewrite code). Finally, changes (3) and (4) modify the operator DAG but not the original QB, which leaves the QB and the operator DAG out of sync.

The major issues in this implementation approach are:
1. The changes listed above require either modifying the operator DAG (in case of 2) or creating a new operator DAG (in case of 3 and 4). Creating a new DAG requires some hacks in the SemanticAnalyzer code (as in the replaceViewReferenceWithDefinition() method, which uses ParseDriver() to do the same). However, the relevant methods (genBodyPlan(...), genSelectPlan(...), etc.) are private, and without access to them changes (3) and (4) are all the more difficult to implement.
2. Creating a new DAG will require creating all the associated data structures, such as the QB and ASTNode, because this information is needed to generate the operator DAG for the sub-queries. This approach would be very similar to what we are already doing in the rewrite, i.e., creating a new QB and ASTNode.
3. Since we are creating a new DAG and appending it to the enclosing query DAG, we also need to modify the row schema and row resolvers for the operators.
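
To make issue 3 concrete, here is a rough sketch of what attaching a new sub-plan implies for the schema of the operators below it. Everything in it is an assumption made for illustration: the Op class, the forwardsInput flag, and the column names (key, value, cnt) are hypothetical, and Hive's actual RowSchema and RowResolver handling is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SubPlanAttachSketch {

    /** Hypothetical operator node that records the column names it outputs. */
    static final class Op {
        final String name;
        final boolean forwardsInput; // true if output columns equal input columns (e.g. a filter)
        List<String> schema;
        Op child;

        Op(String name, boolean forwardsInput, String... columns) {
            this.name = name;
            this.forwardsInput = forwardsInput;
            this.schema = new ArrayList<>(Arrays.asList(columns));
        }
    }

    /**
     * Hangs consumer below the last operator of a newly built sub-plan, then
     * copies the new schema into every downstream operator that merely
     * forwards its input. Propagation stops at the first operator that
     * defines its own output columns.
     */
    static void attach(Op subPlanTail, Op consumer) {
        subPlanTail.child = consumer;
        for (Op op = consumer; op != null && op.forwardsInput; op = op.child) {
            op.schema = new ArrayList<>(subPlanTail.schema);
        }
    }

    /** Attaches a sub-plan above FIL -> SEL and returns both schemas afterwards. */
    static List<List<String>> demo() {
        Op select = new Op("SEL", false, "key");
        Op filter = new Op("FIL", true, "key", "value");
        filter.child = select;
        Op subPlanTail = new Op("SEL", false, "key", "cnt");
        attach(subPlanTail, filter);
        return Arrays.asList(filter.schema, select.schema);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here the filter picks up the sub-plan's columns, while the select below it keeps its own output columns; it would still need to be checked that every column it reads exists in the new input.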

One open question before we finalize the above approach is whether the cost-based optimizer, which is to be implemented in the future, will work on the query block or on the operator DAG. If it works on the operator DAG, the implementation changes listed here will have to be made. However, if it is to work on the query block, we feel that the initial query rewrite engine code, which ran after semantic analysis but before plan generation, can be made to work with the cost-based optimizer. It would be very helpful if you could comment on the cost-based optimizer.
        

> Accelerate query execution using indexes
> ----------------------------------------
>
>                 Key: HIVE-1694
>                 URL: https://issues.apache.org/jira/browse/HIVE-1694
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
>            Assignee: Nikhil Deshpande
>         Attachments: demo_q1.hql, demo_q2.hql, HIVE-1694_2010-10-28.diff
>
>
> The index building patch (Hive-417) is checked into trunk, this JIRA issue tracks supporting indexes in Hive compiler & execution engine for SELECT queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query execution.
> The aim of this effort is to use indexes to accelerate query execution (for certain class of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold the information about index based plans & operator implementations for above mentioned cases. 
