You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Shyam Rai (JIRA)" <ji...@apache.org> on 2018/08/28 06:34:00 UTC

[jira] [Created] (HIVE-20473) Optimization for materialized views

Shyam Rai created HIVE-20473:
--------------------------------

             Summary: Optimization for materialized views
                 Key: HIVE-20473
                 URL: https://issues.apache.org/jira/browse/HIVE-20473
             Project: Hive
          Issue Type: Improvement
          Components: Hive
    Affects Versions: 3.0.0
         Environment: Can be reproduced on a Single node pseudo cluster. 
            Reporter: Shyam Rai


Optimizer is taking advantage of materialized view only when the query syntax matches the way view was created. Here is an example.

*Source table on which materialized views are created*

{code}
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `mysource`(                           |
|   `id` int,                                        |
|   `name` string,                                   |
|   `start_date` date)                               |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'field.delim'=',',                               |
|   'serialization.format'=',')                      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://xlhive3.openstacklocal:8020/warehouse/tablespace/managed/hive/mysource' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'transactional'='true',                          |
|   'transactional_properties'='insert_only',        |
|   'transient_lastDdlTime'='1535392655')            |
+----------------------------------------------------+
{code}

One of the materialized views "view_1" is created to fetch the data between IDs 1 and 2 using this statement
{code}
select `mysource`.`id`, `mysource`.`name`, `mysource`.`start_date` from `default`.`mysource` where `mysource`.`id` between 1 and 2
{code}

*When a SELECT is executed against the source table using the following SELECT statement, this works fine and can be validated with the explain plan.
*
{code}
0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id between 1 and 2;
INFO  : Compiling command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c): explain select * from mysource where id between 1 and 2
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c); Time taken: 0.224 seconds
INFO  : Executing command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c): explain select * from mysource where id between 1 and 2
INFO  : Starting task [Stage-1:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c); Time taken: 0.006 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-0 is a root stage                          |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         TableScan                                  |
|           alias: default.view_1                    |
|           Select Operator                          |
|             expressions: id (type: int), name (type: string), start_date (type: date) |
|             outputColumnNames: _col0, _col1, _col2 |
|             ListSink                               |
|                                                    |
+----------------------------------------------------+
{code}

If the rewrite of the same SELECT is written using >= and <=, which should yield the same result, the optimizer does not take advantage of the materialized view, unless of course we create another view with this >= and <= syntax. 

{code}
0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id >= 1 and <=2;
Error: Error while compiling statement: FAILED: ParseException line 1:49 cannot recognize input near '<=' '2' '<EOF>' in expression specification (state=42000,code=40000)
0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id >= 1 and id <=2;
INFO  : Compiling command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca): explain select * from mysource where id >= 1 and id <=2
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca); Time taken: 0.226 seconds
INFO  : Executing command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca): explain select * from mysource where id >= 1 and id <=2
INFO  : Starting task [Stage-1:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-0 is a root stage                          |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         TableScan                                  |
|           alias: mysource                          |
|           filterExpr: ((id >= 1) and (id <= 2)) (type: boolean) |
|           Filter Operator                          |
|             predicate: ((id >= 1) and (id <= 2)) (type: boolean) |
|             Select Operator                        |
|               expressions: id (type: int), name (type: string), start_date (type: date) |
|               outputColumnNames: _col0, _col1, _col2 |
|               ListSink                             |
|                                                    |
+----------------------------------------------------+
{code}






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)