You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Shyam Rai (JIRA)" <ji...@apache.org> on 2018/08/28 06:34:00 UTC
[jira] [Created] (HIVE-20473) Optimization for materialized views
Shyam Rai created HIVE-20473:
--------------------------------
Summary: Optimization for materialized views
Key: HIVE-20473
URL: https://issues.apache.org/jira/browse/HIVE-20473
Project: Hive
Issue Type: Improvement
Components: Hive
Affects Versions: 3.0.0
Environment: Can be reproduced on a Single node pseudo cluster.
Reporter: Shyam Rai
Optimizer is taking advantage of materialized view only when the query syntax matches the way view was created. Here is an example.
*Source table on which materialized views are created*
{code}
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `mysource`( |
| `id` int, |
| `name` string, |
| `start_date` date) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://xlhive3.openstacklocal:8020/warehouse/tablespace/managed/hive/mysource' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='insert_only', |
| 'transient_lastDdlTime'='1535392655') |
+----------------------------------------------------+
{code}
One of the materialized views "view_1" is created to fetch the data between IDs 1 and 2 using this statement
{code}
select `mysource`.`id`, `mysource`.`name`, `mysource`.`start_date` from `default`.`mysource` where `mysource`.`id` between 1 and 2
{code}
*When a SELECT is executed against the source table using the following SELECT statement, this works fine and can be validated with the explain plan.
*
{code}
0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id between 1 and 2;
INFO : Compiling command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c): explain select * from mysource where id between 1 and 2
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c); Time taken: 0.224 seconds
INFO : Executing command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c): explain select * from mysource where id between 1 and 2
INFO : Starting task [Stage-1:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c); Time taken: 0.006 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-0 is a root stage |
| |
| STAGE PLANS: |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| TableScan |
| alias: default.view_1 |
| Select Operator |
| expressions: id (type: int), name (type: string), start_date (type: date) |
| outputColumnNames: _col0, _col1, _col2 |
| ListSink |
| |
+----------------------------------------------------+
{code}
If the rewrite of the same SELECT is written using >= and <=, which should yield the same result, the optimizer does not take advantage of the materialized view, unless of course we create another view with this >= and <= syntax.
{code}
0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id >= 1 and <=2;
Error: Error while compiling statement: FAILED: ParseException line 1:49 cannot recognize input near '<=' '2' '<EOF>' in expression specification (state=42000,code=40000)
0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id >= 1 and id <=2;
INFO : Compiling command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca): explain select * from mysource where id >= 1 and id <=2
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca); Time taken: 0.226 seconds
INFO : Executing command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca): explain select * from mysource where id >= 1 and id <=2
INFO : Starting task [Stage-1:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-0 is a root stage |
| |
| STAGE PLANS: |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| TableScan |
| alias: mysource |
| filterExpr: ((id >= 1) and (id <= 2)) (type: boolean) |
| Filter Operator |
| predicate: ((id >= 1) and (id <= 2)) (type: boolean) |
| Select Operator |
| expressions: id (type: int), name (type: string), start_date (type: date) |
| outputColumnNames: _col0, _col1, _col2 |
| ListSink |
| |
+----------------------------------------------------+
{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)