Posted to dev@hive.apache.org by "Hari Sankar Sivarama Subramaniyan (JIRA)" <ji...@apache.org> on 2014/03/20 21:23:43 UTC

[jira] [Commented] (HIVE-6642) Query fails to vectorize when a non string partition column is part of the query expression

    [ https://issues.apache.org/jira/browse/HIVE-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942228#comment-13942228 ] 

Hari Sankar Sivarama Subramaniyan commented on HIVE-6642:
---------------------------------------------------------

https://reviews.apache.org/r/19492/

> Query fails to vectorize when a non string partition column is part of the query expression
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6642
>                 URL: https://issues.apache.org/jira/browse/HIVE-6642
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Hari Sankar Sivarama Subramaniyan
>            Assignee: Hari Sankar Sivarama Subramaniyan
>         Attachments: HIVE-6642.1.patch
>
>
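> (Note: for the repro below to exercise the vectorizer at all, vectorized execution has to be on; assuming a default session, enable it first.)
> set hive.vectorized.execution.enabled = true;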
> drop table if exists alltypesorc_part;
> CREATE TABLE alltypesorc_part (
> ctinyint tinyint,
> csmallint smallint,
> cint int,
> cbigint bigint,
> cfloat float,
> cdouble double,
> cstring1 string,
> cstring2 string,
> ctimestamp1 timestamp,
> ctimestamp2 timestamp,
> cboolean1 boolean,
> cboolean2 boolean) partitioned by (ds int) STORED AS ORC;
> insert overwrite table alltypesorc_part partition (ds=2011) select * from alltypesorc limit 100;
> insert overwrite table alltypesorc_part partition (ds=2012) select * from alltypesorc limit 200;
> explain select *
> from (select ds from alltypesorc_part) t1,
>      alltypesorc t2
> where t1.ds = t2.cint
> order by t2.ctimestamp1
> limit 100;
> The above query fails to vectorize because the subquery (select ds from alltypesorc_part) t1 returns the partition column ds as a string (even though ds is declared int), while the join equality compares it against t2.cint, an int column; the predicate is therefore typed STRING = LONG and no vector expression matches (a cast-based workaround is sketched after the stack trace below). The correct output when vectorization is turned on should be:
> STAGE DEPENDENCIES:
>   Stage-5 is a root stage
>   Stage-2 depends on stages: Stage-5
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-5
>     Map Reduce Local Work
>       Alias -> Map Local Tables:
>         t1:alltypesorc_part
>           Fetch Operator
>             limit: -1
>       Alias -> Map Local Operator Tree:
>         t1:alltypesorc_part
>           TableScan
>             alias: alltypesorc_part
>             Statistics: Num rows: 300 Data size: 62328 Basic stats: COMPLETE Column stats: COMPLETE
>             Select Operator
>               expressions: ds (type: int)
>               outputColumnNames: _col0
>               Statistics: Num rows: 300 Data size: 1200 Basic stats: COMPLETE Column stats: COMPLETE
>               HashTable Sink Operator
>                 condition expressions:
>                   0 {_col0}
>                   1 {ctinyint} {csmallint} {cint} {cbigint} {cfloat} {cdouble} {cstring1} {cstring2} {ctimestamp1} {ctimestamp2} {cboolean1} {cboolean2}
>                 keys:
>                   0 _col0 (type: int)
>                   1 cint (type: int)
>   Stage: Stage-2
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: t2
>             Statistics: Num rows: 3536 Data size: 1131711 Basic stats: COMPLETE Column stats: NONE
>             Map Join Operator
>               condition map:
>                    Inner Join 0 to 1
>               condition expressions:
>                 0 {_col0}
>                 1 {ctinyint} {csmallint} {cint} {cbigint} {cfloat} {cdouble} {cstring1} {cstring2} {ctimestamp1} {ctimestamp2} {cboolean1} {cboolean2}
>               keys:
>                 0 _col0 (type: int)
>                 1 cint (type: int)
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
>               Statistics: Num rows: 3889 Data size: 1244882 Basic stats: COMPLETE Column stats: NONE
>               Filter Operator
>                 predicate: (_col0 = _col3) (type: boolean)
>                 Statistics: Num rows: 1944 Data size: 622280 Basic stats: COMPLETE Column stats: NONE
>                 Select Operator
>                   expressions: _col0 (type: int), _col1 (type: tinyint), _col2 (type: smallint), _col3 (type: int), _col4 (type: bigint), _col5 (type: float), _col6 (type: double), _col7 (type: string), _col8 (type: string), _col9 (type: timestamp), _col10 (type: timestamp), _col11 (type: boolean), _col12 (type: boolean)
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
>                   Statistics: Num rows: 1944 Data size: 622280 Basic stats: COMPLETE Column stats: NONE
>                   Reduce Output Operator
>                     key expressions: _col9 (type: timestamp)
>                     sort order: +
>                     Statistics: Num rows: 1944 Data size: 622280 Basic stats: COMPLETE Column stats: NONE
>                     value expressions: _col0 (type: int), _col1 (type: tinyint), _col2 (type: smallint), _col3 (type: int), _col4 (type: bigint), _col5 (type: float), _col6 (type: double), _col7 (type: string), _col8 (type: string), _col9 (type: timestamp), _col10 (type: timestamp), _col11 (type: boolean), _col12 (type: boolean)
>       Local Work:
>         Map Reduce Local Work
>       Execution mode: vectorized
>       Reduce Operator Tree:
>         Extract
>           Statistics: Num rows: 1944 Data size: 622280 Basic stats: COMPLETE Column stats: NONE
>           Limit
>             Number of rows: 100
>             Statistics: Num rows: 100 Data size: 32000 Basic stats: COMPLETE Column stats: NONE
>             File Output Operator
>               compressed: false
>               Statistics: Num rows: 100 Data size: 32000 Basic stats: COMPLETE Column stats: NONE
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 100
> Whereas with the current code, vectorization fails with the following exception:
> 14/03/12 14:43:19 DEBUG vector.VectorizationContext: No vector udf found for GenericUDFOPEqual, descriptor: Argument Count = 2, mode = FILTER, Argument Types = {STRING,LONG}, Input Expression Types = {COLUMN,COLUMN}
> 14/03/12 14:43:19 DEBUG physical.Vectorizer: Failed to vectorize
> org.apache.hadoop.hive.ql.metadata.HiveException: Udf: GenericUDFOPEqual, is not supported
> 	at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getGenericUdfVectorExpression(VectorizationContext.java:854)
> 	at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:300)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateExprNodeDesc(Vectorizer.java:682)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateFilterOperator(Vectorizer.java:606)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateOperator(Vectorizer.java:537)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$ValidationNodeProcessor.process(Vectorizer.java:367)
> 	at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:132)
> 	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateMapWork(Vectorizer.java:314)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.convertMapWork(Vectorizer.java:283)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.dispatch(Vectorizer.java:270)
> 	at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> 	at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:194)
> 	at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:139)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.resolve(Vectorizer.java:519)
> 	at org.apache.hadoop.hive.ql.optimizer.physical.PhysicalOptimizer.optimize(PhysicalOptimizer.java:100)
> 	at org.apache.hadoop.hive.ql.parse.MapReduceCompiler.optimizeTaskPlan(MapReduceCompiler.java:290)
> 	at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:216)
> 	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9286)
> 	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
> 	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:64)
> 	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
> 	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:398)
> 	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:294)
> 	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:948)
> 	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:996)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:884)
> 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:874)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
> 	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359)
> 	at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:457)
> 	at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:467)
> 	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:125)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> 	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
> 	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:687)
> 	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:626)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
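> A possible stop-gap until the fix lands (a sketch only, not verified against this build): cast the partition column explicitly in the subquery so both sides of the join equality are int, avoiding the STRING = LONG comparison that the vectorizer rejects.
> explain select *
> from (select cast(ds as int) as ds from alltypesorc_part) t1,
>      alltypesorc t2
> where t1.ds = t2.cint
> order by t2.ctimestamp1
> limit 100;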


