Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2023/03/21 08:25:00 UTC

[jira] [Commented] (HIVE-27159) Filters are not pushed down for decimal format in Parquet

    [ https://issues.apache.org/jira/browse/HIVE-27159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703090#comment-17703090 ] 

Stamatis Zampetakis commented on HIVE-27159:
--------------------------------------------

This looks like a duplicate of HIVE-26091; let's close one of them.

> Filters are not pushed down for decimal format in Parquet
> ---------------------------------------------------------
>
>                 Key: HIVE-27159
>                 URL: https://issues.apache.org/jira/browse/HIVE-27159
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance
>
> Decimal filters are not created and pushed down in Parquet readers. This adds latency and causes unnecessary row processing during query execution.
> The conversion throws an exception at runtime (see the stack trace below), so far more rows are read and processed than necessary.
> E.g. TPC-DS Q13:
> {noformat}
> Parquet: (Map 1)
> INFO  : Task Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :      Map 1          31254.00              0              0     549,181,950              133
> INFO  :      Map 3              0.00              0              0          73,049              365
> INFO  :      Map 4           2027.00              0              0       6,000,000        1,689,919
> INFO  :      Map 5              0.00              0              0           7,200            1,440
> INFO  :      Map 6            517.00              0              0       1,920,800          493,920
> INFO  :      Map 7              0.00              0              0           1,002            1,002
> INFO  :  Reducer 2          18716.00              0              0             133                0
> INFO  : ----------------------------------------------------------------------------------------------
> ORC:
> INFO  : Task Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :      Map 1           6556.00              0              0     267,146,063              152
> INFO  :      Map 3              0.00              0              0          10,000              365
> INFO  :      Map 4           2014.00              0              0       6,000,000        1,689,919
> INFO  :      Map 5              0.00              0              0           7,200            1,440
> INFO  :      Map 6            504.00              0              0       1,920,800          493,920
> INFO  :  Reducer 2           3159.00              0              0             152                0
> INFO  : ----------------------------------------------------------------------------------------------
> {noformat}
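> The Map 1 record-count gap above (549,181,950 rows read with Parquet vs. 267,146,063 with ORC) is consistent with the decimal predicates never reaching the Parquet reader. For context, here is a minimal sketch of how a translated predicate is normally registered with Parquet so that row groups failing the min/max check get skipped; the long column stands in for a type whose conversion already works, since the DECIMAL path is exactly what fails:
> {noformat}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.parquet.filter2.predicate.FilterApi;
> import org.apache.parquet.filter2.predicate.FilterPredicate;
> import org.apache.parquet.hadoop.ParquetInputFormat;
>
> public class PushdownSketch {
>   public static void main(String[] args) {
>     // Build a pushable predicate, e.g. ss_hdemo_sk >= 0 (illustrative only).
>     FilterPredicate pred =
>         FilterApi.gtEq(FilterApi.longColumn("ss_hdemo_sk"), 0L);
>
>     // Register it on the job configuration; the Parquet reader then skips
>     // row groups whose column statistics cannot satisfy the predicate.
>     Configuration conf = new Configuration();
>     ParquetInputFormat.setFilterPredicate(conf, pred);
>   }
> }
> {noformat}
> The corresponding Parquet plan for Map 1, with the decimal comparisons visible in filterExpr: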
> {noformat}
>  Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
>                   probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
>                   Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
>                     Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
>                       outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
>                       Statistics: Num rows: 2500252205 Data size: 714761816164 Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col7 (type: bigint)
>                           1 _col0 (type: bigint)
>                         outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13
>                         input vertices:
>                           1 Map 3
>                         Statistics: Num rows: 502508168 Data size: 127400492016 Basic stats: COMPLETE Column stats: COMPLETE
>                         Map Join Operator
>                           condition map:
>                                Inner Join 0 to 1
>                           keys:
>                             0 _col2 (type: bigint)
>                             1 _col0 (type: bigint)
>                           outputColumnNames: _col0, _col1, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13, _col16, _col17, _col18
>                           input vertices:
>                             1 Map 4
>                           Statistics: Num rows: 86972608 Data size: 10207471112 Basic stats: COMPLETE Column stats: COMPLETE
>                           Filter Operator
>                             predicate: ((_col16 and _col8) or (_col17 and _col9) or (_col18 and _col10)) (type: boolean)
>                             Statistics: Num rows: 65229456 Data size: 7655603392 Basic stats: COMPLETE Column stats: COMPLETE
>                             Map Join Operator
>                               condition map:
>                                    Inner Join 0 to 1
>                               keys:
>                                 0 _col1 (type: bigint)
>                                 1 _col0 (type: bigint)
>                               outputColumnNames: _col0, _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21
>                               input vertices:
>                                 1 Map 5
>                               Statistics: Num rows: 13045892 Data size: 260918084 Basic stats: COMPLETE Column stats: COMPLETE
>                               Map Join Operator
>                                 condition map:
>                                      Inner Join 0 to 1
>                                 keys:
>                                   0 _col0 (type: bigint)
>                                   1 _col0 (type: bigint)
>                                 outputColumnNames: _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21, _col23, _col24, _col25, _col26, _col27, _col28
>                                 input vertices:
>                                   1 Map 6
>                                 Statistics: Num rows: 3354659 Data size: 147605232 Basic stats: COMPLETE Column stats: COMPLETE
>                                 Filter Operator
>                                   predicate: ((_col23 and _col24 and _col11 and _col20) or (_col25 and _col26 and _col12 and _col21) or (_col27 and _col28 and _col13 and _col21)) (type: boolean)
>                                   Statistics: Num rows: 628998 Data size: 27676148 Basic stats: COMPLETE Column stats: COMPLETE
>                                   Map Join Operator
>                                     condition map:
>                                          Inner Join 0 to 1
>                                     keys:
>                                       0 _col3 (type: bigint)
>                                       1 _col0 (type: bigint)
>                                     outputColumnNames: _col4, _col5, _col6
>                                     input vertices:
>                                       1 Map 7
>                                     Statistics: Num rows: 628998 Data size: 228 Basic stats: COMPLETE Column stats: COMPLETE
>                                     Group By Operator
>                                       aggregations: sum(_col4), count(_col4), sum(_col5), count(_col5), sum(_col6), count(_col6)
>                                       minReductionHashAggr: 0.99
>                                       mode: hash
>                                       outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>                                       Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
>                                       Reduce Output Operator
>                                         null sort order:
>                                         sort order:
>                                         Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
>                                         value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: decimal(17,2)), _col3 (type: bigint), _col4 (type: decimal(17,2)), _col5 (type: bigint)
> {noformat}
> Stack:
> {noformat}
> fail to build predicate filter leaf with errors: org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
> org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
> 	at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:210)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
> 	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:202)
> 	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:139)
> 	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setupMetadataAndParquetSplit(ParquetRecordReaderBase.java:88)
> 	at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:178)
> 	at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:52)
> 	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
> 	at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
> 	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:460)
> 	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
> 	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
> 	at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
> 	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
> 	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:297)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:84)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:70)
> 	at java.base/java.security.AccessController.doPrivileged(Native Method)
> 	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:70)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:40)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
> 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
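> The failing branch is the DECIMAL case in LeafFilterFactory.getLeafFilterBuilderByType. One possible direction, sketched below under the assumption that the decimal is stored as FIXED_LEN_BYTE_ARRAY (big-endian two's-complement unscaled value, per the Parquet spec): a Parquet UserDefinedPredicate that decodes the bytes back to BigDecimal before comparing, which sidesteps the sign problems of raw lexicographic binary comparison. Class and column names are illustrative, and this is not necessarily what the eventual patch should do:
> {noformat}
> import java.io.Serializable;
> import java.math.BigDecimal;
> import java.math.BigInteger;
>
> import org.apache.parquet.filter2.predicate.FilterApi;
> import org.apache.parquet.filter2.predicate.FilterPredicate;
> import org.apache.parquet.filter2.predicate.Statistics;
> import org.apache.parquet.filter2.predicate.UserDefinedPredicate;
> import org.apache.parquet.io.api.Binary;
>
> /** Hypothetical sketch: value >= threshold for a FIXED_LEN_BYTE_ARRAY decimal. */
> public class DecimalGtEq extends UserDefinedPredicate<Binary> implements Serializable {
>   private final BigDecimal threshold;
>   private final int scale;
>
>   public DecimalGtEq(BigDecimal threshold, int scale) {
>     this.threshold = threshold;
>     this.scale = scale;
>   }
>
>   // The DECIMAL annotation stores the unscaled value as big-endian
>   // two's-complement bytes; BigInteger decodes exactly that layout.
>   private BigDecimal decode(Binary v) {
>     return new BigDecimal(new BigInteger(v.getBytes()), scale);
>   }
>
>   @Override
>   public boolean keep(Binary value) {
>     // Parquet passes null for null cells; null never satisfies >=.
>     return value != null && decode(value).compareTo(threshold) >= 0;
>   }
>
>   @Override
>   public boolean canDrop(Statistics<Binary> stats) {
>     // A row group can be skipped only when even its max is below the threshold.
>     return decode(stats.getMax()).compareTo(threshold) < 0;
>   }
>
>   @Override
>   public boolean inverseCanDrop(Statistics<Binary> stats) {
>     // NOT(>=) can be skipped when every value in the group satisfies >=.
>     return decode(stats.getMin()).compareTo(threshold) >= 0;
>   }
>
>   // Example wiring, e.g. ss_sales_price >= 100.00 with scale 2.
>   public static FilterPredicate example() {
>     return FilterApi.userDefined(FilterApi.binaryColumn("ss_sales_price"),
>         new DecimalGtEq(new BigDecimal("100.00"), 2));
>   }
> }
> {noformat}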


