Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2023/03/21 08:25:00 UTC
[jira] [Commented] (HIVE-27159) Filters are not pushed down for decimal format in Parquet
[ https://issues.apache.org/jira/browse/HIVE-27159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703090#comment-17703090 ]
Stamatis Zampetakis commented on HIVE-27159:
--------------------------------------------
This looks like a duplicate of HIVE-26091; let's close one of them.
> Filters are not pushed down for decimal format in Parquet
> ---------------------------------------------------------
>
> Key: HIVE-27159
> URL: https://issues.apache.org/jira/browse/HIVE-27159
> Project: Hive
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance
>
> Decimal filters are not created and pushed down to the Parquet readers. The predicate conversion throws an exception at runtime, so Parquet-level pruning is skipped and far more rows than necessary are scanned, adding latency to query execution.
> E.g. TPC-DS Q13:
> {noformat}
> Parquet: (Map 1)
> INFO : Task Execution Summary
> INFO : ----------------------------------------------------------------------------------------------
> INFO :   VERTICES       DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
> INFO : ----------------------------------------------------------------------------------------------
> INFO :   Map 1              31254.00              0             0     549,181,950              133
> INFO :   Map 3                  0.00              0             0          73,049              365
> INFO :   Map 4               2027.00              0             0       6,000,000        1,689,919
> INFO :   Map 5                  0.00              0             0           7,200            1,440
> INFO :   Map 6                517.00              0             0       1,920,800          493,920
> INFO :   Map 7                  0.00              0             0           1,002            1,002
> INFO :   Reducer 2          18716.00              0             0             133                0
> INFO : ----------------------------------------------------------------------------------------------
> INFO : ----------------------------------------------------------------------------------------------
> ORC:
> INFO : Task Execution Summary
> INFO : ----------------------------------------------------------------------------------------------
> INFO :   VERTICES       DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
> INFO : ----------------------------------------------------------------------------------------------
> INFO :   Map 1               6556.00              0             0     267,146,063              152
> INFO :   Map 3                  0.00              0             0          10,000              365
> INFO :   Map 4               2014.00              0             0       6,000,000        1,689,919
> INFO :   Map 5                  0.00              0             0           7,200            1,440
> INFO :   Map 6                504.00              0             0       1,920,800          493,920
> INFO :   Reducer 2           3159.00              0             0             152                0
> INFO : ----------------------------------------------------------------------------------------------
> {noformat}
> {noformat}
> Map 1
>     Map Operator Tree:
>         TableScan
>           alias: store_sales
>           filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
>           probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
>           Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats: COMPLETE Column stats: COMPLETE
>           Filter Operator
>             predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
>             Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE
>             Select Operator
>               expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
>               Statistics: Num rows: 2500252205 Data size: 714761816164 Basic stats: COMPLETE Column stats: COMPLETE
>               Map Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 keys:
>                   0 _col7 (type: bigint)
>                   1 _col0 (type: bigint)
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13
>                 input vertices:
>                   1 Map 3
>                 Statistics: Num rows: 502508168 Data size: 127400492016 Basic stats: COMPLETE Column stats: COMPLETE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                   keys:
>                     0 _col2 (type: bigint)
>                     1 _col0 (type: bigint)
>                   outputColumnNames: _col0, _col1, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13, _col16, _col17, _col18
>                   input vertices:
>                     1 Map 4
>                   Statistics: Num rows: 86972608 Data size: 10207471112 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((_col16 and _col8) or (_col17 and _col9) or (_col18 and _col10)) (type: boolean)
>                     Statistics: Num rows: 65229456 Data size: 7655603392 Basic stats: COMPLETE Column stats: COMPLETE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       keys:
>                         0 _col1 (type: bigint)
>                         1 _col0 (type: bigint)
>                       outputColumnNames: _col0, _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21
>                       input vertices:
>                         1 Map 5
>                       Statistics: Num rows: 13045892 Data size: 260918084 Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col0 (type: bigint)
>                           1 _col0 (type: bigint)
>                         outputColumnNames: _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21, _col23, _col24, _col25, _col26, _col27, _col28
>                         input vertices:
>                           1 Map 6
>                         Statistics: Num rows: 3354659 Data size: 147605232 Basic stats: COMPLETE Column stats: COMPLETE
>                         Filter Operator
>                           predicate: ((_col23 and _col24 and _col11 and _col20) or (_col25 and _col26 and _col12 and _col21) or (_col27 and _col28 and _col13 and _col21)) (type: boolean)
>                           Statistics: Num rows: 628998 Data size: 27676148 Basic stats: COMPLETE Column stats: COMPLETE
>                           Map Join Operator
>                             condition map:
>                                  Inner Join 0 to 1
>                             keys:
>                               0 _col3 (type: bigint)
>                               1 _col0 (type: bigint)
>                             outputColumnNames: _col4, _col5, _col6
>                             input vertices:
>                               1 Map 7
>                             Statistics: Num rows: 628998 Data size: 228 Basic stats: COMPLETE Column stats: COMPLETE
>                             Group By Operator
>                               aggregations: sum(_col4), count(_col4), sum(_col5), count(_col5), sum(_col6), count(_col6)
>                               minReductionHashAggr: 0.99
>                               mode: hash
>                               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>                               Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
>                               Reduce Output Operator
>                                 null sort order:
>                                 sort order:
>                                 Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
>                                 value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: decimal(17,2)), _col3 (type: bigint), _col4 (type: decimal(17,2)), _col5 (type: bigint)
> {noformat}
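
In the plan above, ss_sales_price and ss_net_profit are decimal columns (the table uses decimal(7,2)), so none of those range predicates reach the Parquet reader as FilterPredicates. When a Parquet file stores a DECIMAL column with an INT32/INT64 physical type, the comparison can be done on unscaled values; below is a minimal, self-contained sketch of that literal conversion (the class and helper names are hypothetical, not Hive's actual code):

```java
import java.math.BigDecimal;

public class DecimalLiteral {
    // Hypothetical helper: convert a decimal filter literal (e.g. 100 for a
    // decimal(7,2) column) to the unscaled long that an INT32/INT64-backed
    // Parquet DECIMAL column physically stores, so it could feed a numeric
    // pushdown comparison instead of failing.
    static long toUnscaledLong(BigDecimal literal, int columnScale) {
        // Rescale to the column's declared scale, then drop the decimal point:
        // 100 at scale 2 -> 10000; 1.5 at scale 2 -> 150.
        // setScale with no rounding mode throws if the literal cannot be
        // represented exactly at that scale, which would need separate handling.
        return literal.setScale(columnScale).unscaledValue().longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toUnscaledLong(new BigDecimal("100"), 2)); // 10000
        System.out.println(toUnscaledLong(new BigDecimal("1.5"), 2)); // 150
    }
}
```

Rescaling to the column's declared scale first keeps the comparison exact; DECIMAL columns stored as FIXED_LEN_BYTE_ARRAY/BINARY would need a byte-wise encoding of the unscaled value instead.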
> Stack:
> {noformat}
> fail to build predicate filter leaf with errors org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
> org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
>     at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:210)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
>     at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:202)
>     at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:139)
>     at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setupMetadataAndParquetSplit(ParquetRecordReaderBase.java:88)
>     at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:178)
>     at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:52)
>     at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
>     at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
>     at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:460)
>     at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
>     at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
>     at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
>     at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:297)
>     at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>     at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:84)
>     at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:70)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:70)
>     at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:40)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
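
The stack shows LeafFilterFactory.getLeafFilterBuilderByType has no branch for DECIMAL, so the converter gives up and the scan falls back to reading everything. The shape of a fix is to add a DECIMAL case to that type dispatch; here is a hedged, self-contained sketch of the pattern (all names are hypothetical, not Hive's actual classes or API):

```java
import java.math.BigDecimal;

// Sketch of the dispatch pattern behind the stack above: a factory maps a
// column's logical type to a pushdown-filter builder, and DECIMAL currently
// has no branch, so the factory throws and no filter is pushed down.
public class LeafFilterSketch {
    enum LogicalType { INT, BIGINT, STRING, DECIMAL, OTHER }

    // Hypothetical builder: turns a predicate literal into the value a
    // Parquet-level comparison would use.
    interface LeafBuilder { Object convertLiteral(Object literal); }

    static LeafBuilder builderFor(LogicalType type, int decimalScale) {
        switch (type) {
            case INT:
            case BIGINT:
                return literal -> literal; // numeric literals pass through
            case STRING:
                return literal -> literal.toString();
            case DECIMAL:
                // The missing case: for a DECIMAL column stored as INT32/INT64,
                // compare unscaled values instead of throwing.
                return literal -> ((BigDecimal) literal)
                        .setScale(decimalScale).unscaledValue().longValueExact();
            default:
                // Mirrors the behaviour seen in the stack trace.
                throw new UnsupportedOperationException(
                        "Conversion to Parquet FilterPredicate not supported for " + type);
        }
    }

    public static void main(String[] args) {
        Object v = builderFor(LogicalType.DECIMAL, 2)
                .convertLiteral(new BigDecimal("100"));
        System.out.println(v); // 10000
    }
}
```

Any real fix in Hive would also have to handle DECIMAL columns stored as FIXED_LEN_BYTE_ARRAY/BINARY, which this sketch does not attempt.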
--
This message was sent by Atlassian Jira
(v8.20.10#820010)