Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2023/03/20 23:05:00 UTC
[jira] [Created] (HIVE-27159) Filters are not pushed down for decimal format in Parquet
Rajesh Balamohan created HIVE-27159:
---------------------------------------
Summary: Filters are not pushed down for decimal format in Parquet
Key: HIVE-27159
URL: https://issues.apache.org/jira/browse/HIVE-27159
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan
Decimal filters are not created and pushed down to Parquet readers. At runtime the conversion throws an exception and the filter is dropped, so the scan processes far more rows than necessary and query latency increases.
E.g. Q13:
{noformat}
Parquet: (Map 1)
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO :   VERTICES      DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO :      Map 1          31254.00              0            0      549,181,950              133
INFO :      Map 3              0.00              0            0           73,049              365
INFO :      Map 4           2027.00              0            0        6,000,000        1,689,919
INFO :      Map 5              0.00              0            0            7,200            1,440
INFO :      Map 6            517.00              0            0        1,920,800          493,920
INFO :      Map 7              0.00              0            0            1,002            1,002
INFO :  Reducer 2          18716.00              0            0              133                0
INFO : ----------------------------------------------------------------------------------------------

ORC:
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO :   VERTICES      DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO :      Map 1           6556.00              0            0      267,146,063              152
INFO :      Map 3              0.00              0            0           10,000              365
INFO :      Map 4           2014.00              0            0        6,000,000        1,689,919
INFO :      Map 5              0.00              0            0            7,200            1,440
INFO :      Map 6            504.00              0            0        1,920,800          493,920
INFO :  Reducer 2           3159.00              0            0              152                0
INFO : ----------------------------------------------------------------------------------------------
{noformat}
{noformat}
Map 1
    Map Operator Tree:
        TableScan
          alias: store_sales
          filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
          probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
          Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats: COMPLETE Column stats: COMPLETE
          Filter Operator
            predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
            Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
              Statistics: Num rows: 2500252205 Data size: 714761816164 Basic stats: COMPLETE Column stats: COMPLETE
              Map Join Operator
                condition map:
                  Inner Join 0 to 1
                keys:
                  0 _col7 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13
                input vertices:
                  1 Map 3
                Statistics: Num rows: 502508168 Data size: 127400492016 Basic stats: COMPLETE Column stats: COMPLETE
                Map Join Operator
                  condition map:
                    Inner Join 0 to 1
                  keys:
                    0 _col2 (type: bigint)
                    1 _col0 (type: bigint)
                  outputColumnNames: _col0, _col1, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13, _col16, _col17, _col18
                  input vertices:
                    1 Map 4
                  Statistics: Num rows: 86972608 Data size: 10207471112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((_col16 and _col8) or (_col17 and _col9) or (_col18 and _col10)) (type: boolean)
                    Statistics: Num rows: 65229456 Data size: 7655603392 Basic stats: COMPLETE Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                        Inner Join 0 to 1
                      keys:
                        0 _col1 (type: bigint)
                        1 _col0 (type: bigint)
                      outputColumnNames: _col0, _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21
                      input vertices:
                        1 Map 5
                      Statistics: Num rows: 13045892 Data size: 260918084 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                          Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21, _col23, _col24, _col25, _col26, _col27, _col28
                        input vertices:
                          1 Map 6
                        Statistics: Num rows: 3354659 Data size: 147605232 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((_col23 and _col24 and _col11 and _col20) or (_col25 and _col26 and _col12 and _col21) or (_col27 and _col28 and _col13 and _col21)) (type: boolean)
                          Statistics: Num rows: 628998 Data size: 27676148 Basic stats: COMPLETE Column stats: COMPLETE
                          Map Join Operator
                            condition map:
                              Inner Join 0 to 1
                            keys:
                              0 _col3 (type: bigint)
                              1 _col0 (type: bigint)
                            outputColumnNames: _col4, _col5, _col6
                            input vertices:
                              1 Map 7
                            Statistics: Num rows: 628998 Data size: 228 Basic stats: COMPLETE Column stats: COMPLETE
                            Group By Operator
                              aggregations: sum(_col4), count(_col4), sum(_col5), count(_col5), sum(_col6), count(_col6)
                              minReductionHashAggr: 0.99
                              mode: hash
                              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                              Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
                              Reduce Output Operator
                                null sort order:
                                sort order:
                                Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
                                value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: decimal(17,2)), _col3 (type: bigint), _col4 (type: decimal(17,2)), _col5 (type: bigint)
{noformat}
Stack:
{noformat}
fail to build predicate filter leaf with errors
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:210)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:202)
at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:139)
at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setupMetadataAndParquetSplit(ParquetRecordReaderBase.java:88)
at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:178)
at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:52)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:460)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:297)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:84)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:70)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:70)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:40)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
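The stack shows LeafFilterFactory.getLeafFilterBuilderByType bailing out because it has no builder for DECIMAL leaves. The missing piece essentially amounts to encoding the decimal literal in the byte layout Parquet uses for DECIMAL(p,s) stored as FIXED_LEN_BYTE_ARRAY (big-endian, sign-extended unscaled value), so it can be compared with parquet-mr's binary predicates (FilterApi.binaryColumn / gtEq / ltEq). A minimal sketch of that encoding step using only the JDK; the class and helper names are hypothetical illustrations, not Hive code:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;

public class DecimalToParquetBinary {
    // Hypothetical helper: encode a decimal literal as the big-endian,
    // sign-extended bytes Parquet stores for DECIMAL(p,s) in a
    // FIXED_LEN_BYTE_ARRAY of the given type length.
    static byte[] encode(BigDecimal value, int scale, int typeLength) {
        // Rescale to the column's scale, then take the two's-complement
        // unscaled representation (e.g. 100.00 at scale 2 -> 10000).
        byte[] unscaled = value.setScale(scale, RoundingMode.UNNECESSARY)
                               .unscaledValue().toByteArray();
        if (unscaled.length > typeLength) {
            throw new IllegalArgumentException("value does not fit in " + typeLength + " bytes");
        }
        byte[] out = new byte[typeLength];
        // Sign-extend: pad leading bytes with 0x00 (non-negative) or 0xFF (negative).
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, typeLength - unscaled.length, pad);
        System.arraycopy(unscaled, 0, out, typeLength - unscaled.length, unscaled.length);
        return out;
    }

    public static void main(String[] args) {
        // decimal(7,2) fits in 4 bytes; 100.00 has unscaled value 10000 = 0x2710
        System.out.println(Arrays.toString(encode(new BigDecimal("100.00"), 2, 4)));
        // -> [0, 0, 39, 16]
    }
}
```

Because this layout preserves ordering byte-wise for a fixed length (with the sign extension), the resulting bytes could back a binary comparison predicate for filters like ss_sales_price >= 100 above; whether the eventual Hive fix takes this route is up to the patch.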
--
This message was sent by Atlassian Jira
(v8.20.10#820010)