Posted to issues-all@impala.apache.org by "Qifan Chen (Jira)" <ji...@apache.org> on 2020/08/14 17:35:00 UTC
[jira] [Resolved] (IMPALA-9744) Treat corrupt table stats as missing to avoid bad plans
[ https://issues.apache.org/jira/browse/IMPALA-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qifan Chen resolved IMPALA-9744.
--------------------------------
Fix Version/s: Impala 4.0
Resolution: Fixed
> Treat corrupt table stats as missing to avoid bad plans
> -------------------------------------------------------
>
> Key: IMPALA-9744
> URL: https://issues.apache.org/jira/browse/IMPALA-9744
> Project: IMPALA
> Issue Type: Sub-task
> Components: Frontend
> Reporter: Tim Armstrong
> Assignee: Qifan Chen
> Priority: Major
> Labels: ramp-up
> Fix For: Impala 4.0
>
>
> We currently detect corrupt stats (0 rows but data in partition) but only flag them; the 0 row count is still used for planning. I ran into a scenario where this led to an extremely pathological plan: the 0 row count led to flipping a nested loop join to put the big table on the build side, and the query ran out of memory.
> I propose doing something very conservative to avoid this scenario: if we see corrupt stats in any partition, and the row count is computed to be zero, ignore the row count and treat it the same as missing stats in the planner.
> Here's an example where we end up with corrupt stats. Warning: this can remove the data file from your alltypes table, so I recommend copying the file to a different location before running this.
> {noformat}
> # In beeline against HS2
> !connect jdbc:hive2://localhost:11050 hive org.apache.hive.jdbc.HiveDriver
> set hive.stats.autogather=true;
> CREATE TABLE `alltypes_insert_only`(
> `id` int COMMENT 'Add a comment',
> `bool_col` boolean,
> `tinyint_col` tinyint,
> `smallint_col` smallint,
> `int_col` int,
> `bigint_col` bigint,
> `float_col` float,
> `double_col` double,
> `date_string_col` string,
> `string_col` string,
> `timestamp_col` timestamp)
> PARTITIONED BY (
> `year` int,
> `month` int)
> STORED AS PARQUET
> TBLPROPERTIES ("transactional"="true", "transactional_properties"="insert_only");
> load data inpath 'hdfs://172.19.0.1:20500/test-warehouse/alltypes_parquet/year=2009/month=1/154473eafa08ea0e-f9d70e7100000004_1040780996_data.0.parq' into table alltypes_insert_only partition (year=2009,month=9);
> # In Impala
> show table stats alltypes_insert_only;
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | year | month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | 2009 | 10 | 0 | 1 | 7.75KB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://172.19.0.1:20500/test-warehouse/managed/alltypes_insert_only/year=2009/month=10 |
> | Total | | -1 | 1 | 7.75KB | 0B | | | | |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> {noformat}
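The rule proposed in the description above (if any partition has corrupt stats and the computed row count is zero, treat the table as having no stats) can be sketched roughly as follows. This is only an illustration in Python, not the actual Impala frontend change: the `Partition` class, the function names, and the use of -1 as the "missing stats" marker (mirroring the Total row in the output above) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumption: -1 stands for "row count unknown / stats missing".
MISSING = -1


@dataclass
class Partition:
    """Hypothetical stand-in for per-partition stats."""
    num_rows: int    # row count recorded in stats, -1 if never computed
    size_bytes: int  # total size of the partition's data files


def has_corrupt_stats(p: Partition) -> bool:
    # Corrupt as described in the issue: stats say 0 rows, but data exists.
    return p.num_rows == 0 and p.size_bytes > 0


def planner_row_count(partitions: List[Partition]) -> int:
    """Row count the planner should use, or MISSING if it cannot be trusted."""
    known = [p.num_rows for p in partitions if p.num_rows >= 0]
    if not known:
        return MISSING
    total = sum(known)
    # Conservative rule: a zero total combined with any corrupt partition
    # is treated exactly like missing stats.
    if total == 0 and any(has_corrupt_stats(p) for p in partitions):
        return MISSING
    return total


if __name__ == "__main__":
    # One partition with 0 rows but a 7.75KB file, as in the example above.
    print(planner_row_count([Partition(num_rows=0, size_bytes=7936)]))
```

Note that the sketch leaves a genuinely empty table (0 rows and 0 bytes) at a row count of 0, and keeps a non-zero total even when one partition is corrupt, which matches the "very conservative" scope of the proposal.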