You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/08/13 03:34:00 UTC
[jira] [Commented] (IMPALA-9744) Treat corrupt table stats as missing to avoid bad plans

    [ https://issues.apache.org/jira/browse/IMPALA-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176741#comment-17176741 ] 

ASF subversion and git services commented on IMPALA-9744:
---------------------------------------------------------

Commit 5fedf7bf7247240ae1d356a393179f80b97d5cb5 in impala's branch refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5fedf7b ]

IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans

This work addresses the current limitation in computing the total row
count for a Hive table in a scan. The row count can be incorrectly
computed as 0, even though there exists data in the Hive table. This
is the stats corruption at table level. Similar stats corruption
exists for a partition. The row count of a table or a partition
sometime can also be -1 which indicates a missing stats situation.

In the fix, as long as no partition in a Hive table exhibits any
missing or corrupt stats, the total row count for the table is computed
from the row counts in all partitions. Otherwise, Impala looks at
the table level stats particularly the table row count.

In addition, if the table stats is missing or corrupted, Impala
estimates a row count for the table, if feasible. This row count is
the sum of the row count from the partitions with good stats, and
an estimation of the number of rows in the partitions with missing or
corrupt stats. Such estimation also applies when some partition
has corrupt stats.

One way to observe the fix is through the explain of queries scanning
Hive tables with missing or corrupted stats. The cardinality for any
full scan should be a positive value (i.e. the estimated row count),
instead of 'unavailable'.  At the beginning of the explain output,
that table is still listed in the WARNING section for potentially
corrupt table statistics.

Testing:
1. Ran unit tests with queries documented in the case against Hive
   tables with the following configrations:
   a. No stats corruption in any partitions
   b. Stats corruption in some partitions
   c. Stats corruption in all partitions
2. Added two new tests in test_compute_stats.py:
   a. test_corrupted_stats_in_partitioned_Hive_tables
   b. test_corrupted_stats_in_unpartitioned_Hive_tables
3. Fixed failures in corrupt-stats.test
4. Ran "core" test

Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Reviewed-on: http://gerrit.cloudera.org:8080/16098
Reviewed-by: Sahil Takiar <st...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Treat corrupt table stats as missing to avoid bad plans
> -------------------------------------------------------
>
>                 Key: IMPALA-9744
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9744
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Tim Armstrong
>            Assignee: Qifan Chen
>            Priority: Major
>              Labels: ramp-up
>
> We currently detect corrupt stats (0 rows but data in partition) but only flag it. The 0 row count is used for planning. I ran into a scenario where this lead to an extremely pathological plan - the 0 row count lead to flipping a nested loop join to put the big table on the build side and running out of memory.
> I propose doing something very conservative to avoid this scenario: if we see corrupt stats in any partition, and the row count is computed to be zero, ignore the row count and treat it the same as missing stats in the planner.
> Here's an example where we end up with corrupt stats. Warning: this can remove the data file from your alltypes type, I recommend copying the file to a different location before running this.
> {noformat}
> # In beeline against HS2
> !connect jdbc:hive2://localhost:11050 hive org.apache.hive.jdbc.HiveDrive
> set hive.stats.autogather=true;
> CREATE TABLE `alltypes_insert_only`(
>    `id` int COMMENT 'Add a comment',
>    `bool_col` boolean,
>    `tinyint_col` tinyint,
>    `smallint_col` smallint,
>    `int_col` int,
>    `bigint_col` bigint,
>    `float_col` float,
>    `double_col` double,
>    `date_string_col` string,
>    `string_col` string,
>    `timestamp_col` timestamp)
>  PARTITIONED BY (
>    `year` int,
>    `month` int)
>  STORED AS PARQUET
>  TBLPROPERTIES ("transactional"="true", "transactional_properties"="insert_only");
> load data inpath 'hdfs://172.19.0.1:20500/test-warehouse/alltypes_parquet/year=2009/month=1/154473eafa08ea0e-f9d70e7100000004_1040780996_data.0.parq' into table alltypes_insert_only partition (year=2009,month=9);
> # In Impala
> show table stats alltypes_insert_only;
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | year  | month | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                               |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | 2009  | 10    | 0     | 1      | 7.75KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://172.19.0.1:20500/test-warehouse/managed/alltypes_insert_only/year=2009/month=10 |
> | Total |       | -1    | 1      | 7.75KB | 0B           |                   |         |                   |                                                                                        |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org