Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/09/26 02:23:00 UTC

[jira] [Commented] (IMPALA-2201) Compute [incremental] stats may not persist the stats if the data was loaded from Hive with hive.stats.autogather=true.

    [ https://issues.apache.org/jira/browse/IMPALA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768967#comment-17768967 ] 

ASF subversion and git services commented on IMPALA-2201:
---------------------------------------------------------

Commit 45d6815821a29b83c7a3daa3d380a40e0e4f3836 in impala's branch refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=45d681582 ]

IMPALA-12462: Update only changed partitions after COMPUTE STATS

This is mainly a revert of https://gerrit.cloudera.org/#/c/640/ but
some parts had to be updated due to changes in Impala.
See IMPALA-2201 for details about why this optimization was removed.

The patch can massively speed up the COMPUTE STATS statement when the
majority of partitions have no changes.
COMPUTE STATS tpcds_parquet.store_sales;
before: 12s
after:   1s

Besides the DDL speedup, the number of HMS events generated is also
reduced.

Testing:
- added test to verify COMPUTE STATS output
- correctness of cases when something is modified should be covered
  by existing tests
- core tests passed

Change-Id: If2703e0790d5c25db98ed26f26f6d96281c366a3
Reviewed-on: http://gerrit.cloudera.org:8080/20505
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Wenzhe Zhou <wz...@cloudera.com>
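
As a rough illustration of "update only changed partitions", the idea can be sketched as a filter that keeps only the partitions whose newly computed stats differ from what is already stored. This is a minimal sketch, not the code from the patch: the class name, the method name, and the choice to compare only row counts are assumptions made for illustration, and the commit message does not say which fields the real change compares.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChangedPartitionsSketch {
    // Hypothetical helper: returns the names of partitions whose new row count
    // is missing from, or differs from, the existing row counts.
    static List<String> partitionsToUpdate(Map<String, Long> existingRowCounts,
                                           Map<String, Long> newRowCounts) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : newRowCounts.entrySet()) {
            Long existing = existingRowCounts.get(entry.getKey());
            if (existing == null || !existing.equals(entry.getValue())) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // TreeMap keeps the partition names in sorted order for stable output.
        Map<String, Long> existing = new TreeMap<>();
        existing.put("p=1", 10L);
        existing.put("p=2", 20L);

        Map<String, Long> computed = new TreeMap<>();
        computed.put("p=1", 10L); // unchanged, skipped
        computed.put("p=2", 25L); // row count changed
        computed.put("p=3", 5L);  // new partition

        System.out.println(partitionsToUpdate(existing, computed)); // prints [p=2, p=3]
    }
}
```

With most partitions unchanged, the list of partitions to alter stays short, which is consistent with the reduced DDL time and fewer HMS events reported above.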


> Compute [incremental] stats may not persist the stats if the data was loaded from Hive with hive.stats.autogather=true.
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-2201
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2201
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.2
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Blocker
>              Labels: correctness, supportability, usability
>             Fix For: Impala 2.2.7, Impala 2.3.0
>
>
> *Symptoms of This Bug*
> - Stats have been computed, but the row count reverts back to -1 after an INVALIDATE METADATA
> - A compute [incremental] stats appears to not set the row count
> Example scenario where this bug may happen:
> 1. A new partition with new data is loaded into a table via Hive
> 2. Hive has hive.stats.autogather=true
> 3. Stats on the new partition are computed in Impala with COMPUTE INCREMENTAL STATS <partition>
> 4. At this point, SHOW TABLE STATS shows the correct row count
> 5. INVALIDATE METADATA is run on the table in Impala
> 6. The row count reverts back to -1 because the stats have not been persisted
> *Explanation for This Bug*
> Here is why the stats are reset to -1. When hive.stats.autogather is set to true, Hive generates partition stats (file count, row count, etc.) after creating the partition. If you then run "compute incremental stats" in Impala, you will get the same row count, so the following check will not be satisfied and StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK will not be set in Impala's CatalogOpExecutor.java:
> {code}
> ...
>       // Update table stats
>       if (existingRowCount == null || !existingRowCount.equals(newRowCount)) {
>         // The existing row count value wasn't set or has changed.
>         msPartition.putToParameters(StatsSetupConst.ROW_COUNT, newRowCount);
>         msPartition.putToParameters(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK,
>             StatsSetupConst.TRUE);
>         updatedPartition = true;
>       }
> ...
> {code}
> When executing the corresponding alterPartition() RPC in the Hive Metastore, the row count will be reset because the STATS_GENERATED_VIA_STATS_TASK parameter was not set.
> Snippet from Hive's MetaStoreUtils.java:
> {code}
> ...
> public static boolean updatePartitionStatsFast(PartitionSpecProxy.PartitionIterator part, Warehouse wh,
>       boolean madeDir, boolean forceRecompute) throws MetaException {
> ...
>         if(!params.containsKey(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK)) {
>           // invalidate stats requiring scan since this is a regular ddl alter case
>           for (String stat : StatsSetupConst.statsRequireCompute) {
>             params.put(stat, "-1");
>           }
>           params.put(StatsSetupConst.COLUMN_STATS_ACCURATE, StatsSetupConst.FALSE);
>         }
> ...
> {code}
> So if partition stats already exist but were not computed by Impala, compute incremental stats will cause the stats to be reset back to -1.
> Note that in Hive versions after CDH 5.3 this bug no longer happens because the updatePartitionStatsFast() function is no longer called in the Hive Metastore in the above workflow.
> *Workarounds*
> 1. Disable stats autogathering in Hive when loading the data
> {code}
> SET hive.stats.autogather=false;
> {code}
> 2. Manually alter the numRows to -1 before doing COMPUTE [INCREMENTAL] STATS in Impala
> {code}
> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES ('numRows'='-1');
> {code}
> 3. When already in the broken "-1" state, re-computing the stats for the affected partition fixes the problem
> *Proposed Solution*
> While this is arguably a Hive bug, I'd recommend that Impala should just unconditionally update the stats when running a COMPUTE STATS. Making the behavior dependent on the existing metadata state is brittle and hard to reason about and debug, esp. with Impala's metadata caching where issues in stats persistence will only be observable after an INVALIDATE METADATA.
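
The interaction described in the quoted issue can be reproduced in a small standalone model. The sketch below is an illustration under stated assumptions, not code from Impala or Hive: the class and method names are hypothetical, partition parameters are modeled as a plain map, the Hive side resets only the row count (the quoted code resets every stat in statsRequireCompute and also clears COLUMN_STATS_ACCURATE), and alterPartition() is assumed to be called regardless of whether the row count changed.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsResetSketch {
    // Assumed string values standing in for the StatsSetupConst keys named in the issue.
    static final String ROW_COUNT = "numRows";
    static final String STATS_TASK_FLAG = "STATS_GENERATED_VIA_STATS_TASK";

    // Models the conditional update quoted from CatalogOpExecutor.java:
    // the flag is only set when the row count is missing or has changed.
    static void impalaConditionalUpdate(Map<String, String> params, String newRowCount) {
        String existing = params.get(ROW_COUNT);
        if (existing == null || !existing.equals(newRowCount)) {
            params.put(ROW_COUNT, newRowCount);
            params.put(STATS_TASK_FLAG, "true");
        }
    }

    // Models the proposed solution: always write the row count and the flag.
    static void impalaUnconditionalUpdate(Map<String, String> params, String newRowCount) {
        params.put(ROW_COUNT, newRowCount);
        params.put(STATS_TASK_FLAG, "true");
    }

    // Models the reset quoted from updatePartitionStatsFast(): without the flag,
    // the metastore treats the call as a regular DDL alter and invalidates the count.
    static Map<String, String> metastoreAlterPartition(Map<String, String> params) {
        Map<String, String> stored = new HashMap<>(params);
        if (!stored.containsKey(STATS_TASK_FLAG)) {
            stored.put(ROW_COUNT, "-1");
        }
        return stored;
    }

    // Returns the row count that ends up persisted for a partition whose
    // row count (100) was already gathered by Hive with autogather enabled.
    static String persistedRowCount(boolean unconditional) {
        Map<String, String> params = new HashMap<>();
        params.put(ROW_COUNT, "100"); // written by Hive, no stats-task flag
        if (unconditional) {
            impalaUnconditionalUpdate(params, "100");
        } else {
            impalaConditionalUpdate(params, "100");
        }
        return metastoreAlterPartition(params).get(ROW_COUNT);
    }

    public static void main(String[] args) {
        System.out.println("conditional:   " + persistedRowCount(false)); // -1
        System.out.println("unconditional: " + persistedRowCount(true));  // 100
    }
}
```

In this model the conditional path loses the row count precisely because Impala computes the same value Hive already stored, while the unconditional path keeps it.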


