Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/08/16 23:24:00 UTC

[jira] [Commented] (IMPALA-8865) Do COMPUTE STATS on ACID tables in a "proper" transactional way

    [ https://issues.apache.org/jira/browse/IMPALA-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909480#comment-16909480 ] 

ASF subversion and git services commented on IMPALA-8865:
---------------------------------------------------------

Commit 8c5ea90aa53dd925ec038ef9d8ea71e7919e3127 in impala's branch refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8c5ea90 ]

IMPALA-8836: Support COMPUTE STATS on insert only ACID tables

For ACID tables, COMPUTE STATS needs to use a new HMS API, as the
old one is rejected by the metastore. This API currently has some
counterintuitive parts:
- setPartitionColumnStatistics is used to set table stats, as there
  is no similar function exposed by HMS client for tables at the
  moment.
- A new writeId is allocated for the stat change, and this needs
  a transaction, so a transaction is opened/committed/aborted even
  though this doesn't seem necessary. The Hive code seems to use
  an internal API for this.
- Even though the HMS thrift Table object has a colStats field,
  it is only applied during alter_table if there are other changes,
  such as new columns in the table, so alter_table couldn't be used
  to change column stats.
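
The open/commit/abort wrapping described in the second point can be
sketched roughly as below. This is only an illustration of the pattern,
not the code in this change: FakeMetastore and all of its methods
(open_txn, allocate_write_id, set_stats, commit_txn, abort_txn) are
hypothetical in-memory stand-ins and are not the HMS client API.

```python
from contextlib import contextmanager


class FakeMetastore:
    """Hypothetical in-memory stand-in for a metastore client (NOT the HMS API)."""

    def __init__(self):
        self.next_txn_id = 1
        self.next_write_id = 1
        self.open_txns = set()
        self.committed = []
        self.aborted = []
        self.table_stats = {}

    def open_txn(self):
        txn_id = self.next_txn_id
        self.next_txn_id += 1
        self.open_txns.add(txn_id)
        return txn_id

    def allocate_write_id(self, txn_id, table):
        # A writeId can only be allocated inside an open transaction.
        if txn_id not in self.open_txns:
            raise ValueError("transaction %d is not open" % txn_id)
        write_id = self.next_write_id
        self.next_write_id += 1
        return write_id

    def set_stats(self, table, write_id, stats):
        self.table_stats[table] = (write_id, dict(stats))

    def commit_txn(self, txn_id):
        self.open_txns.remove(txn_id)
        self.committed.append(txn_id)

    def abort_txn(self, txn_id):
        self.open_txns.remove(txn_id)
        self.aborted.append(txn_id)


@contextmanager
def stats_transaction(client, table):
    """Open a transaction, yield a fresh writeId, then commit or abort."""
    txn_id = client.open_txn()
    try:
        write_id = client.allocate_write_id(txn_id, table)
        yield write_id
    except Exception:
        # Any failure while allocating or while the caller works: abort.
        client.abort_txn(txn_id)
        raise
    else:
        client.commit_txn(txn_id)


def update_table_stats(client, table, stats):
    """Set stats for a table under a writeId allocated just for this change."""
    with stats_transaction(client, table) as write_id:
        client.set_stats(table, write_id, stats)
    return write_id
```

In this sketch the transaction exists only to obtain the writeId, which
mirrors why the open/commit/abort looks unnecessary from the outside.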

Additional changes:
- DROP STATS is no longer allowed for transactional tables, as it
  turned out that there is no transactional version of the old API.
- Remove COLUMN_STATS_ACCURATE table property during COMPUTE STATS
  to ensure that Hive does not use stats computed by Impala to
  answer queries like SELECT count(*)
- Changed CatalogOpExecutor.updateCatalog() to get the writeIds
  earlier. This can mean unnecessary HMS RPC calls if no property
  change is needed in the end, but I found it hard to reason about
  what happens if these RPC calls fail at their original location.

TODOs (My plan is to do these in IMPALA-8865):
- Tried to make the MetastoreShim API easier to use by adding a class
  to encapsulate things like txnId and writeId, but it feels rather
  half-baked and under-documented.
  A similar class is added in https://gerrit.cloudera.org/#/c/14071/,
  it would be good to merge them.
- The validWriteIdList of the original SELECT(s) behind COMPUTE
  STATS could be used in the HMS API calls, but this would need
  more plumbing.

Change-Id: I5c06b4678c1ff75c5aa1586a78afea563e64057f
Reviewed-on: http://gerrit.cloudera.org:8080/14066
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Tim Armstrong <ta...@cloudera.com>


> Do COMPUTE STATS on ACID tables in a "proper" transactional way
> ---------------------------------------------------------------
>
>                 Key: IMPALA-8865
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8865
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend, Frontend
>    Affects Versions: Impala 3.3.0
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Critical
>              Labels: impala-acid
>
> IMPALA-8836's goal is just to get the stats in somehow, in a way that Impala can use them and Hive does not treat them as accurate. It would be best if the SELECT(s) behind the COMPUTE STATS used the same validWriteIdList, and the stats were set with that same validWriteIdList to express that the stats are based on that state of the table. Theoretically, Hive uses this mechanism to decide whether the stats are up to date: it compares a SELECT's validWriteIdList with the one saved for the stats and considers the stats stale if the SELECT sees new writeIds.
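
The staleness rule described above can be sketched as follows. This is
a simplified model under a stated assumption: each validWriteIdList is
represented as a plain set of visible writeIds, which is not how it is
actually encoded. The function name stats_are_stale is hypothetical.

```python
def stats_are_stale(stats_write_ids, reader_write_ids):
    """Return True if the reader sees any writeId the stats snapshot did not.

    Assumption: both arguments are plain collections of visible writeIds,
    a simplification of a real validWriteIdList.
    """
    return not set(reader_write_ids) <= set(stats_write_ids)


# Stats were computed when writeIds 1..3 were visible.
stats_snapshot = {1, 2, 3}

same_state = stats_are_stale(stats_snapshot, {1, 2, 3})      # reader sees nothing new
newer_state = stats_are_stale(stats_snapshot, {1, 2, 3, 4})  # reader sees writeId 4
```

Under this model a reader with an older snapshot than the stats is not
treated as stale; only newly visible writeIds trigger it.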



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
