Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/09/22 03:40:00 UTC

[jira] [Resolved] (IMPALA-5955) Use the totalSize Hive table property instead of rawDataSize

     [ https://issues.apache.org/jira/browse/IMPALA-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Behm resolved IMPALA-5955.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

commit 71fd1941f006bb8e7629c8bcfbcfd1da050deed1
Author: Alex Behm <al...@cloudera.com>
Date:   Mon Sep 18 20:40:58 2017 -0700

    IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
    
    Today, Impala populates the 'rawDataSize' property
    during COMPUTE STATS for the purpose of extrapolating
    row counts based on file sizes.
    
    After this patch Impala will populate 'totalSize' instead of
    'rawDataSize'. The 'rawDataSize' property is no longer populated or used.
    
    Intended meaning/use of tblproperties:
    - 'rawDataSize' is the estimated in-memory size of a table
      (without encoding and compression)
    - 'totalSize' represents the on-disk size
    
    Using the fields correctly is important for compatibility
    with other users of the HMS such as Hive and SparkSQL.
    For example, SparkSQL relies on the 'totalSize' for
    join ordering.
    
    Testing:
    - core/hdfs run passed
    
    Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
    Reviewed-on: http://gerrit.cloudera.org:8080/8110
    Tested-by: Impala Public Jenkins
    Reviewed-by: Alex Behm <al...@cloudera.com>
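
The commit message above says 'totalSize' is recorded during COMPUTE STATS so that row counts can be extrapolated from file sizes. The sketch below illustrates that idea only; it is not Impala's frontend code. The function name and the proportional-scaling formula are assumptions made for illustration, and the only property names taken from this message are 'totalSize' and 'rawDataSize'.

```python
from typing import Optional


def extrapolate_row_count(stats_num_rows: int,
                          stats_total_size: int,
                          current_file_bytes: int) -> Optional[int]:
    """Estimate the current row count from on-disk bytes (hypothetical helper).

    stats_num_rows:     row count recorded when stats were last computed
    stats_total_size:   on-disk bytes recorded at that time ('totalSize')
    current_file_bytes: on-disk bytes of the table's files right now

    Returns None when the recorded stats cannot support an estimate.
    """
    if stats_num_rows < 0 or stats_total_size <= 0 or current_file_bytes < 0:
        return None
    # Assumption: rows scale linearly with on-disk size, so the rows-per-byte
    # ratio observed when stats were computed still holds for the current files.
    return round(stats_num_rows * current_file_bytes / stats_total_size)
```

The ratio is only meaningful if both byte counts measure the same thing. Dividing current on-disk file bytes by an in-memory estimate such as 'rawDataSize' would mix encoded and decoded sizes, which is one way to read the motivation for the switch.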


> Use the totalSize Hive table property instead of rawDataSize
> ------------------------------------------------------------
>
>                 Key: IMPALA-5955
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5955
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog, Frontend
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Critical
>             Fix For: Impala 2.11.0
>
>
> IMPALA-2373 changed COMPUTE STATS to also populate the 'rawDataSize' table property for the purpose of row count extrapolation. However, we should use 'totalSize' instead of 'rawDataSize'. Based on searching Google and looking at the Hive code, it looks like the 'rawDataSize' roughly corresponds to the estimated in-memory size of a table (without encoding and compression), whereas the 'totalSize' property is used to represent the on-disk size.
> I confirmed in the SparkSQL code that it prefers the 'totalSize' property for query planning. Also, SparkSQL's ANALYZE TABLE populates the 'totalSize'. We should try to be as compatible as possible with Hive/SparkSQL to avoid hard-to-debug stats inconsistencies.


