Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/09/19 00:17:00 UTC

[jira] [Created] (IMPALA-5955) Use the totalSize Hive table property instead of rawDataSize

Alexander Behm created IMPALA-5955:
--------------------------------------

             Summary: Use the totalSize Hive table property instead of rawDataSize
                 Key: IMPALA-5955
                 URL: https://issues.apache.org/jira/browse/IMPALA-5955
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog, Frontend
            Reporter: Alexander Behm
            Assignee: Alexander Behm
            Priority: Critical


IMPALA-2373 changed COMPUTE STATS to also populate the 'rawDataSize' table property for the purpose of row count extrapolation. However, we should use 'totalSize' instead of 'rawDataSize'. Based on searching Google and looking at the Hive code, it looks like 'rawDataSize' roughly corresponds to the estimated in-memory size of a table (without encoding and compression), whereas 'totalSize' represents the on-disk size.
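
To illustrate why the choice of property matters for extrapolation, here is a minimal Python sketch. It is not Impala's actual implementation: the function name, the property values, and the current file size are all hypothetical, and it assumes extrapolation simply scales the row count recorded at stats time by the ratio of the current on-disk file bytes to the recorded size.

```python
def extrapolate_row_count(stats_num_rows, stats_size_bytes, current_file_bytes):
    """Scale the row count recorded at stats time by the growth in data size.

    Hypothetical helper for illustration only. Returns -1 when the recorded
    stats are missing or unusable.
    """
    if stats_num_rows < 0 or stats_size_bytes <= 0:
        return -1
    # Multiply before dividing so the integer product stays exact.
    return round(current_file_bytes * stats_num_rows / stats_size_bytes)


# Hypothetical table properties as stored in the metastore (string values).
props = {
    "numRows": "1000000",
    "totalSize": "50000000",     # on-disk bytes at stats time
    "rawDataSize": "200000000",  # estimated uncompressed in-memory bytes
}

# Assumed current sum of file sizes on disk: the table has doubled since stats.
current_file_bytes = 100_000_000

num_rows = int(props["numRows"])
with_total_size = extrapolate_row_count(
    num_rows, int(props["totalSize"]), current_file_bytes)
with_raw_data_size = extrapolate_row_count(
    num_rows, int(props["rawDataSize"]), current_file_bytes)

print(with_total_size)     # 2000000
print(with_raw_data_size)  # 500000
```

Because the current size is measured from files on disk, dividing by the in-memory 'rawDataSize' mixes units and, in this made-up example, underestimates the row count by 4x, whereas 'totalSize' gives the expected doubling.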

I confirmed in the SparkSQL code that it prefers the 'totalSize' property for query planning. Also, SparkSQL's ANALYZE TABLE populates 'totalSize'. We should try to be as compatible as possible with Hive/SparkSQL to avoid hard-to-debug stats inconsistencies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)