Posted to issues@hive.apache.org by "George Pachitariu (JIRA)" <ji...@apache.org> on 2018/09/08 19:29:00 UTC

[jira] [Commented] (HIVE-20523) Improve table statistics when the table contains arrays

    [ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608183#comment-16608183 ] 

George Pachitariu commented on HIVE-20523:
------------------------------------------

This is my understanding of why the original behaviour happens (and please correct me if I'm wrong):

rawDataSize is computed from the schema in the objectInspector (one example is [here|https://github.com/apache/hive/blob/rel/release-3.1.0/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L166]). The inspector for an array is UnionStructObjectInspector, which returns size = 1, so the contents of the array are not counted.

> Improve table statistics when the table contains arrays
> -------------------------------------------------------
>
>                 Key: HIVE-20523
>                 URL: https://issues.apache.org/jira/browse/HIVE-20523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>         Attachments: HIVE-20523.patch
>
>
> By default, when the table has table stats, the value of *rawDataSize* is used to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. As a result, the table size is underestimated when arrays make up most of the table's data.
> In those specific cases, the value of the *totalSize* is much closer to the truth.
> In this task I propose to take the *max* of *rawDataSize* and (*totalSize* times *deserializationFactor*).
> I don't know if this proposal will backfire in any specific cases (overestimating the size of tables).
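
A minimal sketch of the proposal quoted above, for illustration only. The class name, method name, and parameter types are hypothetical and are not taken from the attached patch or from Hive's code; the sketch only assumes that the three values (rawDataSize, totalSize, deserializationFactor) are already available as numbers.

```java
/** Hypothetical helper, not part of Hive: illustrates the proposed estimate. */
public final class TableSizeEstimate {

    private TableSizeEstimate() {
        // Static utility class, not meant to be instantiated.
    }

    /**
     * Returns the larger of rawDataSize and totalSize scaled by the
     * deserialization factor.
     *
     * @param rawDataSize           rawDataSize taken from the table stats
     * @param totalSize             totalSize taken from the table stats
     * @param deserializationFactor multiplier applied to totalSize
     *                              (assumed to be a plain float here)
     */
    public static long estimateDataSize(long rawDataSize,
                                        long totalSize,
                                        float deserializationFactor) {
        // Scale totalSize by the factor; do the arithmetic in double
        // to limit rounding on large sizes, then truncate to long.
        long scaledTotalSize = (long) (totalSize * (double) deserializationFactor);

        // When arrays dominate, rawDataSize is too small and the scaled
        // totalSize is used instead; otherwise rawDataSize is kept.
        return Math.max(rawDataSize, scaledTotalSize);
    }
}
```

With made-up numbers, estimateDataSize(1000L, 50000L, 3.0f) picks the scaled totalSize because rawDataSize is the smaller value, while estimateDataSize(200000L, 50000L, 3.0f) keeps rawDataSize. The second case is also where the overestimation concern applies: the max can only raise the estimate, never lower it.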



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)