Posted to issues@hive.apache.org by "George Pachitariu (JIRA)" <ji...@apache.org> on 2018/09/08 18:31:00 UTC

[jira] [Updated] (HIVE-20523) Improve table statistics when the table contains arrays

     [ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
-------------------------------------
    Description: 
By default, when table statistics are available, the value of *rawDataSize* is used to estimate the table's data size in the execution plan.

The problem is that *rawDataSize* does not include the data size of arrays, so the table size is underestimated when arrays make up most of the table.

In those cases, the value of *totalSize* is much closer to the true size.

In this task I propose taking the *max* of *rawDataSize* and *totalSize* * *deserializationFactor*.

I don't know whether this proposal will backfire in specific cases by overestimating the size of tables.
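
The proposed rule can be sketched as follows. This is only an illustration, not Hive's actual code: the class name TableSizeEstimate and the method estimateDataSize are hypothetical, and it assumes that deserializationFactor is a plain multiplier (presumably the value of the hive.stats.deserialization.factor setting) and that both statistics are byte counts.

```java
// Hypothetical helper illustrating the proposal; not taken from Hive's source.
final class TableSizeEstimate {
    private TableSizeEstimate() {}

    /**
     * Returns the larger of rawDataSize and totalSize scaled by the factor.
     *
     * rawDataSize           size from the table stats in bytes (may omit arrays)
     * totalSize             on-disk size of the table in bytes
     * deserializationFactor assumed multiplier from on-disk to in-memory size
     */
    static long estimateDataSize(long rawDataSize, long totalSize,
                                 double deserializationFactor) {
        // Scale the on-disk size, rounding up so the estimate is never too small.
        long scaledTotalSize = (long) Math.ceil(totalSize * deserializationFactor);
        // Use whichever estimate is larger, as the proposal suggests.
        return Math.max(rawDataSize, scaledTotalSize);
    }
}
```

For example, with rawDataSize = 1000, totalSize = 500 and a factor of 10, the estimate would be 5000 rather than 1000. When rawDataSize is already the larger value, it is returned unchanged, so tables without arrays should be affected only if the scaled totalSize exceeds their rawDataSize, which is the overestimation risk mentioned above.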


> Improve table statistics when the table contains arrays
> -------------------------------------------------------
>
>                 Key: HIVE-20523
>                 URL: https://issues.apache.org/jira/browse/HIVE-20523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>
> By default, when table statistics are available, the value of *rawDataSize* is used to estimate the table's data size in the execution plan.
> The problem is that *rawDataSize* does not include the data size of arrays, so the table size is underestimated when arrays make up most of the table.
> In those cases, the value of *totalSize* is much closer to the true size.
> In this task I propose taking the *max* of *rawDataSize* and *totalSize* * *deserializationFactor*.
> I don't know whether this proposal will backfire in specific cases by overestimating the size of tables.


