Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2018/04/30 16:19:00 UTC

[jira] [Resolved] (IMPALA-6678) Better estimate of per-column compressed data size for low-NDV columns.

     [ https://issues.apache.org/jira/browse/IMPALA-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-6678.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 3.1.0
                   Impala 2.13.0

> Better estimate of per-column compressed data size for low-NDV columns.
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-6678
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6678
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>    Affects Versions: Not Applicable
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: resource-management
>             Fix For: Impala 2.13.0, Impala 3.1.0
>
>
> In the previous IMPALA-4835 patch, we assumed that the "ideal" memory per Parquet column was 3 * 8MB, except when the total size of the file capped the total amount of memory we might use. This is often an overestimate, particularly for smaller files, files with large numbers of columns, and highly compressible data.
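> For reference, a minimal sketch of that earlier heuristic (the function and constant names here are illustrative, not taken from the Impala source):
> {code:python}
> DEFAULT_PAGE_SIZE = 8 * 1024 * 1024  # 8MB Parquet column buffer
>
> def old_ideal_reservation(file_size, num_scanned_columns):
>     # IMPALA-4835 heuristic: 3 buffers of 8MB per scanned column...
>     ideal = 3 * DEFAULT_PAGE_SIZE * num_scanned_columns
>     # ...capped by the total file size, since the scan never needs to
>     # buffer more bytes than the file contains.
>     return min(ideal, file_size)
> {code}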
> We could do something smarter for Parquet given file sizes, per-partition row counts, and column NDV. We can estimate each file's row count by apportioning the per-partition row count across files in proportion to file size, and estimate bytes per value with two methods (see the sketch after this list):
> * For fixed-width types, estimating bytes per value based on the declared type width. We don't necessarily know the physical Parquet type, but it seems reasonable to estimate based on the type declared in the table.
> * For any type, estimating log2(NDV) / 8 bytes per value, assuming that dictionary encoding or general-purpose compression will kick in.
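> A minimal sketch of how these pieces could combine (names are hypothetical, and taking the smaller of the two per-value estimates is my reading rather than anything prescribed here):
> {code:python}
> import math
>
> # Logical widths of fixed-width column types, in bytes (illustrative subset).
> FIXED_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8,
>                 "FLOAT": 4, "DOUBLE": 8, "TIMESTAMP": 12}
>
> def bytes_per_value(col_type, ndv):
>     estimates = []
>     if col_type in FIXED_WIDTHS:
>         # Method 1: assume the width of the type declared in the table,
>         # since the physical Parquet type is not known up front.
>         estimates.append(FIXED_WIDTHS[col_type])
>     if ndv and ndv > 1:
>         # Method 2: dictionary encoding or general-purpose compression
>         # needs roughly log2(NDV) bits, i.e. log2(NDV) / 8 bytes, per value.
>         estimates.append(math.log2(ndv) / 8)
>     return min(estimates) if estimates else None
>
> def estimate_compressed_column_bytes(partition_rows, partition_bytes,
>                                      file_size, col_type, ndv):
>     # Apportion the per-partition row count to this file by its share
>     # of the partition's bytes.
>     rows_in_file = partition_rows * (file_size / partition_bytes)
>     bpv = bytes_per_value(col_type, ndv)
>     return None if bpv is None else rows_in_file * bpv
> {code}
> With those numbers, e.g., a BIGINT column with NDV = 1000 in a file holding 10 million rows comes out at min(8, log2(1000) / 8) * 10M ~= 12.5MB, well under the 3 * 8MB default.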
> See https://docs.google.com/document/d/1kR0zfevNNUJom3sH1XmposacVZ-QALan7NSwnR5CkSA/edit#heading=h.a2b8e8h5a6en for some analysis. 
> I looked at encoded lineitem data and saw that many of the scanned columns were 3-4MB in size and that we could have estimated an ideal size < 24MB per column based on the above heuristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
