Posted to issues@impala.apache.org by "Csaba Ringhofer (JIRA)" <ji...@apache.org> on 2019/04/17 14:40:00 UTC

[jira] [Created] (IMPALA-8431) Parquet STRING column memory reservation seems underestimated

Csaba Ringhofer created IMPALA-8431:
---------------------------------------

             Summary: Parquet STRING column memory reservation seems underestimated
                 Key: IMPALA-8431
                 URL: https://issues.apache.org/jira/browse/IMPALA-8431
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.2.0
            Reporter: Csaba Ringhofer


https://github.com/apache/impala/blob/5fa076e95cfbfcc044dc14cbb20af825936af82a/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L1698

computeMinScalarColumnMemReservation() uses the avg_size column stat to estimate the memory needed per value during scanning, but avg_size does not include the 4-byte length field that plain encoding stores with every value, which can dominate in columns with very short strings. (Compression can probably negate this effect.)
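
To give a feel for the size of the gap, here is a minimal sketch in Python. It is not Impala code: the function and parameter names are hypothetical, and it only models the uncompressed per-value size, assuming each plain-encoded STRING value is stored as a 4-byte length followed by the bytes.

```python
# Parquet PLAIN encoding stores a 4-byte length before each BYTE_ARRAY value.
PLAIN_LENGTH_PREFIX_BYTES = 4


def plain_bytes_per_value(avg_size: float, include_length_prefix: bool = True) -> float:
    """Estimated uncompressed bytes per STRING value in a plain-encoded page.

    avg_size stands in for the column stat of the same name (hypothetical input).
    """
    if avg_size < 0:
        raise ValueError("avg_size must be non-negative")
    prefix = PLAIN_LENGTH_PREFIX_BYTES if include_length_prefix else 0
    return avg_size + prefix


def underestimation_factor(avg_size: float) -> float:
    """Ratio of the estimate with the length prefix to the estimate without it."""
    if avg_size <= 0:
        raise ValueError("avg_size must be positive to compute a ratio")
    return plain_bytes_per_value(avg_size) / plain_bytes_per_value(avg_size, False)
```

With avg_size = 2 the ratio is 3, while with avg_size = 100 it is only 1.04, so the missing length field matters mainly for very short strings.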

For the dictionary decoding estimate:
- the same 4 bytes/NDV should also be added, as the dictionary itself is plain encoded
- a further 12 bytes/NDV is used for the StringValues that serve as indirection in the dictionary, but I am not sure whether this should be added to the reservation
- a more pessimistic estimate would use max_size instead of avg_size for dictionary entries, as it is possible that the majority of distinct values are long while the short ones are much more frequent, which makes avg_size small
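
The three dictionary points above can be sketched as a single formula. This is a hypothetical Python illustration, not the planner's code: the function name and its flags are made up, and it assumes the baseline estimate is simply NDV * avg_size.

```python
LENGTH_PREFIX_BYTES = 4   # length field of each plain-encoded dictionary entry
STRING_VALUE_BYTES = 12   # per-entry StringValue size quoted in this issue


def dict_bytes(ndv, avg_size, max_size=None, include_string_values=False):
    """Estimated bytes for a STRING dictionary with ndv entries.

    If max_size is given, it replaces avg_size as the pessimistic entry size.
    include_string_values adds the 12 bytes/NDV indirection, which may or may
    not belong in the reservation.
    """
    entry_size = avg_size if max_size is None else max_size
    per_entry = entry_size + LENGTH_PREFIX_BYTES
    if include_string_values:
        per_entry += STRING_VALUE_BYTES
    return ndv * per_entry
```

For example, with NDV = 1000 and avg_size = 3 the assumed baseline is 3000 bytes; adding the length field gives 7000, switching to max_size = 50 gives 54000, and also counting the StringValues gives 66000.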

Another small underestimation is that NULL values are ignored. NULLs (i.e. def levels) could be added as 1 bit/value.
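
As a rough illustration of the 1 bit/value figure, here is a small Python sketch (the helper name is hypothetical). It assumes a flat nullable column whose def levels are bit-packed at 1 bit each; run-length encoding can make the real footprint smaller, and encoding headers are ignored.

```python
def def_level_bytes(num_values: int) -> int:
    """Bytes needed to hold 1 bit per value, rounded up to whole bytes."""
    if num_values < 0:
        raise ValueError("num_values must be non-negative")
    return (num_values + 7) // 8
```

For one million values this comes to 125000 bytes, which is small next to the string data unless the strings themselves are very short.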



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)