Posted to dev@parquet.apache.org by "Ashish Singh (Jira)" <ji...@apache.org> on 2020/10/07 22:14:00 UTC

[jira] [Created] (PARQUET-1920) Fix issue with reading parquet files with too large column chunks

Ashish Singh created PARQUET-1920:
-------------------------------------

             Summary: Fix issue with reading parquet files with too large column chunks
                 Key: PARQUET-1920
                 URL: https://issues.apache.org/jira/browse/PARQUET-1920
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.10.1, 1.11.0, 1.10.0, 1.12.0, 1.11.1
            Reporter: Ashish Singh
            Assignee: Ashish Singh


Fix Parquet writer's memory check while writing highly skewed data.

Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in memory. It is similar to {{ByteArrayOutputStream}}, but it avoids copying the entire buffered data while growing: instead, it allocates and maintains a list of separate arrays (slabs). Slab sizes grow exponentially until the total size nears the configurable max capacity hint, after which they grow very slowly. Combined with Parquet's logic for deciding when to check whether enough data has been buffered in memory to flush to disk, this makes it possible for a highly skewed dataset to cause Parquet to write a very large column chunk, and therefore row group, beyond the maximum expected row group size (which is tracked as an int).
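
For context, here is a minimal sketch of the slab-based growth described above; it is not the actual {{CapacityByteArrayOutputStream}} code, and the class, field, and constant names are made up for illustration. New slabs are allocated instead of copying already-written bytes, slab sizes roughly double until the total nears the capacity hint, and after that growth slows to small increments (which is why the hint only bounds growth loosely, not absolutely).

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Illustrative slab-based buffer (not the real parquet-mr implementation). */
public class SlabBufferSketch {
  private final int maxCapacityHint;            // hypothetical capacity hint
  private final List<byte[]> slabs = new ArrayList<>();
  private byte[] currentSlab;
  private int currentSlabPos;
  private int totalBytes;
  private int nextSlabSize;

  public SlabBufferSketch(int initialSlabSize, int maxCapacityHint) {
    this.maxCapacityHint = maxCapacityHint;
    this.nextSlabSize = initialSlabSize;
    addSlab();
  }

  /** Allocate the next slab; previously written bytes are never copied. */
  private void addSlab() {
    currentSlab = new byte[nextSlabSize];
    currentSlabPos = 0;
    slabs.add(currentSlab);
    if (totalBytes < maxCapacityHint) {
      // Exponential phase: roughly double each slab until the hint is reached.
      nextSlabSize = Math.min(nextSlabSize * 2, maxCapacityHint);
    } else {
      // Past the hint: keep growing, but only in small fixed increments.
      nextSlabSize = Math.max(maxCapacityHint / 5, 1);
    }
  }

  public void write(byte b) {
    if (currentSlabPos == currentSlab.length) {
      addSlab();
    }
    currentSlab[currentSlabPos++] = b;
    totalBytes++;
  }

  public static void main(String[] args) {
    SlabBufferSketch buf = new SlabBufferSketch(1024, 1 << 20); // 1 MiB hint
    for (int i = 0; i < (1 << 21); i++) {
      buf.write((byte) i); // writes past the hint still succeed, growth just slows
    }
    System.out.println("bytes written: " + buf.totalBytes + ", slabs: " + buf.slabs.size());
  }
}
{code}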

In Parquet 1.10, a change was made to make the page size row check frequency configurable. However, a bug in the implementation prevents these configurations from being applied to the memory check calculation.
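
For reference, a hedged example of the knobs referred to here, assuming the parquet.page.size.row.check.min / parquet.page.size.row.check.max properties and the corresponding {{ParquetProperties}} builder methods introduced around 1.10; the exact names should be verified against the parquet-mr version in use.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.ParquetProperties;

public class PageSizeCheckConfigExample {
  public static void main(String[] args) {
    // Hadoop-style properties, e.g. when writing through ParquetOutputFormat.
    Configuration conf = new Configuration();
    conf.setInt("parquet.page.size.row.check.min", 100);    // earliest size check, in rows
    conf.setInt("parquet.page.size.row.check.max", 10_000); // latest size check, in rows

    // The same knobs on the ParquetProperties builder used by ParquetWriter.
    ParquetProperties props = ParquetProperties.builder()
        .withMinRowCountForPageSizeCheck(100)
        .withMaxRowCountForPageSizeCheck(10_000)
        .build();
    System.out.println("configured page size checks between every 100 and 10,000 rows: " + props);
  }
}
{code}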


