Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/10/07 22:25:00 UTC

[jira] [Commented] (PARQUET-1920) Fix issue with reading parquet files with too large column chunks

    [ https://issues.apache.org/jira/browse/PARQUET-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209902#comment-17209902 ] 

ASF GitHub Bot commented on PARQUET-1920:
-----------------------------------------

SinghAsDev opened a new pull request #824:
URL: https://github.com/apache/parquet-mr/pull/824


   Fix the Parquet writer's memory check interval calculation and throw a helpful message when a column chunk grows too large.
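   
   For reference, here is a minimal sketch of the kind of adaptive, clamped check-interval computation involved (class, field, and parameter names are illustrative assumptions, not the actual parquet-mr code):
   
   ```java
   // Sketch: estimate how many more records fit before the target block size is
   // reached, then clamp the result so skewed data cannot defer the next check
   // too far. Illustrative only; not the real parquet-mr implementation.
   final class MemCheckInterval {
     private final long minRecordsBetweenChecks; // lower clamp (configurable)
     private final long maxRecordsBetweenChecks; // upper clamp (configurable)

     MemCheckInterval(long min, long max) {
       this.minRecordsBetweenChecks = min;
       this.maxRecordsBetweenChecks = max;
     }

     /** Number of records to write before the next in-memory size check. */
     long nextCheck(long recordsWritten, long bufferedBytes, long targetBlockBytes) {
       float avgBytesPerRecord = (float) bufferedBytes / recordsWritten;
       long recordsLeft = (long) ((targetBlockBytes - bufferedBytes) / avgBytesPerRecord);
       // Re-check after half the estimated remaining records, within the clamps.
       return Math.max(minRecordsBetweenChecks,
           Math.min(maxRecordsBetweenChecks, recordsLeft / 2));
     }
   }
   ```
   
   Without effective clamping, a skewed record-size estimate can push the next check far past the point where the buffered column chunk exceeds the intended size.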
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses [PARQUET-1920](https://issues.apache.org/jira/browse/PARQUET-1920).
     - https://issues.apache.org/jira/browse/PARQUET-1920
   
   ### Tests
   
   - [ ] My PR does not necessarily need a dedicated test, as it removes hard-coded values in favor of configs. However, I can add tests if requested.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Fix issue with reading parquet files with too large column chunks
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1920
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1920
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.10.0, 1.11.0, 1.10.1, 1.12.0, 1.11.1
>            Reporter: Ashish Singh
>            Assignee: Ashish Singh
>            Priority: Major
>
> Fix the Parquet writer's memory check when writing highly skewed data.
> Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in memory. It is similar to {{ByteArrayOutputStream}}, but it avoids copying the entire contents while growing: instead of resizing a single array, it allocates and maintains multiple arrays (slabs). Slab sizes grow exponentially until they near the configurable max capacity hint, after which they grow very slowly (see the sketch below). Combined with Parquet's logic for deciding when to check whether enough data is buffered to flush to disk, this makes it possible for a highly skewed dataset to cause Parquet to write a very large column chunk, and hence row group, beyond the maximum expected row group size (an int).
> In Parquet 1.10, a change was made to make the page size row check frequency configurable. However, a bug in the implementation prevents these configs from taking effect in the memory check calculation.
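>
> A minimal sketch of the slab-based growth pattern described above (illustrative only; names and the exact growth step are assumptions, not the real {{CapacityByteArrayOutputStream}} implementation):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // Toy slab buffer: grows by allocating new slabs instead of copying,
> // doubling the slab size until the max capacity hint is neared, then
> // growing by a small fixed step per slab.
> final class SlabBuffer {
>   private final List<byte[]> slabs = new ArrayList<>();
>   private final int maxCapacityHint;
>   private int nextSlabSize;
>   private int used;   // bytes written into the current slab
>   private long total; // bytes written overall
>
>   SlabBuffer(int initialSlabSize, int maxCapacityHint) {
>     this.nextSlabSize = initialSlabSize;
>     this.maxCapacityHint = maxCapacityHint;
>     slabs.add(new byte[initialSlabSize]);
>   }
>
>   void write(byte b) {
>     byte[] current = slabs.get(slabs.size() - 1);
>     if (used == current.length) {
>       // Exponential growth until near the hint, then very slow growth.
>       nextSlabSize = (total < maxCapacityHint)
>           ? Math.min(nextSlabSize * 2, maxCapacityHint)
>           : Math.max(1, maxCapacityHint / 5); // arbitrary slow step
>       slabs.add(new byte[nextSlabSize]);
>       current = slabs.get(slabs.size() - 1);
>       used = 0;
>     }
>     current[used++] = b;
>     total++;
>   }
> }
> {code}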



--
This message was sent by Atlassian Jira
(v8.3.4#803005)