Posted to dev@parquet.apache.org by "Gang Wu (Jira)" <ji...@apache.org> on 2023/03/26 05:45:00 UTC

[jira] [Resolved] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

     [ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Wu resolved PARQUET-2164.
------------------------------
    Fix Version/s:     (was: 1.12.3)
       Resolution: Fixed

> CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2164
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2164
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Parth Chandra
>            Priority: Major
>         Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case but it has been observed in a real world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens - 
> 1. After many small records possibly including nulls, the dictionary page fills up and subsequent pages are written using plain encoding
> 2. The estimate of when to perform the page size check is based on the number of values observed per page so far. Let's say this is about 100K
> 3. A sequence of very large records shows up. Let's say each of these records is roughly 200 KB.
> 4. After 11K of these records the size of the page has gone up beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of data but also it holds the size of the data in an int which overflows.
> There are a couple of things to fix here -
> 1. The check for page size should check both the number of values added as well as the buffered size of the data
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly that).
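
The overflow described above can be sketched in isolation. This is an illustrative example, not the actual parquet-mr code: it shows how accumulating ~11K records of ~200 KB each in a plain int silently wraps negative past Integer.MAX_VALUE (~2 GiB), and how a guarded accumulator (here using Math.addExact) fails fast instead, in the spirit of fix #2 above.

```java
// Sketch of PARQUET-2164: an int byte counter overflowing while buffering.
// Names and sizes here are illustrative, not taken from parquet-mr.
public class OverflowSketch {
    public static void main(String[] args) {
        final int recordBytes = 200 * 1024; // ~200 KB per record

        // Unchecked accumulation: 11,000 * 204,800 = 2,252,800,000 bytes,
        // which exceeds Integer.MAX_VALUE (2,147,483,647) and wraps negative.
        int size = 0;
        for (int i = 0; i < 11_000; i++) {
            size += recordBytes;
        }
        System.out.println("unchecked size: " + size); // a negative row group size

        // Guarded accumulation: Math.addExact throws ArithmeticException on
        // int overflow, so the writer would fail loudly instead of emitting
        // a corrupt (negative) size into the file footer.
        try {
            int checked = 0;
            for (int i = 0; i < 11_000; i++) {
                checked = Math.addExact(checked, recordBytes);
            }
            System.out.println("checked size: " + checked);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

Throwing on overflow mirrors the behavior the reporter points to in java.io.ByteArrayOutputStream, which refuses to grow its buffer past the maximum array size rather than wrapping its count.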



--
This message was sent by Atlassian Jira
(v8.20.10#820010)