Posted to dev@parquet.apache.org by "Piyush Narang (JIRA)" <ji...@apache.org> on 2016/05/23 23:44:12 UTC

[jira] [Commented] (PARQUET-624) Value count used for memSize calculation in ColumnWriterV1 can be skewed based on first 100 values

    [ https://issues.apache.org/jira/browse/PARQUET-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297353#comment-15297353 ] 

Piyush Narang commented on PARQUET-624:
---------------------------------------

Also noticed that there is a way around this in the current version in master: the estimateNextSizeCheck property can be overridden. I didn't notice this earlier as we're on an older SHA due to PARQUET-400.
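For reference, a minimal sketch of how that override could be wired up from a job (assuming the parquet.page.size.check.estimate config key that ParquetOutputFormat exposes in current master):

{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Disable the estimated size check; per my reading of master, ColumnWriterV1
// then falls back to checking the page size at a fixed value-count interval
// instead of extrapolating from the first 100 values.
conf.setBoolean("parquet.page.size.check.estimate", false);
{code}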

> Value count used for memSize calculation in ColumnWriterV1 can be skewed based on first 100 values
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-624
>                 URL: https://issues.apache.org/jira/browse/PARQUET-624
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Piyush Narang
>            Assignee: Piyush Narang
>
> While digging into some OOMs we were seeing in some of our Parquet writer jobs, I noticed that we were writing out 250MB+ of data for a single column as one page. Our page size threshold is set to 1MB, so this should actually result in a few hundred pages instead of just one.
> This seems to be due to the code in [ColumnWriterV1.accountForValueWritten()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L93]. We only check whether we've crossed the memory threshold once valueCount exceeds valueCountForNextSizeCheck. However, valueCountForNextSizeCheck can end up skewed substantially if the memSize of the first 100 values of the column is very small:
> For example, I see this in one of our jobs:
> {code}
> [foo_column] valueCount: 101, memSize: 16, pageSizeThreshold: 1048576
> valueCountForNextSizeCheck = (int)(valueCount + ((float)valueCount * props.getPageSizeThreshold() / memSize)) / 2 + 1;
> [foo_column] valueCountForNextSizeCheck = 3309619
> {code}
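> Plugging the logged values back into that formula confirms the arithmetic (a standalone sketch, not Parquet code):
> {code}
> int valueCount = 101;
> long memSize = 16;
> int pageSizeThreshold = 1048576;
> // (int)(101 + 101f * 1048576 / 16) / 2 + 1 = 6619237 / 2 + 1 = 3309619
> int valueCountForNextSizeCheck =
>     (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1;
> {code}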
> This really large new valueCountForNextSizeCheck results in our job OOMing, as we start seeing more space-consuming values much earlier than the ~3.3M valueCount point.
> At this point, I'm thinking of doing something simple, similar to [InternalParquetRecordWriter.checkBlockSizeReached()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143]: cap the maximum value of valueCountForNextSizeCheck:
> {code}
> valueCountForNextSizeCheck =
>           Math.min(
>             (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1,
>             valueCount + MAX_COUNT_FOR_SIZE_CHECK // will not look more than max records ahead
>           );
> {code}
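> With the skewed numbers from the log above, and assuming MAX_COUNT_FOR_SIZE_CHECK were set to, say, 10000 (mirroring the record-count cap in InternalParquetRecordWriter), the next check would land at min(3309619, 101 + 10000) = 10101 values instead of ~3.3M.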
> Open to something more sophisticated if people would prefer that.


