Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/02/13 08:45:00 UTC

[jira] [Commented] (PARQUET-2242) record count for row group size check configurable

    [ https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687805#comment-17687805 ] 

ASF GitHub Bot commented on PARQUET-2242:
-----------------------------------------

xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104147184


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##########
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES    = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   [apache:parquet-1.10.x](https://github.com/apache/parquet-mr/tree/parquet-1.10.x) doesn't have a README.md in the parquet-hadoop directory, and I have now added a README.md to this branch.
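
A minimal, hypothetical usage sketch (assuming this PR's new key is merged): a writer job could lower the first row-group size check through the proposed property instead of relying on the hard-coded minimum of 100 records. The value 10 below is only an example.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class BlockCheckConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Key taken from the diff above; only meaningful once the PR is merged.
    conf.setLong("parquet.block.size.row.check.min", 10L);
  }
}
{code}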





> record count for row group size check configurable
> ---------------------------------------------------
>
>                 Key: PARQUET-2242
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2242
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: xjlem
>            Priority: Major
>
>  org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>  private void checkBlockSizeReached() throws IOException {
>     if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>       long memSize = columnStore.getBufferedSize();
>       long recordSize = memSize / recordCount;
>       // flush the row group if it is within ~2 records of the limit
>       // it is much better to be slightly under size than to be over at all
>       if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>         LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>         flushRowGroupToStore();
>         initStore();
>         recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>         this.lastRowGroupEndPos = parquetFileWriter.getPos();
>       } else {
>         recordCountForNextMemCheck = min(
>             max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>             recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>             );
>         LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>       }
>     }
>   } {code}
> In this code, if the target block size is small (for example 8 MB), and the first 100 records are small while the records written after them are large, the writer can produce an oversized row group; in our real workload it grew to more than 64 MB. So I think the record count used for the block size check should be configurable.
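
To make the failure mode concrete, here is a minimal, hypothetical illustration (not part of parquet-mr) of the scheduling formula quoted above. It uses the writer's default MINIMUM_RECORD_COUNT_FOR_CHECK of 100 and MAXIMUM_RECORD_COUNT_FOR_CHECK of 10,000; the 8 MB block size and per-record sizes are assumed for the example.

{code:java}
public class RowGroupCheckOvershoot {
  public static void main(String[] args) {
    long blockSize = 8L * 1024 * 1024;        // 8 MB target row group size (assumed)
    long minCheck = 100;                      // default MINIMUM_RECORD_COUNT_FOR_CHECK
    long maxCheck = 10000;                    // default MAXIMUM_RECORD_COUNT_FOR_CHECK

    // State at the first size check: 100 small records of ~100 bytes each.
    long recordCount = 100;
    long memSize = recordCount * 100;         // ~10 KB buffered
    long recordSize = memSize / recordCount;  // 100 bytes per record

    // Same scheduling formula as checkBlockSizeReached(): check again roughly
    // halfway to the projected fill, but never more than maxCheck records ahead.
    long next = Math.min(
        Math.max(minCheck, (recordCount + (long) (blockSize / (float) recordSize)) / 2),
        recordCount + maxCheck);
    System.out.println("next size check at record " + next);  // prints 10100

    // If the records written after this point are ~10 KB each, the buffer holds
    // roughly 10,000 * 10 KB = ~100 MB before the next check fires, far beyond
    // the 8 MB target; a configurable check interval would let small block sizes
    // check more often.
  }
}
{code}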



--
This message was sent by Atlassian Jira
(v8.20.10#820010)