Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/06/30 21:28:05 UTC

[jira] [Created] (PARQUET-321) Set the HDFS padding default to 8MB

Ryan Blue created PARQUET-321:
---------------------------------

             Summary: Set the HDFS padding default to 8MB
                 Key: PARQUET-321
                 URL: https://issues.apache.org/jira/browse/PARQUET-321
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Ryan Blue
            Assignee: Ryan Blue
             Fix For: 1.8.0


PARQUET-306 added the ability to pad row groups so that they align with HDFS blocks to avoid remote reads. The ParquetFileWriter will now either pad the remaining space in the block or target a row group for the remaining size.

The padding maximum is the largest amount of padding the writer will add. If the space left in the block is under this threshold, it is padded. If it is greater than this threshold, then the next row group is fit into the remaining space. The current padding maximum is 0, so no padding is ever added.
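
To make the rule above concrete, here is a minimal sketch of the decision in Python. It is not the ParquetFileWriter code: the function name, its parameters, and the treatment of the case where the remaining space exactly equals the threshold (padded here) are assumptions for illustration only.

```python
MB = 1024 * 1024

def plan_next_row_group(position, block_size, row_group_size, max_padding):
    """Hypothetical helper: decide how to start the next row group.

    Returns (padding_bytes, target_row_group_bytes).
    """
    # Bytes left before the next HDFS block boundary.
    remaining = block_size - (position % block_size)
    if remaining == block_size:
        # Already aligned to a block boundary: nothing to pad.
        return 0, row_group_size
    if remaining <= max_padding:
        # Small gap: pad it out and start a full-size row group
        # in the next block. (Assumption: equality counts as "pad".)
        return remaining, row_group_size
    # Large gap: target a row group that fits in the remaining space.
    return 0, min(remaining, row_group_size)
```

With a padding maximum of 0 the second branch is never taken, so a 6MB gap becomes a 6MB row group target. With a padding maximum of 8MB the same gap is padded instead.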

I think we should change the padding maximum to 8MB. My reasoning is this: we want this number to be small enough that it won't prevent the library from writing reasonable row groups, but larger than the minimum size row group we would want to write. 8MB is 1/16th of the 128MB row group default, so I think it is reasonable: we don't want a row group to be smaller than 8MB.

We also want this to be large enough that a few under-size row groups in a block don't cause a tiny row group to be written in the excess space. 8MB accounts for 4 row groups that are each 2MB under-size. In addition, it is reasonable to not allow row groups under 8MB.
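
The arithmetic behind the 8MB figure can be checked with a short sketch. The 512MB HDFS block size is a hypothetical value chosen so that exactly four default-size row groups fit in one block; it is not stated in this issue.

```python
MB = 1024 * 1024

row_group_default = 128 * MB             # row group default size
max_padding = row_group_default // 16    # proposed padding maximum

# Hypothetical block that holds four default-size row groups.
row_groups_per_block = 4
block_size = row_groups_per_block * row_group_default

# Each row group comes in 2MB under its target size.
undersize = 2 * MB
written = row_groups_per_block * (row_group_default - undersize)
leftover = block_size - written

print(max_padding // MB)        # proposed threshold in MB
print(leftover // MB)           # space left at the end of the block in MB
print(leftover <= max_padding)  # True means the gap is padded, not filled
```

Under these assumptions the leftover space equals the proposed threshold, so it would be padded rather than filled with a tiny fifth row group.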



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)