Posted to dev@parquet.apache.org by Gabor Szadovszky <ga...@apache.org> on 2021/01/04 15:54:57 UTC

Re: Query on striping parquet files to maintain Row group alignment

Hi Jayjeet,

I assume you are using parquet-mr (and not other parquet implementations
like parquet-cpp, Impala etc.).

I am not sure I understood your request correctly. You can configure the
size of the row group by setting the config parquet.block.size. You may also
want to check parquet.writer.max-padding, which limits how much padding the
writer may add so that the row groups align with the file system blocks.
See details about the available configs at
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md.
Currently, parquet-mr does not have the functionality to automatically
close a parquet file and start a new one during writing.
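
A minimal sketch of the two settings mentioned above, in plain Java. The
class RowGroupConfigSketch and its settings() method are hypothetical and
not part of parquet-mr; only the two config key names come from the README.
The 128 MB and 8 MB values are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of parquet-mr: it only collects the two
// config keys named above into a map of key -> value strings.
final class RowGroupConfigSketch {

    static Map<String, String> settings(long rowGroupBytes, int maxPaddingBytes) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Target row group size in bytes.
        conf.put("parquet.block.size", Long.toString(rowGroupBytes));
        // Upper bound in bytes on the padding used to align row groups.
        conf.put("parquet.writer.max-padding", Integer.toString(maxPaddingBytes));
        return conf;
    }

    public static void main(String[] args) {
        // Illustrative values: 128 MB row groups, up to 8 MB of padding.
        settings(128L * 1024 * 1024, 8 * 1024 * 1024)
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real job, each entry would presumably be copied onto the Hadoop
Configuration passed to the writer, for example with conf.set(key, value).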

Regards,
Gabor

On Thu, Dec 31, 2020 at 5:58 AM Jayjeet Chakraborty <
jayjeetchakraborty25@gmail.com> wrote:

> Hi all,
>
> I am trying to figure out if a large Parquet file can be striped across
> multiple small files based on a row group chunk size, where each stripe
> would naturally end up containing data pages from a single row group. So,
> if I tell my writer to "write a parquet file in chunks of 128 MB" (assuming
> my row groups are around 128 MB), each of my chunks ends up being a
> self-contained row group, except maybe the last chunk, which has the footer
> contents. Is this possible? Can we fix the row group size (the amount of
> disk space a row group uses) while writing parquet files? Thanks a lot.
>