Posted to dev@parquet.apache.org by Jayjeet Chakraborty <ja...@gmail.com> on 2020/12/30 20:33:28 UTC

Query on striping parquet files while maintaining Row group alignment

Hi all,

I am trying to figure out if a large Parquet file can be striped across multiple small files based on a row group chunk size, where each stripe would naturally end up containing the data pages of a single row group. So, if I tell my writer "write a parquet file in chunks of 128 MB" (assuming my row groups are around 128 MB each), each of my chunks ends up being a self-contained row group, except maybe the last chunk, which holds the footer contents. Is this possible? Can we fix the row group size (the amount of disk space a row group uses) while writing parquet files? Thanks a lot.
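
For reference, a rough pyarrow sketch of the knob that exists today: `row_group_size` is specified in rows rather than bytes, so a byte target like 128 MB can only be approximated (the file name and numbers below are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy table standing in for the real data.
    table = pa.table({"x": list(range(100_000))})

    # row_group_size caps the number of rows per row group, not bytes,
    # so an exact on-disk size such as 128 MB is not guaranteed.
    pq.write_table(table, "data.parquet", row_group_size=10_000)

    # Inspect what was actually written by reading only the footer.
    md = pq.ParquetFile("data.parquet").metadata
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(i, rg.num_rows, rg.total_byte_size)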

Re: Query on striping parquet files while maintaining Row group alignment

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
Thanks for the explanation, that makes more sense. I guess you'd still
have to read the parquet footer outside of the storage nodes and then send
the relevant info from the footer to the storage nodes, right? I guess the
footer doesn't need to be in the same block as the row groups it describes.

So the following options are maybe possible, but each would come with
different challenges:

   1. Single Parquet file, multiple row groups, with each row group in its
   own block. It's tricky to know for sure that row groups are aligned to
   blocks in this case.
   2. Multiple separate parquet files. Maybe not well-supported by some
   libraries? Maybe some downside to having extra footers.
   3. Split parquet file into a file per row group and a separate footer.
   Not part of the Parquet standard.


Impala does a somewhat similar thing with its custom parquet writer - it
writes one Parquet file per HDFS block so that the whole file can be read
locally (since with a multi-block parquet file you have to do a remote read
for the footer). That's best-effort IIRC though - a file can still overflow
into the next block. So we have some experience with this kind of local
reading of parquet files, but if the layout of the files is non-optimal we
fall back to doing reads from remote nodes.

Part of the problem is that you don't know exactly how big the row group
will be until you've encoded all of the data, and you also don't know how
big the footer will be until you've encoded it. I'm not sure if some of the
parquet writers in parquet-mr or parquet-cpp are able to guarantee that the
row group is smaller than the block size.
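
One thing that can be done is to verify this after the fact from the footer
metadata alone; a small pyarrow sketch, with the path and block size as
placeholder assumptions:

    import pyarrow.parquet as pq

    def row_groups_fit(path, block_size=128 * 1024 * 1024):
        """Report whether every row group's compressed footprint fits
        within the block size; nothing in the writer guarantees this,
        so it has to be checked from the footer after writing."""
        md = pq.ParquetFile(path).metadata
        ok = True
        for i in range(md.num_row_groups):
            rg = md.row_group(i)
            compressed = sum(rg.column(j).total_compressed_size
                             for j in range(rg.num_columns))
            print(f"row group {i}: {rg.num_rows} rows, "
                  f"{compressed} bytes compressed")
            ok = ok and compressed <= block_size
        return ok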


Re: Query on striping parquet files while maintaining Row group alignment

Posted by Jayjeet Chakraborty <ja...@gmail.com>.
Hi Jason and Tim,

Thanks for the detailed response.
            The reason to split a large parquet file into small chunks is
to be able to store them in Ceph as RADOS (the distributed object store
underneath Ceph) objects, where each object can't be larger than a few tens
of MBs. Next, the reason to split the large parquet file in such a way that
each small chunk is self-contained in terms of full row groups is that we
want to push down filter and projection operations to the storage nodes,
and inside a storage node we can't read across objects (as two row-group
objects can live on different storage nodes), which is otherwise possible
when applying projections and filters on the client side through a
filesystem abstraction. Now, with row-group objects, I can use the footer
metadata to map column chunk offsets to the RADOS objects that hold the row
groups containing those column chunks, and read the column chunks in the
storage node's memory by converting the file-wide column chunk offsets to
object/row-group-relative offsets, thus keeping the parquet optimizations.
I hope that gives you a brief background on my issue.
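
To make the offset translation concrete, a rough sketch of what that
mapping could look like using pyarrow's footer metadata; the fixed object
size and the assumption that the file was cut into equal-sized objects are
illustrative, not something parquet or RADOS provides:

    import pyarrow.parquet as pq

    OBJECT_SIZE = 4 * 1024 * 1024  # assumed fixed RADOS object size

    def locate_column_chunk(md, row_group, column, object_size=OBJECT_SIZE):
        """Translate a column chunk's file-wide start offset into
        (object index, offset within that object), assuming the file
        was split into fixed-size objects (e.g. with `split`)."""
        col = md.row_group(row_group).column(column)
        start = col.data_page_offset
        if col.has_dictionary_page:
            # The dictionary page precedes the data pages when present.
            start = min(start, col.dictionary_page_offset)
        return start // object_size, start % object_size

    # Usage:
    # md = pq.ParquetFile("big.parquet").metadata
    # obj_idx, obj_off = locate_column_chunk(md, row_group=0, column=2)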

             Also, since I am working in a Ceph environment (outside of the
Hadoop environment), the `parquet.block.size` parameter doesn't apply to
me. So I was wondering whether the `parquet::ParquetFileWriter` API in the
arrow codebase already allows specifying a block size, so that it writes
padded, fixed-size row groups matching that block size, and the resulting
parquet file can then be easily chunked using a Linux utility like `split`.
Or do I have to implement a custom `ParquetWriter`, similar to what is
present in `parquet-hadoop`, to do the chunking and padding? If I had such
an API, I could split a large parquet file into well-aligned, fixed-size
objects each containing a single row group (analogous to blocks in HDFS),
store them in the Ceph object store, and basically replicate the Hadoop +
HDFS scenario on the CephFS + RADOS stack, but with the added capability of
pushing down filters and projections to the storage layer.

-- 
*Jayjeet Chakraborty*
4th Year, B.Tech, Undergraduate
Department Of Computer Sc. And Engineering
National Institute Of Technology, Durgapur
PIN: 713205, West Bengal, India
M: (+91) 8436500886

Re: Query on striping parquet files while maintaining Row group alignment

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
It seems like you would be best off writing out N separate parquet files of
the desired size. That seems better than having N files with one row group
each and a shared footer that you have to stitch together to read. I guess
there would be a small amount of redundancy between footer contents, but
that wouldn't count for much in the scheme of things. If you have partial
parquet files without a footer, you lose the self-describing/self-contained
nature of Parquet files, like Jason said.

I guess I'm not sure if parquet-mr or whatever you're using to write
parquet has an option to start a new file at each row group boundary, but
that seems like it would probably solve your problem.
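
As a sketch of that first suggestion (with the usual caveat that sizes are
only known after encoding), one could buffer record batches and start a new
file, with its own footer, each time the buffer crosses a target size; the
names and the 64 MB target below are illustrative, not an existing parquet
option:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_one_file_per_chunk(batches, schema, prefix,
                                 target_bytes=64 * 1024 * 1024):
        """Flush the buffer to its own parquet file (roughly one row
        group plus its own footer per file) whenever the in-memory size
        crosses the target. On-disk sizes will differ because of
        encoding and compression."""
        buffered, buffered_bytes, index = [], 0, 0

        def flush():
            nonlocal buffered, buffered_bytes, index
            if buffered:
                table = pa.Table.from_batches(buffered, schema=schema)
                pq.write_table(table, f"{prefix}-{index:05d}.parquet")
                buffered, buffered_bytes, index = [], 0, index + 1

        for batch in batches:
            buffered.append(batch)
            buffered_bytes += batch.nbytes
            if buffered_bytes >= target_bytes:
                flush()
        flush()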






Re: Query on striping parquet files while maintaining Row group alignment

Posted by Jason Altekruse <al...@gmail.com>.
Hi Jayjeet,

Is there a particular reason that you need to spread out data into
multiple small files? On HDFS at least there are longstanding scalability
issues with having lots of smaller files around, so there is generally a
push to concatenate smaller files together. Even with larger files, the
common query engines (Spark, MapReduce, Hive, Impala, etc.) all allow
parallelizing reads by block, which, when configured properly, should
correspond to parquet row groups.

The size of a row group is controlled by the parquet.block.size setting.
You mentioned alignment, and pretty early on a padding feature was added to
parquet to ensure that row groups try to end on the true HDFS block
boundaries, to avoid the need to read across blocks when accessing a row
group (because row groups have to contain full rows, it is unlikely you
will end with exactly the right number of bytes in the row group to match
the end of the HDFS block).

https://github.com/apache/parquet-mr/pull/211
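
The padding idea itself is simple; a rough sketch of the concept (not the
parquet-mr implementation), padding the sink out to the next block boundary
before starting the next row group, since readers jump to the offsets
recorded in the footer and ignore the bytes in between:

    def pad_to_block_boundary(sink, block_size=128 * 1024 * 1024,
                              max_padding=8 * 1024 * 1024):
        """Write zero bytes until the sink reaches the next block
        boundary, but only if the gap is small enough to be worth it;
        otherwise let the next row group straddle the boundary.
        Returns the number of padding bytes written."""
        remainder = sink.tell() % block_size
        if remainder == 0:
            return 0
        gap = block_size - remainder
        if gap > max_padding:
            return 0
        sink.write(b"\x00" * gap)
        return gap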

So to your specific proposal, it currently isn't possible to detach the
footer from the file that contains the actual data in the row groups, but I
think that is a good property: it means everything needed to read that data
is fully contained in one file that can be moved/renamed safely.

There are some systems that elect to write only a single row group per
file, because HDFS doesn't allow rewriting data in place. Doing this lets
use cases where individual rows need to be deleted or updated be handled by
re-writing small files, instead of needing to read in and write back out a
large file containing many row groups when only a single row group's data
has changed.

- Jason

