Posted to dev@parquet.apache.org by Tim Armstrong <ta...@cloudera.com.INVALID> on 2021/01/06 01:24:02 UTC

Re: Query on striping parquet files while maintaining Row group alignment

Thanks for the explanation, that makes more sense now. I guess you'd still
have to read the Parquet footer outside of the storage nodes and then send
the relevant info from the footer to the storage nodes, right? I guess the
footer doesn't need to be in the same block as the row groups it describes.
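
Something like the following (a rough, untested sketch against the Arrow C++
metadata API; the function name is made up) is what I mean by reading the
footer and shipping the relevant offsets to the storage nodes:

    #include <iostream>
    #include <memory>
    #include <string>

    #include <parquet/file_reader.h>
    #include <parquet/metadata.h>

    // Parse just the footer metadata and print per-row-group column chunk
    // offsets - roughly the info a coordinator would send to storage nodes.
    void DumpColumnChunkOffsets(const std::string& path) {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(path);
      std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
      for (int rg = 0; rg < md->num_row_groups(); ++rg) {
        std::unique_ptr<parquet::RowGroupMetaData> rg_md = md->RowGroup(rg);
        for (int col = 0; col < rg_md->num_columns(); ++col) {
          std::unique_ptr<parquet::ColumnChunkMetaData> cc =
              rg_md->ColumnChunk(col);
          std::cout << "row group " << rg << ", column " << col
                    << ": starts at file offset " << cc->data_page_offset()
                    << ", " << cc->total_compressed_size()
                    << " compressed bytes\n";
        }
      }
    }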

So the following options are maybe possible, but would each have different
challenges?

   1. Single Parquet file, multiple row groups, with each row group in its
   own block. It's tricky to know for sure that row groups are aligned to
   blocks in this case.
   2. Multiple separate parquet files (see the sketch after this list).
   Maybe not well-supported by some libraries? Maybe some downside to
   having extra footers.
   3. Split parquet file into a file per row group and a separate footer.
   Not part of the Parquet standard.
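
For option 2, something like this (a rough, untested sketch using the Arrow
C++ API; the helper name, the file naming and the rows-per-file parameter
are just made up for illustration - note the slicing is by row count, not
bytes) is what I have in mind:

    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    // Write each slice of `rows_per_file` rows as its own Parquet file,
    // so every file contains one row group plus its own footer.
    arrow::Status WriteOneFilePerRowGroup(
        const std::shared_ptr<arrow::Table>& table, int64_t rows_per_file) {
      for (int64_t start = 0, i = 0; start < table->num_rows();
           start += rows_per_file, ++i) {
        std::shared_ptr<arrow::Table> slice = table->Slice(start, rows_per_file);
        ARROW_ASSIGN_OR_RAISE(
            auto sink, arrow::io::FileOutputStream::Open(
                           "part-" + std::to_string(i) + ".parquet"));
        // chunk_size >= rows in the slice, so only one row group is written.
        ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
            *slice, arrow::default_memory_pool(), sink,
            /*chunk_size=*/rows_per_file));
        ARROW_RETURN_NOT_OK(sink->Close());
      }
      return arrow::Status::OK();
    }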


Impala does a somewhat similar thing with its custom Parquet writer - it
writes one Parquet file per HDFS block so that the whole file can be read
locally (with a multi-block Parquet file you have to do a remote read for
the footer). That's best-effort IIRC though - a file could overflow into
the next block. So we have some experience with doing this local reading of
Parquet files, but if the layout of the files is non-optimal we'll fall
back to doing reads from remote nodes.

Part of the problem is that you don't know exactly how big the row group
will be until you've encoded all of the data, and you also don't know how
big the footer will be until you've encoded it. I'm not sure whether any of
the Parquet writers in parquet-mr or parquet-cpp are able to guarantee that
the row group is smaller than the block size.
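
One possible workaround - not something either writer does for you, as far
as I know - is to encode each row-group-sized slice into an in-memory buffer
first, so the exact encoded size (footer included) is known before you
decide how to pad or place it. A rough, untested sketch with the Arrow C++
API (the helper name is made up):

    #include <memory>

    #include <arrow/api.h>
    #include <arrow/io/memory.h>
    #include <parquet/arrow/writer.h>

    // Encode one slice as a complete Parquet file in memory (a single row
    // group for slices below the writer's row-count limits). The returned
    // buffer's size is the exact on-disk size, so the caller can compute the
    // padding needed to reach a fixed block/object size before storing it.
    arrow::Result<std::shared_ptr<arrow::Buffer>> EncodeSliceToBuffer(
        const std::shared_ptr<arrow::Table>& slice) {
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
          *slice, arrow::default_memory_pool(), sink,
          /*chunk_size=*/slice->num_rows()));
      return sink->Finish();
    }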

On Thu, Dec 31, 2020 at 3:36 AM Jayjeet Chakraborty <
jayjeetchakraborty25@gmail.com> wrote:

> Hi Jason and Tim,
>
> Thanks for the detailed response.
> The reason to split a large parquet file into small chunks is to be able
> to store them in Ceph as RADOS (Ceph's distributed object backend)
> objects, where each object can't be larger than a few tens of MBs. The
> reason to split the large parquet file in such a way that each small
> chunk is self-contained in terms of full row groups is that we want to
> push down filter and projection operations to the storage nodes, and
> inside a storage node context we can't read across objects (two row-group
> objects can be present on different storage nodes), which is otherwise
> possible when applying projections and filters on the client side using a
> filesystem abstraction. Now with row group objects, I can utilize the
> statistics in the footer metadata to map the column chunk offsets to the
> RADOS objects that have the row groups containing those column chunks,
> and read column chunks from the objects in the storage device memory by
> converting the file-wide column chunk offsets to object/row-group-wide
> offsets, thus maintaining the parquet optimizations. I hope that gives
> you a brief background regarding my issue.
>
> Also, since I am working in a Ceph environment (outside of the Hadoop
> environment), the `parquet.block.size` parameter doesn't apply to me. So
> I was wondering whether the `parquet::ParquetFileWriter` API in the Arrow
> codebase already allows specifying a block size, so as to write padded
> fixed-size row groups matching a given block size while writing a parquet
> file, which can then be easily chunked using a Linux utility like
> `split`, for example. Or do I have to implement a custom `ParquetWriter`,
> similar to what is present in `parquet-hadoop`, to do the chunking and
> padding? If I could end up having such an API, I could split a large
> parquet file into well-aligned fixed-size objects containing single row
> groups (analogous to blocks in HDFS), store them in the Ceph object
> store, and basically replicate the Hadoop + HDFS scenario on the CephFS +
> RADOS stack, but with the added capability to push down filters and
> projections to the storage layer.
>
> On Thu, Dec 31, 2020 at 8:28 AM Tim Armstrong
> <ta...@cloudera.com.invalid> wrote:
>
> > It seems like you would be best off writing out N separate parquet
> > files of the desired size. That seems better than having N files with
> > one row group each and a shared footer that you have to stitch together
> > to read. I guess there would be a small amount of redundancy between
> > footer contents, but that wouldn't count for much in the scheme of
> > things. If you have partial parquet files without a footer, you lose
> > the self-describing/self-contained nature of Parquet files, like Jason
> > said.
> >
> > I guess I'm not sure if parquet-mr or whatever you're using to write
> > parquet has an option to start a new file at each row group boundary,
> > but that seems like it would probably solve your problem.
> >
> > On Wed, Dec 30, 2020 at 1:09 PM Jason Altekruse <altekrusejason@gmail.com>
> > wrote:
> >
> > > Hi Jayjeet,
> > >
> > > Is there a particular reason that you need to spread out data into
> > > multiple small files? On HDFS at least there are longstanding
> > > scalability issues with having lots of smaller files around, so there
> > > generally is a move to concatenate smaller files together. Even with
> > > larger files, the various common querying mechanisms (Spark,
> > > MapReduce, Hive, Impala, etc.) will all allow parallelizing reads by
> > > blocks, which when configured properly should correspond to parquet
> > > row groups.
> > >
> > > The size of a row group is fixed by the setting of
> > > parquet.block.size. You mentioned alignment, and pretty early on a
> > > padding feature was added to parquet to ensure that row groups would
> > > try to end on the true HDFS block boundaries, to avoid the need to
> > > read across blocks when accessing a row group (because row groups
> > > have to contain full rows, it is unlikely you will end with exactly
> > > the right number of bytes in the row group to match the end of the
> > > HDFS block).
> > >
> > > https://github.com/apache/parquet-mr/pull/211
> > >
> > > So to your specific proposal, it currently isn't possible to detach
> > > the footer from the file that contains the actual data in the row
> > > groups, but I think that is a good property: it means everything
> > > needed to read that data is fully contained in one file that can be
> > > moved/renamed safely.
> > >
> > > There are some systems that elect to write only a single row group
> > > per file, because HDFS doesn't allow rewriting data in place. Doing
> > > this lets use cases where individual rows need to be deleted or
> > > updated be handled by re-writing smaller files, instead of needing to
> > > read in and write back out a large file containing many row groups
> > > when only a single row group's data has changed.
> > >
> > > - Jason
> > >
> > >
> > > On Wed, Dec 30, 2020 at 2:33 PM Jayjeet Chakraborty <
> > > jayjeetchakraborty25@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am trying to figure out whether a large Parquet file can be
> > > > striped across multiple small files based on a row group chunk
> > > > size, where each stripe would naturally end up containing the data
> > > > pages of a single row group. So if I tell my writer "write a
> > > > parquet file in chunks of 128 MB" (assuming my row groups are
> > > > around 128 MB), each of my chunks ends up being a self-contained
> > > > row group, maybe except the last chunk, which has the footer
> > > > contents. Is this possible? Can we fix the row group size (the
> > > > amount of disk space a row group uses) while writing parquet
> > > > files? Thanks a lot.
> > > >
> > >
> >
>
>
> --
> *Jayjeet Chakraborty*
> 4th Year, B.Tech, Undergraduate
> Department Of Computer Sc. And Engineering
> National Institute Of Technology, Durgapur
> PIN: 713205, West Bengal, India
> M: (+91) 8436500886
>