Posted to dev@parquet.apache.org by Zoltan Borok-Nagy <bo...@cloudera.com.INVALID> on 2018/07/12 17:12:28 UTC

page and record boundaries

Hi everyone,

Currently I am working on the implementation of the Parquet page index for
Impala.
(design doc is here if you are interested:
https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
)

During our discussions it came up that DataPageHeaderV2 states that page
boundaries are also record boundaries:

https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532

DataPageHeader (V1) has no such statement, which means that in theory it
allows a record to span multiple pages. Is that really the case, or is it
something that is missing from the specification?

I ask because filtering pages based on the page index is much simpler if
page boundaries are record boundaries as well.
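
To illustrate why alignment helps, here is a rough sketch, not taken from
Impala or parquet-mr. It assumes every page starts on a record boundary and
that the first row index of each page is known. The helper names
(page_row_ranges, pages_overlapping) and the sample numbers are hypothetical.
Under that assumption each page covers a whole row range, so rows selected
in one column can be mapped to pages of another column by plain interval
overlap:

```python
def page_row_ranges(first_row_indexes, num_rows):
    """Return one [start, end) row range per page.

    Assumes pages are aligned to record boundaries, so each page starts
    at a known row and ends where the next page begins.
    """
    ends = list(first_row_indexes[1:]) + [num_rows]
    return list(zip(first_row_indexes, ends))


def pages_overlapping(first_row_indexes, num_rows, wanted):
    """Return indexes of pages that intersect any wanted [start, end) range."""
    selected = []
    ranges = page_row_ranges(first_row_indexes, num_rows)
    for page, (start, end) in enumerate(ranges):
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < w_end and w_start < end for w_start, w_end in wanted):
            selected.append(page)
    return selected


# Hypothetical row group with 400 rows.
# A filter on column A kept only rows [100, 250); find the pages of
# column B that must be read to cover those rows.
column_b_first_rows = [0, 60, 120, 300]
print(pages_overlapping(column_b_first_rows, 400, [(100, 250)]))
```

If a record could straddle two pages, a page would no longer map to a whole
row range, and the reader would need extra bookkeeping for the partial
record at each page edge.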

Thanks,
    Zoltan

Re: page and record boundaries

Posted by Zoltan Borok-Nagy <bo...@cloudera.com.INVALID>.
Thank you all for your quick responses!

BR,
    Zoltan



Re: page and record boundaries

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Zoltan,

If I remember correctly, this is what we had in mind regarding this question:

Although page boundaries for v1 pages do not have to be record boundaries,
nothing prevents us from implementing a writer that does align pages to
record boundaries. (Of course, on the read path, we have to be able to
handle pages that are not aligned with respect to records.)

When we add indexes to the picture, aligning pages to record boundaries
becomes desirable. And since indexes are a new feature, we can simply
implement them in a way so that we always do this alignment when writing
pages with indexes.
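
A minimal sketch of such a writer-side split, assuming the writer sees the
repetition levels of the values it buffers (a repetition level of 0 marks
the first value of a new record). The function name and the
target_values_per_page parameter are hypothetical and do not correspond to
any existing writer API:

```python
def split_on_record_boundaries(rep_levels, target_values_per_page):
    """Split buffered values into pages that each begin a new record.

    rep_levels[i] == 0 marks the first value of a record. Returns a list
    of [start, end) value index ranges, one per page.
    """
    pages = []
    start = 0
    for i, level in enumerate(rep_levels):
        # Only cut in front of a value that starts a new record, and only
        # once the current page has reached the target size.
        if level == 0 and i - start >= target_values_per_page:
            pages.append((start, i))
            start = i
    if start < len(rep_levels):
        # Flush whatever remains as the last page.
        pages.append((start, len(rep_levels)))
    return pages


# Hypothetical repeated column: records of 3, 2, 1 and 4 values.
rep_levels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(split_on_record_boundaries(rep_levels, 3))
```

A page may exceed the target size when a record is large, because the split
never happens in the middle of a record.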

Br,

Zoltan


Re: page and record boundaries

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Zoltan,

If you write the new page index structure, you're required to split pages
on record boundaries even when using v1 pages. By placing that requirement
on the write side, we ensure that this feature is compatible with both v1
and v2 pages.

rb



-- 
Ryan Blue
Software Engineer
Netflix