Posted to dev@parquet.apache.org by Micah Kornfield <em...@gmail.com> on 2021/06/01 18:52:39 UTC

Order of encodings?

I couldn't find anything in the specification on this, but is there any
constraint on the ordering of encoded pages in a column for a row group?

I think in practice most implementations try to dictionary-encode first and
then fall back to another encoding if the dictionary doesn't yield benefits
or grows too big.  But in theory, is the ordering of pages described below
possible?

1. Dictionary Encoded Page
2. Plain Encoded Page
3. Dictionary Encoded Page
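
To make the "dictionary first, then fall back" behavior concrete, here is a
minimal Python sketch of how a writer might pick an encoding per page. It is
a toy model, not code from any Parquet implementation: `choose_page_encodings`
and `MAX_DICT_ENTRIES` are hypothetical names, the limit is an entry count
rather than a byte size, and the sketch assumes the fallback is permanent for
the rest of the column chunk.

```python
from typing import Dict, List, Sequence

# Hypothetical limit for illustration; real writers typically use a
# configurable size threshold instead of an entry count.
MAX_DICT_ENTRIES = 4


def choose_page_encodings(pages: Sequence[Sequence[str]],
                          max_dict_entries: int = MAX_DICT_ENTRIES) -> List[str]:
    """Return the encoding a dictionary-first writer would pick for each page."""
    dictionary: Dict[str, int] = {}  # value -> index, shared by the column chunk
    fell_back = False                # assumption: once set, never cleared
    encodings: List[str] = []
    for page in pages:
        if not fell_back:
            # Tentatively add this page's new values to the dictionary.
            candidate = dict(dictionary)
            for value in page:
                candidate.setdefault(value, len(candidate))
            if len(candidate) <= max_dict_entries:
                dictionary = candidate
                encodings.append("DICTIONARY")
                continue
            fell_back = True         # dictionary grew too big: stop using it
        encodings.append("PLAIN")
    return encodings
```

Under these assumptions the writer only ever emits dictionary pages followed
by plain pages, so the ordering listed above would have to come from a writer
that is willing to return to the dictionary after falling back.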

Thanks,
Micah

Re: Order of encodings?

Posted by Micah Kornfield <em...@gmail.com>.
Thanks. I'm not currently planning on adding any writers; I just wanted to
understand some edge cases that could happen (and some old C++ reading
code).

Cheers,
Micah

On Wed, Jun 2, 2021 at 1:28 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Micah,
>
> What you described is how parquet-mr works on the write path. Also,
> based on the parquet-mr code, it seems that the scenario you described can
> be read properly. (If we think we have, or will have, writers that support
> such scenarios, we should write unit tests for them.)
>
> Cheers,
> Gabor

Re: Order of encodings?

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Micah,

What you described is how parquet-mr works on the write path. Also,
based on the parquet-mr code, it seems that the scenario you described can
be read properly. (If we think we have, or will have, writers that support
such scenarios, we should write unit tests for them.)
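
As a rough illustration of why such an ordering can be readable, here is a
minimal Python sketch of a reader that dispatches on each data page's own
encoding while keeping the dictionary available for the whole column chunk.
This is a toy model and not parquet-mr code: `read_column_chunk`, the `Page`
tuple, and the string encoding tags are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Each data page is modeled as (encoding, payload). For "DICTIONARY" pages the
# payload is a list of indices into the dictionary; for "PLAIN" pages it is
# the values themselves. Both tags are stand-ins chosen for this sketch.
Page = Tuple[str, Sequence]


def read_column_chunk(dictionary: Sequence[str],
                      pages: Sequence[Page]) -> List[str]:
    """Decode pages in order, choosing the decoder per page."""
    values: List[str] = []
    for encoding, payload in pages:
        if encoding == "DICTIONARY":
            # The dictionary stays in scope across pages, so a dictionary
            # page that follows a plain page can still be resolved.
            values.extend(dictionary[i] for i in payload)
        elif encoding == "PLAIN":
            values.extend(payload)
        else:
            raise ValueError(f"unsupported encoding: {encoding}")
    return values
```

A unit test for the scenario could feed this kind of reader a dictionary
page, a plain page, and another dictionary page, and check that the values
come back in order.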

Cheers,
Gabor
