You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by Gabor Szadovszky <ga...@apache.org> on 2018/08/18 11:11:09 UTC

Status of column index in parquet-mr

Hi,

The implementation of column index (writing and filtering) is almost done.
All the implementation work was done under the PARQUET-1201
<https://issues.apache.org/jira/browse/PARQUET-1201>. Subtasks were used to
decompose the work.
Every change made was done on the separate feature branch column-indexes
<https://github.com/apache/parquet-mr/tree/column-indexes>. Only 2 small
fixes/improvements are *waiting for review* (PARQUET-1389
<https://issues.apache.org/jira/browse/PARQUET-1389> and PARQUET-1386
<https://issues.apache.org/jira/browse/PARQUET-1386>). All the other work
have already been reviewed. After the successful review of the 2 remaining
modifications I would like to merge the feature branch to master. *Any
review/comment related to these modifications or the whole branch is
welcomed.*

The implementation was done based on the original design of column indexes
<https://github.com/apache/parquet-format/blob/master/PageIndex.md> meaning
that no row alignment is required between the pages (the only requirement
is for the pages to respect row boundaries).
As we described in the preview parquet sync the desing/implementation would
be much more clear (and might perform a bit better) if the row alignment
would also be required. I would be happy to modify the implementation if we
would decide to align pages on rows.* I would like to have a final decision
on this topic before merging this feature.*

I have an *improvement idea* about handling column indexes in case of the
data is not sorted. I am curious about your oppinions.
If the data is sorted (ascending/descending) the column index based
filtering has high benefits. If the data is random or similarly unordered
the column indexes quite useless. But, there are partial orderings might
present where column indexes can be used.
My idea is to check characteristics min/max values of the column index
before writing and if it is UNORDERED we only write it if it seems to be
useful. E.g. how much the min-max ranges are overlapping. If the
overlapping is larger than e.g. 90% then column index based filtering is
not useful in most of the time.
Example of a useful UNORDERED column index [min, max]:
[10, 20]
[0, 5]
[15, 25]
[-10, 0]
Example of a ~useless UNORDERED column index:
[0, 20]
[1, 19]
[0, 19]
[1, 21]

Thanks a lot,
Gabor

Re: Status of column index in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.

Hi,

I would really appriciate if someone would review PARQUET-1389
<https://issues.apache.org/jira/browse/PARQUET-1389> and PARQUET-1386
<https://issues.apache.org/jira/browse/PARQUET-1386>. After this two
outstanding modifications I would be able to merge the whole feature branch
column-indexes <https://github.com/apache/parquet-mr/tree/column-indexes>.
Also, feel free to comment on any modification on the branch itself.

Any opinions about the improvement idea for writing column indexes only if
it would result better filtering?

Thanks a lot,
Gabor

On Tue, Aug 21, 2018 at 9:51 AM Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> Hi,
>
> Row alignment in my wording was the 1st definition in Uwe's mail. From
> column index based filtering point of view the implementation and the logic
> would be much simplier in this case but I do understand that the pages
> sizes would not be optimal. It seems, the community is against the row
> alignment so I would close this topic.
> The 2nd definition is required for column indexes and already mentioned in
> the spec. For page v2 we already do the same anyway.
> Tim, as you mentioned it would still require skipping values. The biggest
> pain in my implementation was to pass the required values through the API
> to implement the skipping. The calculation itself was not that complicated.
>
> Thanks a lot,
> Gabor
>
> On Mon, Aug 20, 2018 at 7:51 PM Tim Armstrong
> <ta...@cloudera.com.invalid> wrote:
>
>> I had a similar concern to Uwe - if there are a large number of columns
>> with variable size there does seem to be a real risk of having many tiny
>> pages.
>>
>> I wonder if we could do something in-between where we allow different page
>> sizes for different columns, but require that the row ranges for pages of
>> different columns either are the same or one contains the other. I.e. if
>> you have row ranges [a, b) and [c, d) from two different columns, then
>> either they don't overlap (c >= b || a >= d) or one contains the other (c
>> >= a && d <= b) || (a >= c && b <= d)
>>
>> E.g. if you have three columns, small, medium and large that fit 50000,
>> 20000 and 1020 values per page, you could meet the above constraint with
>> the following set of row ranges where pages are truncated when a page with
>> an enclosing row range is full.
>>
>> Small: [0, 50000), [50000, 100000)
>> Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000),
>> [70000,
>> 90000), [90000, 100000)
>> Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19390, 20000), ...
>>
>> That seems like it would simplify the calculation of the relevant pages on
>> the read path, although you would still need to have logic to skip values
>> within a page.
>>
>>
>>
>> On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> > Hello Gabor,
>> >
>> > comment in-line
>> >
>> > > The implementation was done based on the original design of column
>> > indexes
>> > > <https://github.com/apache/parquet-format/blob/master/PageIndex.md>
>> > meaning
>> > > that no row alignment is required between the pages (the only
>> requirement
>> > > is for the pages to respect row boundaries).
>> > > As we described in the preview parquet sync the desing/implementation
>> > would
>> > > be much more clear (and might perform a bit better) if the row
>> alignment
>> > > would also be required. I would be happy to modify the implementation
>> if
>> > we
>> > > would decide to align pages on rows.* I would like to have a final
>> > decision
>> > > on this topic before merging this feature.*
>> >
>> > I'm not 100% certain what "row alignment" could mean, I thinking of two
>> > very different things.
>> >
>> > 1.  It would mean that all columns in a RowGroup would have the same
>> > number of pages that would all align on the same set of rows.
>> > 2. It would mean that pages are only split on the highest nesting level,
>> > i.e. only split on what would be the horizontal boundaries on a
>> 2D-table.
>> > I.e. not splitting any cells of this table structure.
>> >
>> > If the interpretation is 1, then I think this is generating far too much
>> > pages for very sparse columns. But I'm guessing that the interpretation
>> is
>> > rather 2 and there I would be more interested the concerns that were
>> raised
>> > in the sync. This type of alignment also is something that made me some
>> > headaches when implementing things in parquet-cpp. From a Parquet
>> > developer's perspective, this would really ease the implementation but
>> I'm
>> > wondering if there are use-cases where a single cell of a table becomes
>> > larger than what we would normally put into a page.
>> >
>> > Uwe
>> >
>>
>

Re: Status of column index in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.

Hi,

Row alignment in my wording was the 1st definition in Uwe's mail. From
column index based filtering point of view the implementation and the logic
would be much simplier in this case but I do understand that the pages
sizes would not be optimal. It seems, the community is against the row
alignment so I would close this topic.
The 2nd definition is required for column indexes and already mentioned in
the spec. For page v2 we already do the same anyway.
Tim, as you mentioned it would still require skipping values. The biggest
pain in my implementation was to pass the required values through the API
to implement the skipping. The calculation itself was not that complicated.

Thanks a lot,
Gabor

On Mon, Aug 20, 2018 at 7:51 PM Tim Armstrong
<ta...@cloudera.com.invalid> wrote:

> I had a similar concern to Uwe - if there are a large number of columns
> with variable size there does seem to be a real risk of having many tiny
> pages.
>
> I wonder if we could do something in-between where we allow different page
> sizes for different columns, but require that the row ranges for pages of
> different columns either are the same or one contains the other. I.e. if
> you have row ranges [a, b) and [c, d) from two different columns, then
> either they don't overlap (c >= b || a >= d) or one contains the other (c
> >= a && d <= b) || (a >= c && b <= d)
>
> E.g. if you have three columns, small, medium and large that fit 50000,
> 20000 and 1020 values per page, you could meet the above constraint with
> the following set of row ranges where pages are truncated when a page with
> an enclosing row range is full.
>
> Small: [0, 50000), [50000, 100000)
> Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000), [70000,
> 90000), [90000, 100000)
> Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19390, 20000), ...
>
> That seems like it would simplify the calculation of the relevant pages on
> the read path, although you would still need to have logic to skip values
> within a page.
>
>
>
> On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello Gabor,
> >
> > comment in-line
> >
> > > The implementation was done based on the original design of column
> > indexes
> > > <https://github.com/apache/parquet-format/blob/master/PageIndex.md>
> > meaning
> > > that no row alignment is required between the pages (the only
> requirement
> > > is for the pages to respect row boundaries).
> > > As we described in the preview parquet sync the desing/implementation
> > would
> > > be much more clear (and might perform a bit better) if the row
> alignment
> > > would also be required. I would be happy to modify the implementation
> if
> > we
> > > would decide to align pages on rows.* I would like to have a final
> > decision
> > > on this topic before merging this feature.*
> >
> > I'm not 100% certain what "row alignment" could mean, I thinking of two
> > very different things.
> >
> > 1.  It would mean that all columns in a RowGroup would have the same
> > number of pages that would all align on the same set of rows.
> > 2. It would mean that pages are only split on the highest nesting level,
> > i.e. only split on what would be the horizontal boundaries on a 2D-table.
> > I.e. not splitting any cells of this table structure.
> >
> > If the interpretation is 1, then I think this is generating far too much
> > pages for very sparse columns. But I'm guessing that the interpretation
> is
> > rather 2 and there I would be more interested the concerns that were
> raised
> > in the sync. This type of alignment also is something that made me some
> > headaches when implementing things in parquet-cpp. From a Parquet
> > developer's perspective, this would really ease the implementation but
> I'm
> > wondering if there are use-cases where a single cell of a table becomes
> > larger than what we would normally put into a page.
> >
> > Uwe
> >
>

Re: Status of column index in parquet-mr

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.

I had a similar concern to Uwe - if there are a large number of columns
with variable size there does seem to be a real risk of having many tiny
pages.

I wonder if we could do something in-between where we allow different page
sizes for different columns, but require that the row ranges for pages of
different columns either are the same or one contains the other. I.e. if
you have row ranges [a, b) and [c, d) from two different columns, then
either they don't overlap (c >= b || a >= d) or one contains the other (c
>= a && d <= b) || (a >= c && b <= d)

E.g. if you have three columns, small, medium and large that fit 50000,
20000 and 1020 values per page, you could meet the above constraint with
the following set of row ranges where pages are truncated when a page with
an enclosing row range is full.

Small: [0, 50000), [50000, 100000)
Medium: [0, 20000), [20000, 40000), [40000, 50000), [50000, 70000), [70000,
90000), [90000, 100000)
Large: [0, 1020), [1020, 2040), [2040, 3060), ..., [19390, 20000), ...

That seems like it would simplify the calculation of the relevant pages on
the read path, although you would still need to have logic to skip values
within a page.



On Sun, Aug 19, 2018 at 1:57 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Gabor,
>
> comment in-line
>
> > The implementation was done based on the original design of column
> indexes
> > <https://github.com/apache/parquet-format/blob/master/PageIndex.md>
> meaning
> > that no row alignment is required between the pages (the only requirement
> > is for the pages to respect row boundaries).
> > As we described in the preview parquet sync the desing/implementation
> would
> > be much more clear (and might perform a bit better) if the row alignment
> > would also be required. I would be happy to modify the implementation if
> we
> > would decide to align pages on rows.* I would like to have a final
> decision
> > on this topic before merging this feature.*
>
> I'm not 100% certain what "row alignment" could mean, I thinking of two
> very different things.
>
> 1.  It would mean that all columns in a RowGroup would have the same
> number of pages that would all align on the same set of rows.
> 2. It would mean that pages are only split on the highest nesting level,
> i.e. only split on what would be the horizontal boundaries on a 2D-table.
> I.e. not splitting any cells of this table structure.
>
> If the interpretation is 1, then I think this is generating far too much
> pages for very sparse columns. But I'm guessing that the interpretation is
> rather 2 and there I would be more interested the concerns that were raised
> in the sync. This type of alignment also is something that made me some
> headaches when implementing things in parquet-cpp. From a Parquet
> developer's perspective, this would really ease the implementation but I'm
> wondering if there are use-cases where a single cell of a table becomes
> larger than what we would normally put into a page.
>
> Uwe
>

Re: Status of column index in parquet-mr

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hello Gabor,

comment in-line

> The implementation was done based on the original design of column indexes
> <https://github.com/apache/parquet-format/blob/master/PageIndex.md> meaning
> that no row alignment is required between the pages (the only requirement
> is for the pages to respect row boundaries).
> As we described in the preview parquet sync the desing/implementation would
> be much more clear (and might perform a bit better) if the row alignment
> would also be required. I would be happy to modify the implementation if we
> would decide to align pages on rows.* I would like to have a final decision
> on this topic before merging this feature.*

I'm not 100% certain what "row alignment" could mean, I thinking of two very different things.

1.  It would mean that all columns in a RowGroup would have the same number of pages that would all align on the same set of rows.
2. It would mean that pages are only split on the highest nesting level, i.e. only split on what would be the horizontal boundaries on a 2D-table. I.e. not splitting any cells of this table structure.

If the interpretation is 1, then I think this is generating far too much pages for very sparse columns. But I'm guessing that the interpretation is rather 2 and there I would be more interested the concerns that were raised in the sync. This type of alignment also is something that made me some headaches when implementing things in parquet-cpp. From a Parquet developer's perspective, this would really ease the implementation but I'm wondering if there are use-cases where a single cell of a table becomes larger than what we would normally put into a page.

Uwe