Posted to dev@iceberg.apache.org by Anton Okolnychyi <ao...@apple.com.INVALID> on 2019/05/28 14:42:49 UTC

Future of Iceberg Parquet Reader

Hi,

I see more and more questions around the Iceberg Parquet reader. I think it would be useful to have a thread that clarifies all open questions and explains the long-term plan.

1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes? A common question I get is why not improve parquet-mr instead of writing a new reader? I know that almost every system that cares about performance has its own reader, but why so? 

2. Iceberg filters out row groups based on stats and dictionary pages on its own whereas the Spark reader simply sets filters and relies on parquet-mr to do the filtering. My assumption is that there is a problem in parquet-mr. Is that correct? Is it somehow related to record materialization?

3. At some point, Julien Le Dem gave a talk about supporting page skipping in Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth'. Basically, filtering data in columns based on predicates on other columns. It is a highly anticipated feature on our end. Can somebody clarify if it will be part of parquet-mr or we will have to implement this in Iceberg?

4. What is the long-term vision for the Parquet reader in Iceberg? Are there any plans to submit parts of it to parquet-mr? Will the Iceberg reader be mostly independent of parquet-mr?

5. We are considering reading Parquet data into Arrow. Will it be something specific to Iceberg or generally available? I believe it is quite a common use case.

Thanks,
Anton


Re: Future of Iceberg Parquet Reader

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Am I correct that performance is the main reason to have a custom reader in
Iceberg? Are there any other purposes? A common question I get is why not
improve parquet-mr instead of writing a new reader? I know that almost
every system that cares about performance has its own reader, but why so?

Performance wasn’t the only reason. The record materialization API in
Parquet is complicated and required some hacks to handle Iceberg’s id-based
column resolution, like Dan noted. It is also easier to avoid translating
to Parquet’s expressions. I think this is also going to be a benefit for
the vectorization work, where Parquet Java doesn’t have a clear direction.
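
For illustration, here is a rough sketch of the translation step we get to
skip. The parquet-mr calls (FilterApi, Binary) and the Iceberg Expressions
call are written from memory, and the column name and value are made up, so
treat this as a sketch rather than exact API usage:

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.io.api.Binary;

public class FilterTranslationSketch {
  public static void main(String[] args) {
    // Iceberg expression: bound against the table schema by Iceberg's own readers.
    Expression icebergFilter = Expressions.equal("event_type", "click");

    // Roughly equivalent parquet-mr predicate that a Spark-style integration
    // has to translate engine filters into before handing them to parquet-mr.
    FilterPredicate parquetFilter =
        FilterApi.eq(FilterApi.binaryColumn("event_type"), Binary.fromString("click"));

    System.out.println(icebergFilter);
    System.out.println(parquetFilter);
  }
}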

Iceberg filters out row groups based on stats and dictionary pages on its
own whereas the Spark reader simply sets filters and relies on parquet-mr
to do the filtering. My assumption is that there is a problem in parquet-mr.
Is that correct? Is it somehow related to record materialization?

There aren’t known bugs in Spark’s version. But if we want to extend the
expressions that Iceberg supports, it is much easier to have the stats and
dictionary filters in Iceberg. That way we can add contains, startsWith,
and others directly.
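
As a rough illustration of why it's easier to do this on our side, a stats
check for startsWith only needs a row group's lower and upper bounds.
ColumnStats below is a made-up stand-in for the per-column min/max that
Parquet keeps in its footer; it is not a real parquet-mr or Iceberg class:

public class RowGroupPruningSketch {

  // Stand-in for per-row-group column statistics (min/max as strings).
  static final class ColumnStats {
    final String min;
    final String max;
    ColumnStats(String min, String max) { this.min = min; this.max = max; }
  }

  // Returns false only when no value in [min, max] can start with the prefix,
  // so it is safe to skip the whole row group in that case.
  static boolean mightStartWith(ColumnStats stats, String prefix) {
    // Truncate both bounds to the prefix length before comparing.
    String minPrefix = stats.min.substring(0, Math.min(prefix.length(), stats.min.length()));
    String maxPrefix = stats.max.substring(0, Math.min(prefix.length(), stats.max.length()));
    return minPrefix.compareTo(prefix) <= 0 && maxPrefix.compareTo(prefix) >= 0;
  }

  public static void main(String[] args) {
    ColumnStats rowGroup = new ColumnStats("aardvark", "mango");
    System.out.println(mightStartWith(rowGroup, "ze")); // false -> skip the row group
    System.out.println(mightStartWith(rowGroup, "ba")); // true  -> must read it
  }
}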

I should also note that Iceberg delegates record filtering back to Spark
because Spark will build a codegen filter. When Iceberg supports the page
indexes in Parquet, it will probably make sense to do this in Iceberg or
Parquet instead.

At some point, Julien Le Dem gave a talk about supporting page skipping in
Parquet. His primary example was SELECT a, b FROM t WHERE c = ‘smth’.
Basically, filtering data in columns based on predicates on other columns.
It is a highly anticipated feature on our end. Can somebody clarify if it
will be part of parquet-mr or we will have to implement this in Iceberg?

Parquet is adding page indexes in the next release. Because Iceberg doesn’t
use Parquet’s record materialization code, we would probably need to
implement it separately. We’ll see what makes sense when the feature is
released and can be used. If Parquet doesn’t support vectorized reads, then
Iceberg would need to implement the vectorized version anyway.
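
Conceptually, page-index skipping looks something like the sketch below:
evaluate the predicate against per-page min/max on the filter column, keep
the surviving row ranges, and only decode those rows from the other
columns. The Page type and the long[] row ranges are illustrative, not
parquet-mr classes:

import java.util.ArrayList;
import java.util.List;

public class PageIndexSketch {

  // Stand-in for one entry of a column index: the row span of a page plus
  // its min/max values.
  static final class Page {
    final long firstRow;
    final long lastRow; // inclusive
    final long min;
    final long max;
    Page(long firstRow, long lastRow, long min, long max) {
      this.firstRow = firstRow; this.lastRow = lastRow;
      this.min = min; this.max = max;
    }
  }

  // Row ranges of the filter column whose pages might contain the value;
  // the projected columns only need to decode pages overlapping these rows.
  static List<long[]> matchingRowRanges(List<Page> filterColumnIndex, long value) {
    List<long[]> ranges = new ArrayList<>();
    for (Page page : filterColumnIndex) {
      if (page.min <= value && value <= page.max) {
        ranges.add(new long[] {page.firstRow, page.lastRow});
      }
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Column index for c in "SELECT a, b FROM t WHERE c = 200".
    List<Page> cIndex = List.of(
        new Page(0, 999, 1, 50),
        new Page(1000, 1999, 51, 120),
        new Page(2000, 2999, 121, 400));
    // Only rows 2000-2999 survive, so pages of a and b covering rows 0-1999
    // never need to be decoded.
    for (long[] range : matchingRowRanges(cIndex, 200)) {
      System.out.println(range[0] + ".." + range[1]);
    }
  }
}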

What is the long-term vision for the Parquet reader in Iceberg? Are there
any plans to submit parts of it to parquet-mr? Will the Iceberg reader be
mostly independent of parquet-mr?

In the long term, I’d like to move components to Parquet. I’ve proposed
adding Iceberg’s expressions and record materialization to Parquet. But
these would probably require a Parquet 2.0 API because Iceberg’s record
readers are significantly different. There’s also a fair amount of work
needed to generalize how we build the reader tree.

Right now, we can iterate faster by developing new features in Iceberg.

We are considering reading Parquet data into Arrow. Will it be something
specific to Iceberg or generally available? I believe it is quite a common
use case.

At first, I think that we will probably make Iceberg-specific assumptions,
like using Iceberg expressions for data filters and resolving columns by
ID. But the core functionality will be general-purpose and we should be
able to move that upstream when it is stable.
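
To make the ID-resolution assumption concrete, here is a minimal sketch;
the map stands in for the field-id metadata Iceberg attaches to Parquet
schemas and is not a real API:

import java.util.Map;
import java.util.Optional;

public class IdResolutionSketch {

  // Projected columns are matched to a file's columns by field id, not by
  // name, so renames don't break files written under an older schema.
  static Optional<String> resolveColumn(Map<Integer, String> fileIdToName, int projectedFieldId) {
    return Optional.ofNullable(fileIdToName.get(projectedFieldId));
  }

  public static void main(String[] args) {
    // An older data file written before "ts" was renamed to "event_time".
    Map<Integer, String> olderFile = Map.of(1, "id", 2, "ts");

    // The table now projects field id 2 as "event_time"; by-id resolution
    // still finds the right physical column ("ts") in the old file.
    System.out.println(resolveColumn(olderFile, 2).orElse("<missing: read nulls>"));
  }
}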

Because Iceberg is structured as a library that delegates file reads to
engines, everything can be used on an individual file. This should be
immediately useful for anyone willing to pull in the iceberg-arrow
dependency.

-- 
Ryan Blue
Software Engineer
Netflix

Re: Future of Iceberg Parquet Reader

Posted by Wes McKinney <we...@gmail.com>.
On Tue, May 28, 2019 at 11:19 AM Daniel Weeks
<dw...@netflix.com.invalid> wrote:
>
> Hey Anton,
>
> #1) Part of the reason Iceberg has a custom reader is to help resolve some of the Iceberg specific aspects of how parquet files are read (e.g. column resolution by id, iceberg expressions).  Also, it's been a struggle to get agreement on a good vectorized api.  I don't believe the objective of building a read path in Iceberg was strictly about performance, but more about being able to iterate quickly and prototype new approaches that we hope to ultimately feed back to the parquet-mr project.
>
> #2) Iceberg isn't using the same path, so row group and dictionary filtering are handled by Iceberg (though record-level filtering is not).  I don't believe this is any specific problem with parquet-mr other than where it is implemented in the read path and that Iceberg has its own expression implementation.
>
> #3) There is active work going on related to page skipping (PARQUET-1201).  I believe this may be what you are referring to.
>
> #4) Ideally we would be able to contribute back the read/write path implementations to parquet-mr and get an updated API that can be used for future development.
>
> #5) We do intend to build a native parquet to arrow read path.  Initially this will likely be specific to Iceberg as we iterate on the implementation, but we hope it will be generally usable across multiple engines.  For example, both Spark and Presto have custom read paths for parquet and their own columnar memory format.  We hope that we can build an Iceberg read path that can be used across both and leverage arrow natively through columnar api abstractions.
>

Presto and Spark aren't exactly analogous because they have bespoke
in-memory formats, so it wouldn't make sense to develop Parquet
serialization/deserialization anywhere else. It would be unfortunate
if Parquet-to-Arrow conversion in Java at the granularity of a single
file is hidden behind Iceberg business logic, so I would encourage you
to make the lower-level single file interface as accessible to general
Arrow users as possible.
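
To make that concrete, a lower-level single-file interface could look
something like the sketch below. None of these names come from an existing
library; it is only meant to show the granularity I have in mind:

import java.io.Closeable;
import java.util.Iterator;
import org.apache.arrow.vector.VectorSchemaRoot;

// Hypothetical shape of a single-file Parquet-to-Arrow reader: one file in,
// Arrow record batches out, with no table or snapshot metadata required.
public interface ParquetArrowFileReader extends Closeable, Iterator<VectorSchemaRoot> {

  interface Builder {
    Builder project(String... columnNames); // read only a subset of columns
    Builder batchSize(int rows);            // rows per Arrow batch
    ParquetArrowFileReader build();
  }
}

An engine (or a plain Arrow user) could then open a single file and iterate
Arrow batches without touching any Iceberg table metadata.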


Re: Future of Iceberg Parquet Reader

Posted by Daniel Weeks <dw...@netflix.com.INVALID>.
Hey Anton,

#1) Part of the reason Iceberg has a custom reader is to help resolve some
of the Iceberg specific aspects of how parquet files are read (e.g. column
resolution by id, iceberg expressions).  Also, it's been a struggle to get
agreement on a good vectorized api.  I don't believe the objective of
building a read path in Iceberg was strictly about performance, but more
about being able to iterate quickly and prototype new approaches that we
hope to ultimately feed back to the parquet-mr project.

#2) Iceberg isn't using the same path, so row group and dictionary
filtering are handled by Iceberg (though record-level filtering is not). I
don't believe this is any specific problem with parquet-mr other than
where it is implemented in the read path and that Iceberg has its own
expression implementation.

#3) There is active work going on related to page skipping (PARQUET-1201
<https://issues.apache.org/jira/browse/PARQUET-1201>).  I believe this may
be what you are referring to.

#4) Ideally we would be able to contribute back the read/write path
implementations to parquet-mr and get an updated API that can be used for
future development.

#5) We do intend to build a native parquet to arrow read path.  Initially
this will likely be specific to Iceberg as we iterate on the
implementation, but we hope it will be generally usable across multiple
engines.  For example, both Spark and Presto have custom read paths for
parquet and their own columnar memory format.  We hope that we can build an
Iceberg read path that can be used across both and leverage arrow natively
through columnar api abstractions.

-Dan


Re: Future of Iceberg Parquet Reader

Posted by Wes McKinney <we...@gmail.com>.
hi Anton,

On point #5, I would suggest doing the work either in Apache Arrow or
in the Parquet Java project -- we are developing both Parquet C++ and
Rust codebases within the apache/arrow repository so I think you would
find an active community there. I know that there has been a lot of
interest in decoupling from Hadoop-related Java dependencies, so you
might also think about how to do that at the same time.
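
As a sketch of what decoupling could mean in practice, the reader's I/O
could be expressed against a minimal seekable-input abstraction instead of
Hadoop's FileSystem/Path. The interfaces below are illustrative (Iceberg
has a similar pattern internally) and not taken from any project as-is:

import java.io.IOException;

// Hypothetical Hadoop-free input abstraction: enough to locate a file,
// learn its length, and read it at arbitrary offsets.
interface PlainInputFile {
  String location();
  long length() throws IOException;
  SeekableByteStream newStream() throws IOException;
}

interface SeekableByteStream extends AutoCloseable {
  void seek(long newPosition) throws IOException;
  long position() throws IOException;
  int read(byte[] buffer, int offset, int length) throws IOException;
}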

- Wes
