You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by Max Konstantinov <ko...@gmail.com> on 2024/02/13 01:27:03 UTC

[WIP][Proposal] PARQUET-2430: Add parquet joiner

Hi Parquet dev team!


I wanted to ask your opinion on the proposal I came up with.
PR: https://github.com/apache/parquet-mr/pull/1273
JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
PR's description and JIRA ticket contains all the details, please check it
out. The feature is not yet ready to merge, it is just a proposal for now.
I wanted to ask a PARQUET community opinion if you see any obstacles for
adding it? We find it very useful and plan to use it and if PARQUET
community finds no issues with it I can add tests, javadocs and polish it
so we can add this new feature to PARQUET.


Max.

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Posted by Max Konstantinov <ko...@gmail.com>.

Got it,

Let me start working on this. I will modify the PR, It will probably look
clunky at first iteration, but we can converge it, eliminate code
duplication and try to make it simple at the end.

Max.

On Mon, Feb 19, 2024 at 1:49 AM Gang Wu <us...@gmail.com> wrote:

> I agree it is challenging to extend ParquetRewriter. However, it would also
> add maintenance overhead if two classes share a lot of similar features.
> For example, we have to implement a new feature twice to make sure both
> rewriter and joiner support it. The birth of class ParquetRewriter itself
> is the
> result of a refactoring work to merge different classes with
> similar features.
> IMO, it is OK not to support all existing rewrite features (e.g. nullify
> column)
> while joining columns from different files. WDYT?
>
> For the potential bugs, we can file JIRA issues to discuss them separately.
>
> Best,
> Gang
>
> On Mon, Feb 19, 2024 at 10:36 AM Xinli shang <sh...@uber.com.invalid>
> wrote:
>
> > HI Max,
> >
> > This is a very interesting feature with a great idea!
> >
> > Xinli
> >
> > On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov <
> > konstantinov.maxim@gmail.com> wrote:
> >
> > > Hi Gang,
> > >
> > >
> > > I actually started working on this feature by trying to extend
> > > ParquetRewriter with that new capability, but I quickly ran into
> issues:
> > > - Many state holder variables in ParquetRewriter will have to be
> > duplicated
> > > for the left / right side of the join
> > > - Some methods(ex: nullifyColumn) are very close but not the same for
> > > joiner and will require more branching in the existing codebase of
> > > ParquetRewriter
> > > - Tests for a Join part will have to pay a special attention to the
> right
> > > side of the join as it is a new thing in a joiner so it will blow out
> > test
> > > class quite a bit
> > > To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter
> > while
> > > possible will potentially make codebase too complex and hard to reason
> > > about.
> > >
> > > I thought about it, maybe we can utilize Factory method & Builder
> > patterns
> > > for this? For example we can:
> > > - Unify Options(RewriteOptions/JoinOptions) into a single class, if one
> > of
> > > the final implementation is not supporting a certain feature it should
> > > throw exception during construction
> > > - Use Factory pattern approach and pick the actual final implementation
> > of
> > > the class based on provided options
> > > - Both ParquetRewriter & ParquetJoiner will implement a new Interface
> > that
> > > has processBlocks() & close() public method
> > > - Use Builder pattern approach and make all methods including
> > constructors
> > > private besides those that need to be exposed to users
> > > By using this approach we can simplify internal implementation by
> > dividing
> > > it into separate dedicated smaller sub-modules while still providing a
> > > single feature rich external API. Let me know what you think.
> > >
> > > Also I might be wrong but I’ve noticed a few potential issues with
> > > ParquetRewriter:
> > > - not sure if that is a bug in but is not we supposed to consume() on
> the
> > > reader, here is the place in ParquetJoiner
> > > <
> > >
> >
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345
> > > >
> > > you can find the same place in ParquetRewriter
> > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771
> > > >
> > > ?
> > > - newSchema() & extractField() methods seem to have a small issue with
> > > complex nested schemas, here is the line that is different in
> > > ParquetRewriter
> > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815
> > > >
> > > and here is ParquetJoiner
> > > <
> > >
> >
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390
> > > >.
> > > ParquetRewriter’s version originally failed in my tests, it originally
> > had
> > > a different schema, I will try to reproduce it later, right now it
> > works, I
> > > need to go back and check the commit history.
> > >
> > > btw
> > > I wanted to start covering ParquetJoiner with extensive tests the next
> > > week.
> > > Also we built POC for one of our projects based on the current version
> of
> > > ParquetJoiner and it showed up to x10 improvement in performance in
> > > comparison with a default join implementation using Spark.
> > >
> > >
> > > Max.
> > >
> > > On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <us...@gmail.com> wrote:
> > >
> > > > Hi Max,
> > > >
> > > > Thanks for proposing the joiner! I simply took a glimpse of the PR
> and
> > > > it looks promising to me. My general question is on the possibility
> of
> > > > consolidating the work with ParquetRewriter, which shares a lot of
> > > > common rewriting logic.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> > > > konstantinov.maxim@gmail.com> wrote:
> > > >
> > > > > Hi Parquet dev team!
> > > > >
> > > > >
> > > > > I wanted to ask your opinion on the proposal I came up with.
> > > > > PR: https://github.com/apache/parquet-mr/pull/1273
> > > > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > > > > PR's description and JIRA ticket contains all the details, please
> > check
> > > > it
> > > > > out. The feature is not yet ready to merge, it is just a proposal
> for
> > > > now.
> > > > > I wanted to ask a PARQUET community opinion if you see any
> obstacles
> > > for
> > > > > adding it? We find it very useful and plan to use it and if PARQUET
> > > > > community finds no issues with it I can add tests, javadocs and
> > polish
> > > it
> > > > > so we can add this new feature to PARQUET.
> > > > >
> > > > >
> > > > > Max.
> > > > >
> > > >
> > >
> >
> >
> > --
> > Xinli Shang
> >
>

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Posted by Gang Wu <us...@gmail.com>.

I agree it is challenging to extend ParquetRewriter. However, it would also
add maintenance overhead if two classes share a lot of similar features.
For example, we have to implement a new feature twice to make sure both
rewriter and joiner support it. The birth of class ParquetRewriter itself
is the
result of a refactoring work to merge different classes with
similar features.
IMO, it is OK not to support all existing rewrite features (e.g. nullify
column)
while joining columns from different files. WDYT?

For the potential bugs, we can file JIRA issues to discuss them separately.

Best,
Gang

On Mon, Feb 19, 2024 at 10:36 AM Xinli shang <sh...@uber.com.invalid>
wrote:

> HI Max,
>
> This is a very interesting feature with a great idea!
>
> Xinli
>
> On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov <
> konstantinov.maxim@gmail.com> wrote:
>
> > Hi Gang,
> >
> >
> > I actually started working on this feature by trying to extend
> > ParquetRewriter with that new capability, but I quickly ran into issues:
> > - Many state holder variables in ParquetRewriter will have to be
> duplicated
> > for the left / right side of the join
> > - Some methods(ex: nullifyColumn) are very close but not the same for
> > joiner and will require more branching in the existing codebase of
> > ParquetRewriter
> > - Tests for a Join part will have to pay a special attention to the right
> > side of the join as it is a new thing in a joiner so it will blow out
> test
> > class quite a bit
> > To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter
> while
> > possible will potentially make codebase too complex and hard to reason
> > about.
> >
> > I thought about it, maybe we can utilize Factory method & Builder
> patterns
> > for this? For example we can:
> > - Unify Options(RewriteOptions/JoinOptions) into a single class, if one
> of
> > the final implementation is not supporting a certain feature it should
> > throw exception during construction
> > - Use Factory pattern approach and pick the actual final implementation
> of
> > the class based on provided options
> > - Both ParquetRewriter & ParquetJoiner will implement a new Interface
> that
> > has processBlocks() & close() public method
> > - Use Builder pattern approach and make all methods including
> constructors
> > private besides those that need to be exposed to users
> > By using this approach we can simplify internal implementation by
> dividing
> > it into separate dedicated smaller sub-modules while still providing a
> > single feature rich external API. Let me know what you think.
> >
> > Also I might be wrong but I’ve noticed a few potential issues with
> > ParquetRewriter:
> > - not sure if that is a bug in but is not we supposed to consume() on the
> > reader, here is the place in ParquetJoiner
> > <
> >
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345
> > >
> > you can find the same place in ParquetRewriter
> > <
> >
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771
> > >
> > ?
> > - newSchema() & extractField() methods seem to have a small issue with
> > complex nested schemas, here is the line that is different in
> > ParquetRewriter
> > <
> >
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815
> > >
> > and here is ParquetJoiner
> > <
> >
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390
> > >.
> > ParquetRewriter’s version originally failed in my tests, it originally
> had
> > a different schema, I will try to reproduce it later, right now it
> works, I
> > need to go back and check the commit history.
> >
> > btw
> > I wanted to start covering ParquetJoiner with extensive tests the next
> > week.
> > Also we built POC for one of our projects based on the current version of
> > ParquetJoiner and it showed up to x10 improvement in performance in
> > comparison with a default join implementation using Spark.
> >
> >
> > Max.
> >
> > On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <us...@gmail.com> wrote:
> >
> > > Hi Max,
> > >
> > > Thanks for proposing the joiner! I simply took a glimpse of the PR and
> > > it looks promising to me. My general question is on the possibility of
> > > consolidating the work with ParquetRewriter, which shares a lot of
> > > common rewriting logic.
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> > > konstantinov.maxim@gmail.com> wrote:
> > >
> > > > Hi Parquet dev team!
> > > >
> > > >
> > > > I wanted to ask your opinion on the proposal I came up with.
> > > > PR: https://github.com/apache/parquet-mr/pull/1273
> > > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > > > PR's description and JIRA ticket contains all the details, please
> check
> > > it
> > > > out. The feature is not yet ready to merge, it is just a proposal for
> > > now.
> > > > I wanted to ask a PARQUET community opinion if you see any obstacles
> > for
> > > > adding it? We find it very useful and plan to use it and if PARQUET
> > > > community finds no issues with it I can add tests, javadocs and
> polish
> > it
> > > > so we can add this new feature to PARQUET.
> > > >
> > > >
> > > > Max.
> > > >
> > >
> >
>
>
> --
> Xinli Shang
>

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Posted by Xinli shang <sh...@uber.com.INVALID>.

HI Max,

This is a very interesting feature with a great idea!

Xinli

On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov <
konstantinov.maxim@gmail.com> wrote:

> Hi Gang,
>
>
> I actually started working on this feature by trying to extend
> ParquetRewriter with that new capability, but I quickly ran into issues:
> - Many state holder variables in ParquetRewriter will have to be duplicated
> for the left / right side of the join
> - Some methods(ex: nullifyColumn) are very close but not the same for
> joiner and will require more branching in the existing codebase of
> ParquetRewriter
> - Tests for a Join part will have to pay a special attention to the right
> side of the join as it is a new thing in a joiner so it will blow out test
> class quite a bit
> To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter while
> possible will potentially make codebase too complex and hard to reason
> about.
>
> I thought about it, maybe we can utilize Factory method & Builder patterns
> for this? For example we can:
> - Unify Options(RewriteOptions/JoinOptions) into a single class, if one of
> the final implementation is not supporting a certain feature it should
> throw exception during construction
> - Use Factory pattern approach and pick the actual final implementation of
> the class based on provided options
> - Both ParquetRewriter & ParquetJoiner will implement a new Interface that
> has processBlocks() & close() public method
> - Use Builder pattern approach and make all methods including constructors
> private besides those that need to be exposed to users
> By using this approach we can simplify internal implementation by dividing
> it into separate dedicated smaller sub-modules while still providing a
> single feature rich external API. Let me know what you think.
>
> Also I might be wrong but I’ve noticed a few potential issues with
> ParquetRewriter:
> - not sure if that is a bug in but is not we supposed to consume() on the
> reader, here is the place in ParquetJoiner
> <
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345
> >
> you can find the same place in ParquetRewriter
> <
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771
> >
> ?
> - newSchema() & extractField() methods seem to have a small issue with
> complex nested schemas, here is the line that is different in
> ParquetRewriter
> <
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815
> >
> and here is ParquetJoiner
> <
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390
> >.
> ParquetRewriter’s version originally failed in my tests, it originally had
> a different schema, I will try to reproduce it later, right now it works, I
> need to go back and check the commit history.
>
> btw
> I wanted to start covering ParquetJoiner with extensive tests the next
> week.
> Also we built POC for one of our projects based on the current version of
> ParquetJoiner and it showed up to x10 improvement in performance in
> comparison with a default join implementation using Spark.
>
>
> Max.
>
> On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <us...@gmail.com> wrote:
>
> > Hi Max,
> >
> > Thanks for proposing the joiner! I simply took a glimpse of the PR and
> > it looks promising to me. My general question is on the possibility of
> > consolidating the work with ParquetRewriter, which shares a lot of
> > common rewriting logic.
> >
> > Best,
> > Gang
> >
> > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> > konstantinov.maxim@gmail.com> wrote:
> >
> > > Hi Parquet dev team!
> > >
> > >
> > > I wanted to ask your opinion on the proposal I came up with.
> > > PR: https://github.com/apache/parquet-mr/pull/1273
> > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > > PR's description and JIRA ticket contains all the details, please check
> > it
> > > out. The feature is not yet ready to merge, it is just a proposal for
> > now.
> > > I wanted to ask a PARQUET community opinion if you see any obstacles
> for
> > > adding it? We find it very useful and plan to use it and if PARQUET
> > > community finds no issues with it I can add tests, javadocs and polish
> it
> > > so we can add this new feature to PARQUET.
> > >
> > >
> > > Max.
> > >
> >
>


-- 
Xinli Shang

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Posted by Max Konstantinov <ko...@gmail.com>.

Hi Gang,


I actually started working on this feature by trying to extend
ParquetRewriter with that new capability, but I quickly ran into issues:
- Many state holder variables in ParquetRewriter will have to be duplicated
for the left / right side of the join
- Some methods(ex: nullifyColumn) are very close but not the same for
joiner and will require more branching in the existing codebase of
ParquetRewriter
- Tests for a Join part will have to pay a special attention to the right
side of the join as it is a new thing in a joiner so it will blow out test
class quite a bit
To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter while
possible will potentially make codebase too complex and hard to reason
about.

I thought about it, maybe we can utilize Factory method & Builder patterns
for this? For example we can:
- Unify Options(RewriteOptions/JoinOptions) into a single class, if one of
the final implementation is not supporting a certain feature it should
throw exception during construction
- Use Factory pattern approach and pick the actual final implementation of
the class based on provided options
- Both ParquetRewriter & ParquetJoiner will implement a new Interface that
has processBlocks() & close() public method
- Use Builder pattern approach and make all methods including constructors
private besides those that need to be exposed to users
By using this approach we can simplify internal implementation by dividing
it into separate dedicated smaller sub-modules while still providing a
single feature rich external API. Let me know what you think.

Also I might be wrong but I’ve noticed a few potential issues with
ParquetRewriter:
- not sure if that is a bug in but is not we supposed to consume() on the
reader, here is the place in ParquetJoiner
<https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345>
you can find the same place in ParquetRewriter
<https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771>
?
- newSchema() & extractField() methods seem to have a small issue with
complex nested schemas, here is the line that is different in
ParquetRewriter
<https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815>
and here is ParquetJoiner
<https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390>.
ParquetRewriter’s version originally failed in my tests, it originally had
a different schema, I will try to reproduce it later, right now it works, I
need to go back and check the commit history.

btw
I wanted to start covering ParquetJoiner with extensive tests the next
week.
Also we built POC for one of our projects based on the current version of
ParquetJoiner and it showed up to x10 improvement in performance in
comparison with a default join implementation using Spark.


Max.

On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <us...@gmail.com> wrote:

> Hi Max,
>
> Thanks for proposing the joiner! I simply took a glimpse of the PR and
> it looks promising to me. My general question is on the possibility of
> consolidating the work with ParquetRewriter, which shares a lot of
> common rewriting logic.
>
> Best,
> Gang
>
> On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> konstantinov.maxim@gmail.com> wrote:
>
> > Hi Parquet dev team!
> >
> >
> > I wanted to ask your opinion on the proposal I came up with.
> > PR: https://github.com/apache/parquet-mr/pull/1273
> > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > PR's description and JIRA ticket contains all the details, please check
> it
> > out. The feature is not yet ready to merge, it is just a proposal for
> now.
> > I wanted to ask a PARQUET community opinion if you see any obstacles for
> > adding it? We find it very useful and plan to use it and if PARQUET
> > community finds no issues with it I can add tests, javadocs and polish it
> > so we can add this new feature to PARQUET.
> >
> >
> > Max.
> >
>

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Posted by Gang Wu <us...@gmail.com>.

Hi Max,

Thanks for proposing the joiner! I simply took a glimpse of the PR and
it looks promising to me. My general question is on the possibility of
consolidating the work with ParquetRewriter, which shares a lot of
common rewriting logic.

Best,
Gang

On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
konstantinov.maxim@gmail.com> wrote:

> Hi Parquet dev team!
>
>
> I wanted to ask your opinion on the proposal I came up with.
> PR: https://github.com/apache/parquet-mr/pull/1273
> JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> PR's description and JIRA ticket contains all the details, please check it
> out. The feature is not yet ready to merge, it is just a proposal for now.
> I wanted to ask a PARQUET community opinion if you see any obstacles for
> adding it? We find it very useful and plan to use it and if PARQUET
> community finds no issues with it I can add tests, javadocs and polish it
> so we can add this new feature to PARQUET.
>
>
> Max.
>