Posted to dev@hudi.apache.org by Shawn Chang <yc...@cornell.edu> on 2023/05/24 03:16:20 UTC

[DISCUSSION] Simplify code structure for supporting multiple Spark versions in Hudi

Hi Hudi developers,

I am writing to discuss the current code structure of the existing
hudi-spark-datasource and propose a more scalable approach for supporting
multiple Spark versions. The current structure involves common code shared
by several Spark versions, such as hudi-spark-common, hudi-spark3-common,
hudi-spark3.2plus-common, etc. (a detailed description can be found in the
readme here:
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/README.md).
This setup aims to minimize duplicate code in Hudi. Hudi currently utilizes
the SparkAdapter to invoke specific code based on the Spark version,
allowing different Spark versions to trigger different logic.
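
A minimal sketch of the version-based dispatch described above, written in Java. The cut-down interface, the three adapter classes, and the `adapterFor` helper are hypothetical stand-ins and not Hudi's actual SparkAdapter API; the sketch only shows that a single lookup keyed on the Spark version decides which version-specific code runs.

```java
public class SparkAdapterSelection {

    // Hypothetical, cut-down adapter interface (assumption: the real
    // SparkAdapter exposes many more hooks than a single name).
    interface SparkAdapter {
        String name();
    }

    // One implementation per supported Spark line; all names are illustrative.
    static final class Spark24Adapter implements SparkAdapter {
        public String name() { return "Spark 2.4 adapter"; }
    }

    static final class Spark32Adapter implements SparkAdapter {
        public String name() { return "Spark 3.2 adapter"; }
    }

    static final class Spark33Adapter implements SparkAdapter {
        public String name() { return "Spark 3.3 adapter"; }
    }

    // Picks the adapter that matches the running Spark version string.
    public static SparkAdapter adapterFor(String sparkVersion) {
        if (sparkVersion.startsWith("2.4.")) return new Spark24Adapter();
        if (sparkVersion.startsWith("3.2.")) return new Spark32Adapter();
        if (sparkVersion.startsWith("3.3.")) return new Spark33Adapter();
        throw new IllegalArgumentException("Unsupported Spark version: " + sparkVersion);
    }

    public static void main(String[] args) {
        System.out.println(adapterFor("3.3.2").name());
    }
}
```

In this sketch, supporting a new Spark version means adding one adapter class and one branch in the lookup; everything that calls `adapterFor` stays unchanged.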

However, this code structure proves to be complex and hampers the process
of adding support for newer Spark versions. The current approach involves
the following steps:
1) Identify breaking changes introduced by the new Spark version and patch
affected Hudi classes.
2) Separate affected Hudi classes into different folders so that older
Spark versions can continue using the existing logic, while the new Spark
version can work with the updated Hudi classes.
3) Connect SparkAdapter to these Hudi classes, enabling Hudi to utilize the
correct code based on the Spark version.
4) Collect common code and place it in a new folder, such as
hudi-spark3.2plus-common, to reduce duplicate code.

This convoluted process has significantly slowed down the pace of adding
support for newer Spark versions in Hudi. Fortunately, there is a simpler
alternative that can streamline the process. I propose removing the common
modules and having only one folder for each Spark version. For example:




hudi-spark-datasource/
---hudi-spark2.4.0/
---hudi-spark3.2.0/
---hudi-spark3.3.0/
...

With this revised code structure, each Spark version will have its own
corresponding Hudi module. The process of adding Spark support will be
simplified as follows:
1) Copy the latest existing hudi-spark module to a new module,
hudi-spark<new_Spark_version>.
2) Identify breaking changes introduced by the new Spark version and patch
affected Hudi classes.

Let's consider some pros and cons of this new code structure:
*Pros:*
- A more readable codebase, with each Spark version having its individual
  module.
- Easier addition of support for new Spark versions by duplicating the most
  recent module and making necessary modifications.
- Simpler implementation of improvements specific to a particular Spark
  version.
*Cons:*
- Increased duplicate code (though this shouldn't impact the Hudi jar size
  during runtime, as the jar will still only contain support for one Spark
  version).
- When applying a general fix for multiple Spark versions, the fix needs to
  be applied to different Spark modules instead of a common codebase.

Please feel free to share your opinion; any feedback would be welcome!

Thank you.

Best,
Shawn

Re: [DISCUSSION] Simplify code structure for supporting multiple Spark versions in Hudi

Posted by Y Ethan Guo <yi...@apache.org>.

Hey Shawn, Rahil,

Thanks for raising this issue.  These are good suggestions; I would
recommend simplifying the code structure of Hudi Spark incrementally and
gradually making the code less coupled with the Spark engine.

> Identify breaking changes introduced by the new Spark version and patch
> affected Hudi classes.


This is important.  No matter how we organize the code structure, we
need to understand breaking changes from Spark that can affect Hudi.  Right
now, only a handful of Spark experts in our community have such knowledge
and understand how Hudi integrates with Spark at the implementation level.

We should document the integration, e.g., in an RFC, to avoid knowledge
gaps.  Based on the discussion, we should also write down the formal
process of supporting a new Spark version in Hudi, with clear testing and
certification criteria.

> The current structure involves common code shared by several Spark
> versions, such as hudi-spark-common, hudi-spark3-common,
> hudi-spark3.2plus-common, etc. (a detailed description can be found in the
> readme here:
> https://github.com/apache/hudi/blob/master/hudi-spark-datasource/README.md).
> This setup aims to minimize duplicate code in Hudi. Hudi currently utilizes
> the SparkAdapter to invoke specific code based on the Spark version,
> allowing different Spark versions to trigger different logic.


We took the current approach of having the hudi-spark3.2plus-common module
to deduplicate the code between the Spark 3.2 and Spark 3.3 integrations (
https://issues.apache.org/jira/browse/HUDI-4691), because we anticipated
that this would reduce code duplication going forward.  It has now proved
inflexible, given that Spark 3.4 changes some APIs and
hudi-spark3.2plus-common is no longer "common".

It makes sense to have a Spark version-specific module to contain classes
for specific integration logic.  I would still keep common modules for
classes that are more Hudi-relevant and applicable to all Spark versions or
to all Spark 2 or Spark 3 versions.  Where we draw the line depends on the
implementation, and we can decide that gradually.  This should solve the
problem of applying a general fix for multiple Spark versions.
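
One way to picture this split is the following Java sketch. It assumes a shared base class that would live in a common module plus one small subclass per Spark version; `BaseScan`, `Spark33Scan`, `Spark34Scan`, and the reader names are all hypothetical and do not correspond to existing Hudi classes.

```java
public class CommonVersusVersionSpecific {

    // Would live in a common module: Hudi-level logic that does not
    // depend on the Spark version (hypothetical class).
    static abstract class BaseScan {
        // A general fix made here reaches every Spark version at once.
        final String describe(String tablePath) {
            String normalized = tablePath.endsWith("/")
                    ? tablePath.substring(0, tablePath.length() - 1)
                    : tablePath;
            return readerName() + " scanning " + normalized;
        }

        // Version-specific hook, implemented in each per-version module.
        abstract String readerName();
    }

    // Would live in a Spark 3.3 module (hypothetical).
    static final class Spark33Scan extends BaseScan {
        String readerName() { return "Spark33Reader"; }
    }

    // Would live in a Spark 3.4 module (hypothetical).
    static final class Spark34Scan extends BaseScan {
        String readerName() { return "Spark34Reader"; }
    }

    public static void main(String[] args) {
        System.out.println(new Spark33Scan().describe("/tmp/hudi_table/"));
        System.out.println(new Spark34Scan().describe("/tmp/hudi_table/"));
    }
}
```

The trailing-slash handling stands in for any general fix: it is written once in the base class, while each per-version class carries only the part that actually differs between Spark versions.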

For Spark 3.4 integration, I suggest duplicating the code for now in a
separate module, while we figure out a better code structure for Hudi Spark
integration.

> Currently I think our integration with Spark is too tight, and brings up
> serious issues when upgrading.
> I will describe one example(however there are many more) but one area is
> we extend Spark's *ParquetFileFormat* in the following classes:
> buildReaderWithPartitionValues method


I agree that Hudi integration with Spark should not be tightly coupled.
Yet we should also acknowledge that some of the coupling exists for
functionality and performance reasons (which we should clearly document).
We should definitely revisit such coupling points and see if they can be
relaxed.

The Hoodie Parquet format classes were introduced mainly to support full
schema evolution in Spark (https://github.com/apache/hudi/pull/4910).
AFAIK, in the Spark32HoodieParquetFileFormat class, we did use the
default ParquetFileFormat logic when schema evolution is not enabled (
https://github.com/apache/hudi/pull/4910/files#diff-fc11bc10091e5312b58068f263960ef9459b1d01cf08512f33362b76f5554416R61),
meaning that it did not change any Parquet reading logic.  However, that
fallback/default behavior was removed later on.  I think we need to revisit
that decision and bring back the default behavior if that is OK.
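
The fallback described above could look roughly like this Java sketch. Readers are modelled as plain functions from a file name to a string, and `defaultReader`, `evolutionAwareReader`, and `buildReader` are hypothetical names, not the signature of Spark's or Hudi's buildReaderWithPartitionValues; the sketch only shows the stock read path staying untouched unless the feature flag is on.

```java
import java.util.function.Function;

public class SchemaEvolutionFallback {

    // Stand-in for the stock Parquet reading logic (hypothetical).
    static Function<String, String> defaultReader() {
        return file -> "default-read(" + file + ")";
    }

    // Stand-in for the customized, schema-evolution-aware logic (hypothetical).
    static Function<String, String> evolutionAwareReader() {
        return file -> "evolution-read(" + file + ")";
    }

    // Uses the customized path only when the feature is enabled;
    // otherwise defers entirely to the default reader.
    public static Function<String, String> buildReader(boolean schemaEvolutionEnabled) {
        return schemaEvolutionEnabled ? evolutionAwareReader() : defaultReader();
    }

    public static void main(String[] args) {
        System.out.println(buildReader(false).apply("part-0001.parquet"));
        System.out.println(buildReader(true).apply("part-0001.parquet"));
    }
}
```

With this shape, tables that do not use the feature keep whatever behavior the default reader has, including any upstream changes to it.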

In the future, we should understand the implications and review such code
changes carefully.

Best,
- Ethan

On Fri, Jun 2, 2023 at 12:28 AM Vinoth Chandar <vi...@apache.org> wrote:

> This is a good topic, thanks for raising this. Overall, our reliance on
> Spark classes/APIs that are declared experimental is an issue on paper. But
> there are few other ways to get the right performance without relying on
> these. This has been the tricky issue IMO. Thoughts?
>
> I'll review the code organization more carefully and report back.

Re: [DISCUSSION] Simplify code structure for supporting multiple Spark versions in Hudi

Posted by Vinoth Chandar <vi...@apache.org>.

This is a good topic, thanks for raising this. Overall, our reliance on
Spark classes/APIs that are declared experimental is an issue on paper. But
there are few other ways to get the right performance without relying on
these. This has been the tricky issue IMO. Thoughts?

I'll review the code organization more carefully and report back.

On Fri, Jun 2, 2023 at 04:23 Rahil C <rc...@gmail.com> wrote:

> Thanks, Shawn, for writing this. I would also like to add to the Spark
> discussion.
>
> Currently I think our integration with Spark is too tight, and brings up
> serious issues when upgrading.
>
> I will describe one example (there are many more): we extend Spark's
> *ParquetFileFormat* in the following classes.
>
>
> https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetFileFormat.scala
>
> https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark32PlusHoodieParquetFileFormat.scala
>
> Specifically, the main logic change is that we override the
> *buildReaderWithPartitionValues* method.
> I understand the benefit of reusing Spark's code, but the downside is that
> we then don't pick up the latest changes to the upstream implementation of
> these methods. This gets more complex because, with each Spark upgrade, we
> need to work out which Spark changes must be cherry-picked over, as in the
> issues below.
>
> For Spark 3.3.2 we faced several issues, documented here:
> https://github.com/apache/hudi/pull/8082
> For Spark 3.4.0 we have encountered several issues as well:
> https://github.com/apache/hudi/pull/8682
>
> We are also not keeping up to date with certain Spark features as a result
> of the integration we have made. I have created a JIRA that goes into this
> in more depth:
> https://issues.apache.org/jira/browse/HUDI-6262
>
> Would be happy to sync with other Hudi Spark committers/experts, or anyone
> interested in revisiting this integration so that future Spark work will be
> more achievable.
>
> Regards,
> Rahil Chertara

Re: [DISCUSSION] Simplify code structure for supporting multiple Spark versions in Hudi

Posted by Rahil C <rc...@gmail.com>.

Thanks, Shawn, for writing this. I would also like to add to the Spark
discussion.

Currently I think our integration with Spark is too tight, and brings up
serious issues when upgrading.

I will describe one example (there are many more): we extend Spark's
*ParquetFileFormat* in the following classes.

https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetFileFormat.scala
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark32PlusHoodieParquetFileFormat.scala

Specifically, the main logic change is that we override the
*buildReaderWithPartitionValues* method.
I understand the benefit of reusing Spark's code, but the downside is that
we then don't pick up the latest changes to the upstream implementation of
these methods. This gets more complex because, with each Spark upgrade, we
need to work out which Spark changes must be cherry-picked over, as in the
issues below.

For Spark 3.3.2 we faced several issues, documented here:
https://github.com/apache/hudi/pull/8082
For Spark 3.4.0 we have encountered several issues as well:
https://github.com/apache/hudi/pull/8682

We are also not keeping up to date with certain Spark features as a result
of the integration we have made. I have created a JIRA that goes into this
in more depth:
https://issues.apache.org/jira/browse/HUDI-6262

Would be happy to sync with other Hudi Spark committers/experts, or anyone
interested in revisiting this integration so that future Spark work will be
more achievable.

Regards,
Rahil Chertara
