Posted to dev@hudi.apache.org by Alexey Kudinkin <al...@onehouse.ai> on 2022/03/08 21:32:00 UTC

Unbundling "spark-avro" dependency

Hello, everyone!

While working on HUDI-3549 <https://issues.apache.org/jira/browse/HUDI-3549>,
we were surprised to discover that Hudi actually bundles the "spark-avro"
dependency *by default*.

This is problematic because "spark-avro" is tightly coupled with some of the
other Spark components that make up its core distribution (i.e. components
packaged in Spark itself rather than shipped as external packages; one
example is "spark-sql").

Regarding HUDI-3549
<https://issues.apache.org/jira/browse/HUDI-3549> itself,
the problem unfolded as follows:

   1. We built "hudi-spark-bundle" with "spark-avro" 3.2.1 bundled
   along with it
   2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
   3. It failed because "spark-avro" 3.2.1 is *not compatible* w/ "spark-sql"
   3.2.0 (due to https://github.com/apache/spark/pull/34978, which fixed a
   typo and renamed internal API methods in DataSourceUtils)


To avoid these problems going forward, our proposal is to:

   1. *Unbundle* "spark-avro" from Hudi bundles by default (practically,
   this means that Hudi users would now need to specify spark-avro via the
   `--packages` flag, since it's not part of Spark's core distribution; see
   the example invocation below)
   2. (Optional) If the community still sees value in bundling (and shading)
   "spark-avro" in some cases, we can add a Maven profile that allows doing
   that *ad hoc*.
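
For illustration, a spark-shell invocation under this proposal would look
roughly like the following (a sketch only: the bundle jar name is a
placeholder, and the spark-avro coordinates must match the Spark version
actually running on the cluster):

    # sketch: supply spark-avro explicitly, matched to the cluster's Spark version
    spark-shell \
      --packages org.apache.spark:spark-avro_2.12:3.2.0 \
      --jars hudi-spark-bundle.jar \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'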

We've put up PR #4955 <https://github.com/apache/hudi/pull/4955> with the
proposed changes.

Looking forward to your feedback.

Re: Unbundling "spark-avro" dependency

Posted by Sivabalan <n....@gmail.com>.
I too second that for existing users we should keep the same behavior. But I
would like to get some clarity on the path towards unbundling spark-avro.
Are we always going to publish only bundled artifacts (hudi-spark-bundle
with spark-avro) to Maven, and for an unbundled version ask devs to build
Hudi on their own? I don't think many would ever go that route; most will
stick to the officially released artifacts. So, if we plan to eventually
deprecate/stop bundling spark-avro, maybe we need to think this through.


-- 
Regards,
-Sivabalan

Re: Unbundling "spark-avro" dependency

Posted by Y Ethan Guo <et...@gmail.com>.
Thanks for raising the discussion.  I agree that, from a usability
standpoint on the user side, we should keep the same expectations in this
release regarding "--packages" for Spark and the reliance on bundled
spark-avro for the utilities bundle.

Given that there are Spark API changes between 3.2.0 and 3.2.1, do we also
add Spark profiles for patch versions besides the latest, e.g. 3.2.0?  If a
user has Spark 3.2.0 in their environment, they would have to upgrade both
Hudi and Spark in order to upgrade the Hudi release.  Do we know if this is
a major use case?
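
For concreteness, pinning the bundle to a specific patch version at build
time might look something like this (the profile and version property names
here are hypothetical and would need to be checked against the root pom):

    # hypothetical sketch: rebuild the Spark 3 bundle against Spark 3.2.0
    mvn clean package -DskipTests -Pspark3 -Dspark3.version=3.2.0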

Best,
- Ethan

Re: Unbundling "spark-avro" dependency

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks Alexey.

This has actually been the case for a while now, I think. From what I can
see, our quickstart for Spark still suggests passing spark-avro in via
--packages, but the utilities-bundle-related examples rely on the fact that
it is pre-bundled.

I do acknowledge that with recent Spark 3.x versions, breakages have become
much more frequent, amplifying this pain. However, to prevent jobs from
failing upon upgrade (i.e. forcing everyone to redeploy streaming and batch
jobs with the --packages flag), I would prefer that we keep the same
bundling behavior, with the following simplifications.

1. We have three Spark profiles now - spark2, spark3.1.x, and spark3
(3.2.1). We continue to bundle spark-avro and support the latest Spark minor
version (see the build sketch below).
2. We retain this behavior and make the docs clearer about how users can
"optionally" unbundle and deploy for other versions.
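
As a rough sketch, the resulting build matrix would look something like the
following (profile names taken from the list above; the exact activation
flags are an assumption and should be checked against the root pom):

    # sketch only: build hudi-spark-bundle per Spark line
    mvn clean package -DskipTests -Pspark2        # Spark 2.4.x
    mvn clean package -DskipTests -Pspark3.1.x    # Spark 3.1.x
    mvn clean package -DskipTests -Pspark3        # latest Spark 3.2.x, spark-avro bundled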

Given the other large features going out turned on by default this release,
I'm not sure it's a good idea to introduce a breaking change like this.

Thanks
Vinoth
