You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Amit Sela <am...@gmail.com> on 2016/07/25 15:09:44 UTC

[Discuss] Beam SDK (Java) providing a shaded jar as a dependency

I wanted to raise the issue of the SDK providing a shaded jar for
dependency.

AFAIK it's generally considered bad practice, though eventually it's
trade-offs. I can definitely understand why here
<https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196>
we
shade Guava for example - so the user can use it's desired Guava version -
but on the other hand, runners eventually have to serialize/deserialize the
SDK's classes sometimes.
For example: Using "Top.largest(10)" to get top 10 results, uses the SDK's
BoundedHeap, which is backed by a ReverseList which is currently not
supported by Kryo (I've submitted a PR), but even if you write-up a
ReverseListSerializer, in order to register the class you have to
explicitly state it's *repackaged* name... (I know you can use Coders to
shuffle bytes around and so avoid Kryo serializing classes)

I'm not saying that shading is completely wrong in some cases, but I would
like to know more about the considerations made, and let's not forget that
some runners (Spark for example) shade also... How risky is it for Beam to
provide such shaded artifacts ? What/How should we inform our users about
it ? i.e., Elastic published this
<https://www.elastic.co/blog/to-shade-or-not-to-shade> about their choice
for (not) shading.

Thanks,
Amit

Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Posted by Amit Sela <am...@gmail.com>.

@Dan: BounedHeap.extractOutput() returns Lists.reverse() if not empty..
Which returns an instance of a ReverseList (true, declared as List).
Falling back to the CollectionSerializer doesn't work but that's
irrelevant. I've actually checked with kryo-serializers author and he
confirmed. And there are actually a few Guava-immutable-collection-specific
serializers.... While Kryo is using reflection, which is indeed sometimes
harmful, it is used by Spark and AFAIK by Flink and Gearpump...

But this Kryo issue was just an example, and I agreed that Guava is indeed
a good reason to shade wherever possible.

I am wondering though, is there a use case for the SDK to ship by itself ?
no runner ? not even direct ?
What's the down side for the SDK not to shade ? and let the runners do
that. Unless we want to enable the SDK being provided - and then it makes
sense to shade Guava and avoid the mentioned Hadoop Guava pain.

Thanks for the thorough responses, that really helps (me at least),
Amit

On Mon, Jul 25, 2016 at 10:09 PM Dan Halperin <dh...@google.com.invalid>
wrote:

> A few reactions:
>
> * Keeping tight control over your module's API surface is *critical* for
> your users. For example, Hadoop added a public dependency on an old version
> of Guava many years ago and it has really hurt the community ever since.
> Google "hadoop guava pain" is pretty instructive, and this issue is now 3
> years old without resolution:
> https://issues.apache.org/jira/browse/HADOOP-10101
>
> * Seems like Elastic wants to put all the hard work on their customer
> rather than bother thinking about their API service. Assuming that the
> user's POM is so simple that it probably only depends on your package (as
> in their blog post) does your users a real disservice. For Beam, we should
> set the bar very high.
>
> * Kryo or other serialization libraries needing to be aware of full Guava
> path names -- this seems like a bug in the assumptions being made by Kryo
> or the user of Kryo. This is well solved by using the public API surface.
>      -> Nothing in BoundedHeap is actually declared as a ReverseList ; so
> Kryo should be able to just use the List serializer, shouldn't it?
>      -> This is what ListCoder does -- we don't have an ArrayListCoder and
> an ImmutableListCoder and a LinkedListCoder.
>   If the user actually depended on a more concrete type than List, they
> should include it in the declared types. Reflection is often harmful.
>
> * Multiple artifacts -- strongly opposed. I think it would be very
> confusing for our users if we had shaded and unshaded versions of Beam on
> maven central.
>
> Dan
>
>
> On Mon, Jul 25, 2016 at 10:58 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > The way I see this issue is that shading is a tool that we are using to
> > imperfectly implement two kinds of dependencies:
> >
> > 1. Private dependencies, which are implementation details.
> > 2. Public dependencies, which are observable through the API surface.
> >
> > The SDK and user are required to share the same instance (not just
> version)
> > of a public dependency (like joda-time). The SDK and user should not be
> > able to know if they are using the same instance of a private dependency
> -
> > not only should different versions be a non-issue, but even global data
> > should not be shared. As Luke said, reflection can push things from
> private
> > to public very easily.
> >
> > That is a bit of an idealized view. If the language & tool stack
> understood
> > these concepts then shading would be a non-issue, debuggers would be
> happy,
> > etc.
> >
> > With maven-shade-plugin, #1 is implemented via repackaging & bundling,
> > while #2 is implemented by doing nothing. I don't find the ES rationale
> > applicable here, but I'll be the first to admit that implementing these
> via
> > shading has many drawbacks: redundant configuration that can be
> > error-prone, breaks for split packages, corner cases around reflection,
> > large jars, unit tests run pre-shading, ...
> >
> > So if I reframe what you are saying, I hear a proposal to go
> > public-by-default and to only treat things as private dependencies if we
> > really see a problem.
> >
> > I tend to favor private-by-default, to keep a tight grip on what our API
> > surface really is, but I also have no problem just giving up when a
> library
> > is not "shadeable". I could also be convinced that there is such an
> > impedance mismatch between the goal and the tooling that
> public-by-default
> > is a more practical route.
> >
> > Kenn
> >
> > On Mon, Jul 25, 2016 at 9:04 AM, Amit Sela <am...@gmail.com> wrote:
> >
> > > I agree with you about the advantages of shading, but I also believe
> that
> > > shading needs to be used only where necessary.
> > >
> > > Guava is a good example for shading, since it's popular and ends-up in
> > your
> > > classpath somehow.. more than once.. unless you shade.
> > > I was wondering what's "com.google.thirdparty" and why it's shaded.
> > >
> > > From my experience, it's best to "push" the shading downstream as much
> as
> > > possible, meaning that if a runner provides a shaded artifact, it will
> do
> > > so without shaded dependencies (or as little as possible).
> > > One idea could be to "tie" the versions of dependencies between runners
> > and
> > > the SDK - the runner ends up shading according to the engine it runs
> on.
> > >
> > > Having shaded and non-shaded artifacts could work as well, though this
> is
> > > risky in the sense that the user might end-up having runtime collisions
> > if
> > > the SDK/runner/engine are not aligned. Though this is something that
> > could
> > > always eventually happen...
> > >
> > > On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <lc...@google.com.invalid>
> > > wrote:
> > >
> > > > I believe that shading is a net win because many larger projects have
> > > > hundreds of transitive dependencies and making sure that you can use
> a
> > > > complex library like Beam with another complex library like Spark or
> > > Hadoop
> > > > quickly becomes untenable without something like shading due to
> version
> > > > compatibility issues. I also believe that shading does simplify the
> > > getting
> > > > started experience for many users since we would only need to expose
> > the
> > > > dependencies which cross our API boundaries.
> > > >
> > > > It does come at the cost of dealing with libraries that don't honor
> API
> > > > boundaries (e.g. reflection, serialization, code generation
> libraries)
> > > and
> > > > finding either effective workarounds or increasing the API surface of
> > > what
> > > > is not shaded. Which is all extra work for Beam maintainers.
> > > >
> > > > Its not impossible to have a large project work with another large
> > > project
> > > > but it is also equally difficult since we give up a lot of version
> > > > compatibility freedom.
> > > >
> > > > This does not mean we can't have two artifacts, one shaded and one
> not
> > > but
> > > > if we were to have both, would this hurt portability between runners?
> > > > If you have experience maintaining another project with or without
> > > shading
> > > > like tech, would love to hear it as well.
> > > >
> > > > On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <am...@gmail.com>
> > > wrote:
> > > >
> > > > > I wanted to raise the issue of the SDK providing a shaded jar for
> > > > > dependency.
> > > > >
> > > > > AFAIK it's generally considered bad practice, though eventually
> it's
> > > > > trade-offs. I can definitely understand why here
> > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196
> > > > > >
> > > > > we
> > > > > shade Guava for example - so the user can use it's desired Guava
> > > version
> > > > -
> > > > > but on the other hand, runners eventually have to
> > serialize/deserialize
> > > > the
> > > > > SDK's classes sometimes.
> > > > > For example: Using "Top.largest(10)" to get top 10 results, uses
> the
> > > > SDK's
> > > > > BoundedHeap, which is backed by a ReverseList which is currently
> not
> > > > > supported by Kryo (I've submitted a PR), but even if you write-up a
> > > > > ReverseListSerializer, in order to register the class you have to
> > > > > explicitly state it's *repackaged* name... (I know you can use
> Coders
> > > to
> > > > > shuffle bytes around and so avoid Kryo serializing classes)
> > > > >
> > > > > I'm not saying that shading is completely wrong in some cases, but
> I
> > > > would
> > > > > like to know more about the considerations made, and let's not
> forget
> > > > that
> > > > > some runners (Spark for example) shade also... How risky is it for
> > Beam
> > > > to
> > > > > provide such shaded artifacts ? What/How should we inform our users
> > > about
> > > > > it ? i.e., Elastic published this
> > > > > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their
> > > > choice
> > > > > for (not) shading.
> > > > >
> > > > > Thanks,
> > > > > Amit
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Posted by Dan Halperin <dh...@google.com.INVALID>.

A few reactions:

* Keeping tight control over your module's API surface is *critical* for
your users. For example, Hadoop added a public dependency on an old version
of Guava many years ago and it has really hurt the community ever since.
Google "hadoop guava pain" is pretty instructive, and this issue is now 3
years old without resolution:
https://issues.apache.org/jira/browse/HADOOP-10101

* Seems like Elastic wants to put all the hard work on their customer
rather than bother thinking about their API service. Assuming that the
user's POM is so simple that it probably only depends on your package (as
in their blog post) does your users a real disservice. For Beam, we should
set the bar very high.

* Kryo or other serialization libraries needing to be aware of full Guava
path names -- this seems like a bug in the assumptions being made by Kryo
or the user of Kryo. This is well solved by using the public API surface.
     -> Nothing in BoundedHeap is actually declared as a ReverseList ; so
Kryo should be able to just use the List serializer, shouldn't it?
     -> This is what ListCoder does -- we don't have an ArrayListCoder and
an ImmutableListCoder and a LinkedListCoder.
  If the user actually depended on a more concrete type than List, they
should include it in the declared types. Reflection is often harmful.

* Multiple artifacts -- strongly opposed. I think it would be very
confusing for our users if we had shaded and unshaded versions of Beam on
maven central.

Dan


On Mon, Jul 25, 2016 at 10:58 AM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> The way I see this issue is that shading is a tool that we are using to
> imperfectly implement two kinds of dependencies:
>
> 1. Private dependencies, which are implementation details.
> 2. Public dependencies, which are observable through the API surface.
>
> The SDK and user are required to share the same instance (not just version)
> of a public dependency (like joda-time). The SDK and user should not be
> able to know if they are using the same instance of a private dependency -
> not only should different versions be a non-issue, but even global data
> should not be shared. As Luke said, reflection can push things from private
> to public very easily.
>
> That is a bit of an idealized view. If the language & tool stack understood
> these concepts then shading would be a non-issue, debuggers would be happy,
> etc.
>
> With maven-shade-plugin, #1 is implemented via repackaging & bundling,
> while #2 is implemented by doing nothing. I don't find the ES rationale
> applicable here, but I'll be the first to admit that implementing these via
> shading has many drawbacks: redundant configuration that can be
> error-prone, breaks for split packages, corner cases around reflection,
> large jars, unit tests run pre-shading, ...
>
> So if I reframe what you are saying, I hear a proposal to go
> public-by-default and to only treat things as private dependencies if we
> really see a problem.
>
> I tend to favor private-by-default, to keep a tight grip on what our API
> surface really is, but I also have no problem just giving up when a library
> is not "shadeable". I could also be convinced that there is such an
> impedance mismatch between the goal and the tooling that public-by-default
> is a more practical route.
>
> Kenn
>
> On Mon, Jul 25, 2016 at 9:04 AM, Amit Sela <am...@gmail.com> wrote:
>
> > I agree with you about the advantages of shading, but I also believe that
> > shading needs to be used only where necessary.
> >
> > Guava is a good example for shading, since it's popular and ends-up in
> your
> > classpath somehow.. more than once.. unless you shade.
> > I was wondering what's "com.google.thirdparty" and why it's shaded.
> >
> > From my experience, it's best to "push" the shading downstream as much as
> > possible, meaning that if a runner provides a shaded artifact, it will do
> > so without shaded dependencies (or as little as possible).
> > One idea could be to "tie" the versions of dependencies between runners
> and
> > the SDK - the runner ends up shading according to the engine it runs on.
> >
> > Having shaded and non-shaded artifacts could work as well, though this is
> > risky in the sense that the user might end-up having runtime collisions
> if
> > the SDK/runner/engine are not aligned. Though this is something that
> could
> > always eventually happen...
> >
> > On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> >
> > > I believe that shading is a net win because many larger projects have
> > > hundreds of transitive dependencies and making sure that you can use a
> > > complex library like Beam with another complex library like Spark or
> > Hadoop
> > > quickly becomes untenable without something like shading due to version
> > > compatibility issues. I also believe that shading does simplify the
> > getting
> > > started experience for many users since we would only need to expose
> the
> > > dependencies which cross our API boundaries.
> > >
> > > It does come at the cost of dealing with libraries that don't honor API
> > > boundaries (e.g. reflection, serialization, code generation libraries)
> > and
> > > finding either effective workarounds or increasing the API surface of
> > what
> > > is not shaded. Which is all extra work for Beam maintainers.
> > >
> > > Its not impossible to have a large project work with another large
> > project
> > > but it is also equally difficult since we give up a lot of version
> > > compatibility freedom.
> > >
> > > This does not mean we can't have two artifacts, one shaded and one not
> > but
> > > if we were to have both, would this hurt portability between runners?
> > > If you have experience maintaining another project with or without
> > shading
> > > like tech, would love to hear it as well.
> > >
> > > On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <am...@gmail.com>
> > wrote:
> > >
> > > > I wanted to raise the issue of the SDK providing a shaded jar for
> > > > dependency.
> > > >
> > > > AFAIK it's generally considered bad practice, though eventually it's
> > > > trade-offs. I can definitely understand why here
> > > > <
> > > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196
> > > > >
> > > > we
> > > > shade Guava for example - so the user can use it's desired Guava
> > version
> > > -
> > > > but on the other hand, runners eventually have to
> serialize/deserialize
> > > the
> > > > SDK's classes sometimes.
> > > > For example: Using "Top.largest(10)" to get top 10 results, uses the
> > > SDK's
> > > > BoundedHeap, which is backed by a ReverseList which is currently not
> > > > supported by Kryo (I've submitted a PR), but even if you write-up a
> > > > ReverseListSerializer, in order to register the class you have to
> > > > explicitly state it's *repackaged* name... (I know you can use Coders
> > to
> > > > shuffle bytes around and so avoid Kryo serializing classes)
> > > >
> > > > I'm not saying that shading is completely wrong in some cases, but I
> > > would
> > > > like to know more about the considerations made, and let's not forget
> > > that
> > > > some runners (Spark for example) shade also... How risky is it for
> Beam
> > > to
> > > > provide such shaded artifacts ? What/How should we inform our users
> > about
> > > > it ? i.e., Elastic published this
> > > > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their
> > > choice
> > > > for (not) shading.
> > > >
> > > > Thanks,
> > > > Amit
> > > >
> > >
> >
>

Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Posted by Kenneth Knowles <kl...@google.com.INVALID>.

The way I see this issue is that shading is a tool that we are using to
imperfectly implement two kinds of dependencies:

1. Private dependencies, which are implementation details.
2. Public dependencies, which are observable through the API surface.

The SDK and user are required to share the same instance (not just version)
of a public dependency (like joda-time). The SDK and user should not be
able to know if they are using the same instance of a private dependency -
not only should different versions be a non-issue, but even global data
should not be shared. As Luke said, reflection can push things from private
to public very easily.

That is a bit of an idealized view. If the language & tool stack understood
these concepts then shading would be a non-issue, debuggers would be happy,
etc.

With maven-shade-plugin, #1 is implemented via repackaging & bundling,
while #2 is implemented by doing nothing. I don't find the ES rationale
applicable here, but I'll be the first to admit that implementing these via
shading has many drawbacks: redundant configuration that can be
error-prone, breaks for split packages, corner cases around reflection,
large jars, unit tests run pre-shading, ...

So if I reframe what you are saying, I hear a proposal to go
public-by-default and to only treat things as private dependencies if we
really see a problem.

I tend to favor private-by-default, to keep a tight grip on what our API
surface really is, but I also have no problem just giving up when a library
is not "shadeable". I could also be convinced that there is such an
impedance mismatch between the goal and the tooling that public-by-default
is a more practical route.

Kenn

On Mon, Jul 25, 2016 at 9:04 AM, Amit Sela <am...@gmail.com> wrote:

> I agree with you about the advantages of shading, but I also believe that
> shading needs to be used only where necessary.
>
> Guava is a good example for shading, since it's popular and ends-up in your
> classpath somehow.. more than once.. unless you shade.
> I was wondering what's "com.google.thirdparty" and why it's shaded.
>
> From my experience, it's best to "push" the shading downstream as much as
> possible, meaning that if a runner provides a shaded artifact, it will do
> so without shaded dependencies (or as little as possible).
> One idea could be to "tie" the versions of dependencies between runners and
> the SDK - the runner ends up shading according to the engine it runs on.
>
> Having shaded and non-shaded artifacts could work as well, though this is
> risky in the sense that the user might end-up having runtime collisions if
> the SDK/runner/engine are not aligned. Though this is something that could
> always eventually happen...
>
> On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> > I believe that shading is a net win because many larger projects have
> > hundreds of transitive dependencies and making sure that you can use a
> > complex library like Beam with another complex library like Spark or
> Hadoop
> > quickly becomes untenable without something like shading due to version
> > compatibility issues. I also believe that shading does simplify the
> getting
> > started experience for many users since we would only need to expose the
> > dependencies which cross our API boundaries.
> >
> > It does come at the cost of dealing with libraries that don't honor API
> > boundaries (e.g. reflection, serialization, code generation libraries)
> and
> > finding either effective workarounds or increasing the API surface of
> what
> > is not shaded. Which is all extra work for Beam maintainers.
> >
> > Its not impossible to have a large project work with another large
> project
> > but it is also equally difficult since we give up a lot of version
> > compatibility freedom.
> >
> > This does not mean we can't have two artifacts, one shaded and one not
> but
> > if we were to have both, would this hurt portability between runners?
> > If you have experience maintaining another project with or without
> shading
> > like tech, would love to hear it as well.
> >
> > On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <am...@gmail.com>
> wrote:
> >
> > > I wanted to raise the issue of the SDK providing a shaded jar for
> > > dependency.
> > >
> > > AFAIK it's generally considered bad practice, though eventually it's
> > > trade-offs. I can definitely understand why here
> > > <
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196
> > > >
> > > we
> > > shade Guava for example - so the user can use it's desired Guava
> version
> > -
> > > but on the other hand, runners eventually have to serialize/deserialize
> > the
> > > SDK's classes sometimes.
> > > For example: Using "Top.largest(10)" to get top 10 results, uses the
> > SDK's
> > > BoundedHeap, which is backed by a ReverseList which is currently not
> > > supported by Kryo (I've submitted a PR), but even if you write-up a
> > > ReverseListSerializer, in order to register the class you have to
> > > explicitly state it's *repackaged* name... (I know you can use Coders
> to
> > > shuffle bytes around and so avoid Kryo serializing classes)
> > >
> > > I'm not saying that shading is completely wrong in some cases, but I
> > would
> > > like to know more about the considerations made, and let's not forget
> > that
> > > some runners (Spark for example) shade also... How risky is it for Beam
> > to
> > > provide such shaded artifacts ? What/How should we inform our users
> about
> > > it ? i.e., Elastic published this
> > > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their
> > choice
> > > for (not) shading.
> > >
> > > Thanks,
> > > Amit
> > >
> >
>

Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Posted by Amit Sela <am...@gmail.com>.

I agree with you about the advantages of shading, but I also believe that
shading needs to be used only where necessary.

Guava is a good example for shading, since it's popular and ends-up in your
classpath somehow.. more than once.. unless you shade.
I was wondering what's "com.google.thirdparty" and why it's shaded.

From my experience, it's best to "push" the shading downstream as much as
possible, meaning that if a runner provides a shaded artifact, it will do
so without shaded dependencies (or as little as possible).
One idea could be to "tie" the versions of dependencies between runners and
the SDK - the runner ends up shading according to the engine it runs on.

Having shaded and non-shaded artifacts could work as well, though this is
risky in the sense that the user might end-up having runtime collisions if
the SDK/runner/engine are not aligned. Though this is something that could
always eventually happen...

On Mon, Jul 25, 2016 at 6:36 PM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> I believe that shading is a net win because many larger projects have
> hundreds of transitive dependencies and making sure that you can use a
> complex library like Beam with another complex library like Spark or Hadoop
> quickly becomes untenable without something like shading due to version
> compatibility issues. I also believe that shading does simplify the getting
> started experience for many users since we would only need to expose the
> dependencies which cross our API boundaries.
>
> It does come at the cost of dealing with libraries that don't honor API
> boundaries (e.g. reflection, serialization, code generation libraries) and
> finding either effective workarounds or increasing the API surface of what
> is not shaded. Which is all extra work for Beam maintainers.
>
> Its not impossible to have a large project work with another large project
> but it is also equally difficult since we give up a lot of version
> compatibility freedom.
>
> This does not mean we can't have two artifacts, one shaded and one not but
> if we were to have both, would this hurt portability between runners?
> If you have experience maintaining another project with or without shading
> like tech, would love to hear it as well.
>
> On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <am...@gmail.com> wrote:
>
> > I wanted to raise the issue of the SDK providing a shaded jar for
> > dependency.
> >
> > AFAIK it's generally considered bad practice, though eventually it's
> > trade-offs. I can definitely understand why here
> > <
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196
> > >
> > we
> > shade Guava for example - so the user can use it's desired Guava version
> -
> > but on the other hand, runners eventually have to serialize/deserialize
> the
> > SDK's classes sometimes.
> > For example: Using "Top.largest(10)" to get top 10 results, uses the
> SDK's
> > BoundedHeap, which is backed by a ReverseList which is currently not
> > supported by Kryo (I've submitted a PR), but even if you write-up a
> > ReverseListSerializer, in order to register the class you have to
> > explicitly state it's *repackaged* name... (I know you can use Coders to
> > shuffle bytes around and so avoid Kryo serializing classes)
> >
> > I'm not saying that shading is completely wrong in some cases, but I
> would
> > like to know more about the considerations made, and let's not forget
> that
> > some runners (Spark for example) shade also... How risky is it for Beam
> to
> > provide such shaded artifacts ? What/How should we inform our users about
> > it ? i.e., Elastic published this
> > <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their
> choice
> > for (not) shading.
> >
> > Thanks,
> > Amit
> >
>

Re: [Discuss] Beam SDK (Java) providing a shaded jar as a dependency

Posted by Lukasz Cwik <lc...@google.com.INVALID>.

I believe that shading is a net win because many larger projects have
hundreds of transitive dependencies and making sure that you can use a
complex library like Beam with another complex library like Spark or Hadoop
quickly becomes untenable without something like shading due to version
compatibility issues. I also believe that shading does simplify the getting
started experience for many users since we would only need to expose the
dependencies which cross our API boundaries.

It does come at the cost of dealing with libraries that don't honor API
boundaries (e.g. reflection, serialization, code generation libraries) and
finding either effective workarounds or increasing the API surface of what
is not shaded. Which is all extra work for Beam maintainers.

Its not impossible to have a large project work with another large project
but it is also equally difficult since we give up a lot of version
compatibility freedom.

This does not mean we can't have two artifacts, one shaded and one not but
if we were to have both, would this hurt portability between runners?
If you have experience maintaining another project with or without shading
like tech, would love to hear it as well.

On Mon, Jul 25, 2016 at 11:09 AM, Amit Sela <am...@gmail.com> wrote:

> I wanted to raise the issue of the SDK providing a shaded jar for
> dependency.
>
> AFAIK it's generally considered bad practice, though eventually it's
> trade-offs. I can definitely understand why here
> <
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/pom.xml#L196
> >
> we
> shade Guava for example - so the user can use it's desired Guava version -
> but on the other hand, runners eventually have to serialize/deserialize the
> SDK's classes sometimes.
> For example: Using "Top.largest(10)" to get top 10 results, uses the SDK's
> BoundedHeap, which is backed by a ReverseList which is currently not
> supported by Kryo (I've submitted a PR), but even if you write-up a
> ReverseListSerializer, in order to register the class you have to
> explicitly state it's *repackaged* name... (I know you can use Coders to
> shuffle bytes around and so avoid Kryo serializing classes)
>
> I'm not saying that shading is completely wrong in some cases, but I would
> like to know more about the considerations made, and let's not forget that
> some runners (Spark for example) shade also... How risky is it for Beam to
> provide such shaded artifacts ? What/How should we inform our users about
> it ? i.e., Elastic published this
> <https://www.elastic.co/blog/to-shade-or-not-to-shade> about their choice
> for (not) shading.
>
> Thanks,
> Amit
>