You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Ismaël Mejía <ie...@gmail.com> on 2017/12/11 16:35:56 UTC

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Hello, I wanted to bring back this subject because I think we should
take action on this and at least first have a shaded version of guava.
I was playing with a toy project and I did the procedure we use to
submit jars to a Hadoop cluster via Flink/Spark which involves
creating an uber jar and I realized that the size of the jar was way
bigger than I expected, and the fact that we shade guava in every
module contributes to this. I found guava shaded on:

sdks/java/core
runners/core-construction-java
runners/core-java
model/job-management
runners/spark
sdks/java/io/hadoop-file-system
sdks/java/io/kafka

This means at least 6 times more of the size it should which counts in
around 15MB more (2.4MB*6 deps) of extra weight that we can simply
reduce by using a shaded version of the library.

I add this point to the previous ones mentioned during the discussion
because this goes against the end user experience and affects us all
(devs/users).

Another question is if we should shade (and how) protocol buffers
because now with the portability work we are exposing it closer to the
end users. I say this because I also found an issue while running a
job on YARN with the spark runner because hadoop-common includes
protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
a dependency to be able to use triggers (note the Spark runner
translates them using some method from runners/core-java). Since
hadoop-common is provided in the cluster with the older version of
protobuf, I am afraid that this will bite us in the future.

Ismaël

ps. There is already a JIRA for that shading for protobuf on
hadoop-common but this is not coming until version 3 is out.
https://issues.apache.org/jira/browse/HADOOP-13136

ps2. Extra curious situation is to see that the dataflow-runner ends
up having guava shaded twice via its shaded version on
core-construction-java.

ps3. Of course this message means a de-facto +1 at least to do it for
guava and evaluate it for other libs.


On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> An issue to call out is how to deal with our generated code (.avro and
> .proto) as I don't believe those plugins allow you to generate code using a
> shaded package prefix on imports.
>
> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid>
> wrote:
>
>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>> every time for dependencies we know we'll use.
>>
>> This also means that if we end up pulling in transitive dependencies we
>> don't want in any particular module we can avoid having to adjust our
>> repackaging strategy for that module - which I have run into face-first in
>> the past.
>>
>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <kl...@google.com.invalid>
>> wrote:
>>
>> > Hi all,
>> >
>> > Shading is a big part of how we keep our dependencies sane in Beam. But
>> > downsides: shading is super slow, causes massive jar bloat, and kind of
>> > hard to get right because artifacts and namespaces are not 1-to-1.
>> >
>> > I know that some communities distribute their own shaded distributions of
>> > dependencies. I had a thought about doing something similar that I wanted
>> > to throw out there for people to poke holes in.
>> >
>> > To set the scene, here is how I view shading:
>> >
>> >  - A module has public dependencies and private dependencies.
>> >  - Public deps are used for data interchange; users must share these
>> deps.
>> >  - Private deps are just functionality and can be hidden (in our case,
>> > relocated + bundled)
>> >  - It isn't necessarily that simple, because public and private deps
>> might
>> > interact in higher-order ways ("public" is contagious)
>> >
>> > Shading is an implementation detail of expressing these characteristics.
>> We
>> > use shading selectively because of its downsides I mentioned above.
>> >
>> > But what about this idea: Introduce shaded deps as a single separate
>> > artifact.
>> >
>> >  - sdks/java/private-deps: bundled uber jar with relocated versions of
>> > everything we want to shade
>> >
>> >  - sdks/java/core and sdks/java/harness: no relocation or bundling -
>> > depends on `beam-sdks-java-private-deps` and imports like
>> > `org.apache.beam.sdk.private.com.google.common` directly (this is what
>> > they
>> > are rewritten to
>> >
>> > Some benefits
>> >
>> >  - much faster builds of other modules
>> >  - only one shaded uber jar
>> >  - rare/no rebuilds of the uber jar
>> >  - can use maven enforcer to forbid imports like com.google.common
>> >  - configuration all in one place
>> >  - no automated rewriting of our real code, which has led to some major
>> > confusion
>> >  - easy to implement incrementally
>> >
>> > Downsides:
>> >
>> >  - plenty of effort work to get there
>> >  - unclear how many different such deps modules we need; sharing them
>> could
>> > get weird
>> >  - if we hit a roadblock, we will have committed a lot of time
>> >
>> > Just something I was musing as I spent another evening waiting for slow
>> > builds to try to confirm changes to brittle poms.
>> >
>> > Kenn
>> >
>>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Romain Manni-Bucau <rm...@gmail.com>.
well we can disagree on the code - it is fine ;), but the needed part of it
by beam is not huge and in any case it can be forked without requiring 10
classes - if so we'll use another impl than the guava one ;). This is the
whole point.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-02 17:24 GMT+01:00 Reuven Lax <re...@google.com>:

> TypeToken is not trivial. I've written code to do what TypeToken does
> before (figuring out generic ancestor types). It's actually somewhat
> tricky, and the code I wrote had subtle bugs in it; eventually we removed
> this code in favor of Guava's implementation :)
>
> On Fri, Feb 2, 2018 at 7:47 AM, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Yep, note I never said to reinvent the wheel, we can copy it from guava,
>> openwebbeans or any other impl. Point was more to avoid to depend on
>> something we don't own for that which is after all not that much code. I
>> also think we can limit it a lot to align it on what is supported by beam
>> (I'm thinking to coders here) but this can be another topic.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-02 16:33 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>
>>> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Don't forget beam doesn't support much behind it (mainly only a few
>>>> ParameterizedType due the usage code path) so it is mainly only about
>>>> handling parameterized types and typevariables recursively. Not a lot of
>>>> work. Main concern being it is in the API so using a shade as an API is
>>>> never a good idea, in particular cause the shade can be broken and requires
>>>> to setup clirr or things like that and when it breaks you are done and need
>>>> to fork it anyway. Limiting the dependencies for an API - as beam is - is
>>>> always saner even if it requires a small fork of code.
>>>>
>>>
>>> The main thing that TypeToken is used for is capturing generics that are
>>> lost by Java reflection. It is a bit tricky, actually.
>>>
>>> Kenn
>>>
>>>
>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>>>
>>>>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>>>>>
>>>>>>> Another couple:
>>>>>>>
>>>>>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>>>>>> TypeToken
>>>>>>>
>>>>>>
>>>>>> Technically reflect Type is enough
>>>>>>
>>>>>
>>>>> If you want to try to remove TypeToken from underneath TypeDescriptor,
>>>>> I have no objections as long as you expand the test suite to cover all the
>>>>> functionality where we trust TypeToken's tests. Good luck :-)
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>  - ImmutableList and friends and their builders are very widely used
>>>>>>> and IMO still add a lot for readability and preventing someone coming along
>>>>>>> and adding mistakes to a codebase
>>>>>>>
>>>>>>
>>>>>> Sugar but not required. When you compare the cost of a shade versus
>>>>>> of duplicating the parts we need there is no real match IMHO.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> So considering it all, I would keep a vendored Guava (but also move
>>>>>>> off where we can, and also have our own improvements). I hope it will be a
>>>>>>> near-empty build file to generate it so not a maintenance burden.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Nice. This sounds like a great idea (or two?) and goes along with
>>>>>>>> what I just started for futures.
>>>>>>>>
>>>>>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>>>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>>>>>
>>>>>>>> Specific things for our micro guava:
>>>>>>>>
>>>>>>>>  - checkArgumentNotNull can throw the right exception
>>>>>>>>  - our own Optional because Java's is not Serializable
>>>>>>>>  - futures combinators since many are missing, especially Java's
>>>>>>>> don't do exceptions right
>>>>>>>>
>>>>>>>> Protobuf: didn't file an issue because I'm not sure
>>>>>>>>
>>>>>>>> I was wondering if pre-shading works. We really need to get rid of
>>>>>>>> it from public APIs in a 100% reliable way. It is also a problem for
>>>>>>>> Dataflow. I was wondering if one approach is to pre-shade
>>>>>>>> gcpio-protobuf-java, gcpio-protobuf-java-util, gcpio-grpc-java, etc.
>>>>>>>> Anything that needs to take a Message object. (and do the same for
>>>>>>>> beam-model-protobuf-java since the model bits have to depend on each other
>>>>>>>> but nothing else).
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Huge +1 to get rid of Guava!
>>>>>>>>>
>>>>>>>>> This solves annoying dependency issues for some IOs and allow us to
>>>>>>>>> get rid of the shading that makes current jars bigger than they
>>>>>>>>> should
>>>>>>>>> be.
>>>>>>>>>
>>>>>>>>> We can create our own 'micro guava' package with some classes for
>>>>>>>>> things that are hard to migrate, or that we  prefer to still have
>>>>>>>>> like
>>>>>>>>> the check* methods for example. Given the size of the task we
>>>>>>>>> should
>>>>>>>>> probably divide it into subtasks, more important is to get rid of
>>>>>>>>> it
>>>>>>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>>>>>>
>>>>>>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>>>>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>>>>>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>>>>>>>> that is byting us for the Spark runner, but well that's worth a
>>>>>>>>> different discussion.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>>>>>>> <rm...@gmail.com> wrote:
>>>>>>>>> > a map of list is fine and not a challenge we'll face long I hope
>>>>>>>>> ;)
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Romain Manni-Bucau
>>>>>>>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>>>>>>>> >
>>>>>>>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>>>> >>
>>>>>>>>> >> Not sure we'll be able to replace them all. Things like guava
>>>>>>>>> Table and
>>>>>>>>> >> Multimap don't have great replacements in Java8.
>>>>>>>>> >>
>>>>>>>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>>>>>>>> jb@nanthrax.net>
>>>>>>>>> >> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>>>>>>>> >>>
>>>>>>>>> >>> Regards
>>>>>>>>> >>> JB
>>>>>>>>> >>>
>>>>>>>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>>>>>>>> >>> > Why not dropping guava for all beam codebase? With java 8 it
>>>>>>>>> is quite
>>>>>>>>> >>> > easy to do
>>>>>>>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with
>>>>>>>>> quite a
>>>>>>>>> >>> > good result.
>>>>>>>>> >>> >
>>>>>>>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>>>>>>>> >>> > <ma...@google.com>> a écrit :
>>>>>>>>> >>> >
>>>>>>>>> >>> >     Make sure to include the guava version in the artifact
>>>>>>>>> name so that
>>>>>>>>> >>> > we can
>>>>>>>>> >>> >     have multiple vendored versions.
>>>>>>>>> >>> >
>>>>>>>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>>>>>>>> klk@google.com
>>>>>>>>> >>> >     <ma...@google.com>> wrote:
>>>>>>>>> >>> >
>>>>>>>>> >>> >         I didn't have time for this, but it just bit me. We
>>>>>>>>> definitely
>>>>>>>>> >>> > have
>>>>>>>>> >>> >         Guava on the API surface of runner support code in
>>>>>>>>> ways that
>>>>>>>>> >>> > get
>>>>>>>>> >>> >         incompatibly shaded. I will probably start "1a" by
>>>>>>>>> making a
>>>>>>>>> >>> > shaded
>>>>>>>>> >>> >         library org.apache.beam:vendored-guava and starting
>>>>>>>>> to use it.
>>>>>>>>> >>> > It sounds
>>>>>>>>> >>> >         like there is generally unanimous support for that
>>>>>>>>> much,
>>>>>>>>> >>> > anyhow.
>>>>>>>>> >>> >
>>>>>>>>> >>> >         Kenn
>>>>>>>>> >>> >
>>>>>>>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>>>>>>>> >>> > <aljoscha@apache.org
>>>>>>>>> >>> >         <ma...@apache.org>> wrote:
>>>>>>>>> >>> >
>>>>>>>>> >>> >             Thanks Ismaël for bringing up this discussion
>>>>>>>>> again!
>>>>>>>>> >>> >
>>>>>>>>> >>> >             I would be in favour of 1) and more specifically
>>>>>>>>> of 1a)
>>>>>>>>> >>> >
>>>>>>>>> >>> >             Aljoscha
>>>>>>>>> >>> >
>>>>>>>>> >>> >
>>>>>>>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>>>>>>>> lcwik@google.com
>>>>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>             You can always run tests on post shaded
>>>>>>>>> artifacts instead
>>>>>>>>> >>> >> of the
>>>>>>>>> >>> >>             compiled classes, it just requires us to change
>>>>>>>>> our maven
>>>>>>>>> >>> >> surefire
>>>>>>>>> >>> >>             / gradle test configurations but it is true
>>>>>>>>> that most IDEs
>>>>>>>>> >>> >> would
>>>>>>>>> >>> >>             behave better with a dependency jar unless you
>>>>>>>>> delegate
>>>>>>>>> >>> >> all the
>>>>>>>>> >>> >>             build/test actions to the build system and then
>>>>>>>>> it won't
>>>>>>>>> >>> >> matter.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>>>>>>>> >>> >> <klk@google.com
>>>>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 There's also, with additional overhead,
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 1a) A relocated and shipped package for
>>>>>>>>> each thing we
>>>>>>>>> >>> >> want to
>>>>>>>>> >>> >>                 relocate. I think this has also been tried
>>>>>>>>> outside
>>>>>>>>> >>> >> Beam...
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 Pros:
>>>>>>>>> >>> >>                 * all the pros of 1) plus no bloat beyond
>>>>>>>>> what is
>>>>>>>>> >>> >> necessary
>>>>>>>>> >>> >>                 Cons:
>>>>>>>>> >>> >>                 * abandons whitelist approach for public
>>>>>>>>> deps,
>>>>>>>>> >>> >> reverting to
>>>>>>>>> >>> >>                 blacklist approach for trouble things like
>>>>>>>>> guava, so a
>>>>>>>>> >>> >> bit
>>>>>>>>> >>> >>                 less principled
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 For both 1) and 1a) I would add:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 Pros:
>>>>>>>>> >>> >>                 * clearly readable dependency since code
>>>>>>>>> will `import
>>>>>>>>> >>> >>                 org.apache.beam.private.guava21` and IDEs
>>>>>>>>> will
>>>>>>>>> >>> >> understand it
>>>>>>>>> >>> >>                 is a distinct lilbrary
>>>>>>>>> >>> >>                 * can run tests on unpackaged classes, as
>>>>>>>>> long as deps
>>>>>>>>> >>> >> are
>>>>>>>>> >>> >>                 shaded or provided as jars
>>>>>>>>> >>> >>                 * no mysterious action at a distance from
>>>>>>>>> inherited
>>>>>>>>> >>> >> configuration
>>>>>>>>> >>> >>                 Cons:
>>>>>>>>> >>> >>                 * need to adjust imports
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 Kenn
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>>>>>>>> >>> >> <lcwik@google.com
>>>>>>>>> >>> >>                 <ma...@google.com>> wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                     I would suggest that either we use:
>>>>>>>>> >>> >>                     1) A common deps package containing
>>>>>>>>> shaded
>>>>>>>>> >>> >> dependencies
>>>>>>>>> >>> >>                     allows for
>>>>>>>>> >>> >>                     Pros
>>>>>>>>> >>> >>                     * doesn't require the user to build an
>>>>>>>>> uber jar
>>>>>>>>> >>> >>                     Risks
>>>>>>>>> >>> >>                     * dependencies package will keep
>>>>>>>>> growing even if
>>>>>>>>> >>> >> something
>>>>>>>>> >>> >>                     is or isn't needed by all of Apache
>>>>>>>>> Beam leading
>>>>>>>>> >>> >> to a
>>>>>>>>> >>> >>                     large jar anyways negating any space
>>>>>>>>> savings
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                     2) Shade within each module to a common
>>>>>>>>> location
>>>>>>>>> >>> >> like
>>>>>>>>> >>> >>                     org.apache.beam.relocated.guava....
>>>>>>>>> >>> >>                     Pros
>>>>>>>>> >>> >>                     * you only get the shaded dependencies
>>>>>>>>> of the
>>>>>>>>> >>> >> things that
>>>>>>>>> >>> >>                     are required
>>>>>>>>> >>> >>                     * its one less dependency for users to
>>>>>>>>> manage
>>>>>>>>> >>> >>                     Risks
>>>>>>>>> >>> >>                     * requires an uber jar to be built to
>>>>>>>>> get the
>>>>>>>>> >>> >> space
>>>>>>>>> >>> >>                     savings (either by a user or a
>>>>>>>>> distribution of
>>>>>>>>> >>> >> Apache
>>>>>>>>> >>> >>                     Beam) otherwise we negate any space
>>>>>>>>> savings.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                     If we either use a common relocation
>>>>>>>>> scheme or a
>>>>>>>>> >>> >>                     dependencies jar then each relocation
>>>>>>>>> should
>>>>>>>>> >>> >> specifically
>>>>>>>>> >>> >>                     contain the version number of the
>>>>>>>>> package because
>>>>>>>>> >>> >> we would
>>>>>>>>> >>> >>                     like to allow for us to be using two
>>>>>>>>> different
>>>>>>>>> >>> >> versions of
>>>>>>>>> >>> >>                     the same library.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                     For the common deps package approach,
>>>>>>>>> should we
>>>>>>>>> >>> >> check in
>>>>>>>>> >>> >>                     code where the imports contain the
>>>>>>>>> relocated
>>>>>>>>> >>> >> location
>>>>>>>>> >>> >>                     (e.g. import
>>>>>>>>> org.apache.beam.guava.20.0.com
>>>>>>>>> >>> >>
>>>>>>>>> >>> >> <http://org.apache.beam.guava.20.0.com/
>>>>>>>>> >.google.common.collect.ImmutableList)?
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM,
>>>>>>>>> Jean-Baptiste
>>>>>>>>> >>> >> Onofré
>>>>>>>>> >>> >>                     <jb@nanthrax.net <mailto:
>>>>>>>>> jb@nanthrax.net>> wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                         Thanks for bringing that back.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                         Indeed guava is shaded in different
>>>>>>>>> uber-jar.
>>>>>>>>> >>> >> Maybe we
>>>>>>>>> >>> >>                         can have a common deps module that
>>>>>>>>> we include
>>>>>>>>> >>> >> once
>>>>>>>>> >>> >>                         (but the user will have to
>>>>>>>>> explicitly define
>>>>>>>>> >>> >> the dep) ?
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                         Basically, what do you propose for
>>>>>>>>> protobuf
>>>>>>>>> >>> >>                         (unfortunately, I don't see an
>>>>>>>>> obvious) ?
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                         Regards
>>>>>>>>> >>> >>                         JB
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël
>>>>>>>>> Mejía wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             Hello, I wanted to bring back
>>>>>>>>> this subject
>>>>>>>>> >>> >> because
>>>>>>>>> >>> >>                             I think we should
>>>>>>>>> >>> >>                             take action on this and at
>>>>>>>>> least first
>>>>>>>>> >>> >> have a
>>>>>>>>> >>> >>                             shaded version of guava.
>>>>>>>>> >>> >>                             I was playing with a toy
>>>>>>>>> project and I did
>>>>>>>>> >>> >> the
>>>>>>>>> >>> >>                             procedure we use to
>>>>>>>>> >>> >>                             submit jars to a Hadoop cluster
>>>>>>>>> via
>>>>>>>>> >>> >> Flink/Spark
>>>>>>>>> >>> >>                             which involves
>>>>>>>>> >>> >>                             creating an uber jar and I
>>>>>>>>> realized that
>>>>>>>>> >>> >> the size
>>>>>>>>> >>> >>                             of the jar was way
>>>>>>>>> >>> >>                             bigger than I expected, and the
>>>>>>>>> fact that
>>>>>>>>> >>> >> we shade
>>>>>>>>> >>> >>                             guava in every
>>>>>>>>> >>> >>                             module contributes to this. I
>>>>>>>>> found guava
>>>>>>>>> >>> >> shaded on:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             sdks/java/core
>>>>>>>>> >>> >>                             runners/core-construction-java
>>>>>>>>> >>> >>                             runners/core-java
>>>>>>>>> >>> >>                             model/job-management
>>>>>>>>> >>> >>                             runners/spark
>>>>>>>>> >>> >>                             sdks/java/io/hadoop-file-system
>>>>>>>>> >>> >>                             sdks/java/io/kafka
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             This means at least 6 times
>>>>>>>>> more of the
>>>>>>>>> >>> >> size it
>>>>>>>>> >>> >>                             should which counts in
>>>>>>>>> >>> >>                             around 15MB more (2.4MB*6 deps)
>>>>>>>>> of extra
>>>>>>>>> >>> >> weight
>>>>>>>>> >>> >>                             that we can simply
>>>>>>>>> >>> >>                             reduce by using a shaded
>>>>>>>>> version of the
>>>>>>>>> >>> >> library.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             I add this point to the
>>>>>>>>> previous ones
>>>>>>>>> >>> >> mentioned
>>>>>>>>> >>> >>                             during the discussion
>>>>>>>>> >>> >>                             because this goes against the
>>>>>>>>> end user
>>>>>>>>> >>> >> experience
>>>>>>>>> >>> >>                             and affects us all
>>>>>>>>> >>> >>                             (devs/users).
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             Another question is if we
>>>>>>>>> should shade
>>>>>>>>> >>> >> (and how)
>>>>>>>>> >>> >>                             protocol buffers
>>>>>>>>> >>> >>                             because now with the
>>>>>>>>> portability work we
>>>>>>>>> >>> >> are
>>>>>>>>> >>> >>                             exposing it closer to the
>>>>>>>>> >>> >>                             end users. I say this because I
>>>>>>>>> also found
>>>>>>>>> >>> >> an
>>>>>>>>> >>> >>                             issue while running a
>>>>>>>>> >>> >>                             job on YARN with the spark
>>>>>>>>> runner because
>>>>>>>>> >>> >>                             hadoop-common includes
>>>>>>>>> >>> >>                             protobuf-java 2 and I had to
>>>>>>>>> explicitly
>>>>>>>>> >>> >> provide
>>>>>>>>> >>> >>                             protocol-buffers 3 as
>>>>>>>>> >>> >>                             a dependency to be able to use
>>>>>>>>> triggers
>>>>>>>>> >>> >> (note the
>>>>>>>>> >>> >>                             Spark runner
>>>>>>>>> >>> >>                             translates them using some
>>>>>>>>> method from
>>>>>>>>> >>> >>                             runners/core-java). Since
>>>>>>>>> >>> >>                             hadoop-common is provided in
>>>>>>>>> the cluster
>>>>>>>>> >>> >> with the
>>>>>>>>> >>> >>                             older version of
>>>>>>>>> >>> >>                             protobuf, I am afraid that this
>>>>>>>>> will bite
>>>>>>>>> >>> >> us in
>>>>>>>>> >>> >>                             the future.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             Ismaël
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             ps. There is already a JIRA for
>>>>>>>>> that
>>>>>>>>> >>> >> shading for
>>>>>>>>> >>> >>                             protobuf on
>>>>>>>>> >>> >>                             hadoop-common but this is not
>>>>>>>>> coming until
>>>>>>>>> >>> >> version
>>>>>>>>> >>> >>                             3 is out.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>>>>> >>> >>
>>>>>>>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             ps2. Extra curious situation is
>>>>>>>>> to see
>>>>>>>>> >>> >> that the
>>>>>>>>> >>> >>                             dataflow-runner ends
>>>>>>>>> >>> >>                             up having guava shaded twice
>>>>>>>>> via its
>>>>>>>>> >>> >> shaded version on
>>>>>>>>> >>> >>                             core-construction-java.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             ps3. Of course this message
>>>>>>>>> means a
>>>>>>>>> >>> >> de-facto +1 at
>>>>>>>>> >>> >>                             least to do it for
>>>>>>>>> >>> >>                             guava and evaluate it for other
>>>>>>>>> libs.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29
>>>>>>>>> PM, Lukasz
>>>>>>>>> >>> >> Cwik
>>>>>>>>> >>> >>                             <lcwik@google.com.invalid
>>>>>>>>> >>> >>                             <mailto:lcwik@google.com.inva
>>>>>>>>> lid>> wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                 An issue to call out is how
>>>>>>>>> to deal
>>>>>>>>> >>> >> with our
>>>>>>>>> >>> >>                                 generated code (.avro and
>>>>>>>>> >>> >>                                 .proto) as I don't believe
>>>>>>>>> those
>>>>>>>>> >>> >> plugins allow
>>>>>>>>> >>> >>                                 you to generate code using a
>>>>>>>>> >>> >>                                 shaded package prefix on
>>>>>>>>> imports.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                 On Tue, Oct 17, 2017 at
>>>>>>>>> 10:28 AM,
>>>>>>>>> >>> >> Thomas Groh
>>>>>>>>> >>> >>                                 <tgroh@google.com.invalid
>>>>>>>>> >>> >>                                 <mailto:
>>>>>>>>> tgroh@google.com.invalid>>
>>>>>>>>> >>> >>                                 wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     +1 to the goal. I'm
>>>>>>>>> hugely in
>>>>>>>>> >>> >> favor of not
>>>>>>>>> >>> >>                                     doing the same shading
>>>>>>>>> work
>>>>>>>>> >>> >>                                     every time for
>>>>>>>>> dependencies we
>>>>>>>>> >>> >> know we'll use.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     This also means that if
>>>>>>>>> we end up
>>>>>>>>> >>> >> pulling
>>>>>>>>> >>> >>                                     in transitive
>>>>>>>>> dependencies we
>>>>>>>>> >>> >>                                     don't want in any
>>>>>>>>> particular
>>>>>>>>> >>> >> module we can
>>>>>>>>> >>> >>                                     avoid having to adjust
>>>>>>>>> our
>>>>>>>>> >>> >>                                     repackaging strategy
>>>>>>>>> for that
>>>>>>>>> >>> >> module -
>>>>>>>>> >>> >>                                     which I have run into
>>>>>>>>> face-first
>>>>>>>>> >>> >> in
>>>>>>>>> >>> >>                                     the past.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     On Tue, Oct 17, 2017 at
>>>>>>>>> 9:48 AM,
>>>>>>>>> >>> >> Kenneth
>>>>>>>>> >>> >>                                     Knowles
>>>>>>>>> <klk@google.com.invalid
>>>>>>>>> >>> >>                                     <mailto:
>>>>>>>>> klk@google.com.invalid>>
>>>>>>>>> >>> >>                                     wrote:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Hi all,
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Shading is a big
>>>>>>>>> part of how
>>>>>>>>> >>> >> we keep
>>>>>>>>> >>> >>                                         our dependencies
>>>>>>>>> sane in Beam.
>>>>>>>>> >>> >> But
>>>>>>>>> >>> >>                                         downsides: shading
>>>>>>>>> is super
>>>>>>>>> >>> >> slow,
>>>>>>>>> >>> >>                                         causes massive jar
>>>>>>>>> bloat, and
>>>>>>>>> >>> >> kind of
>>>>>>>>> >>> >>                                         hard to get right
>>>>>>>>> because
>>>>>>>>> >>> >> artifacts
>>>>>>>>> >>> >>                                         and namespaces are
>>>>>>>>> not 1-to-1.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         I know that some
>>>>>>>>> communities
>>>>>>>>> >>> >>                                         distribute their
>>>>>>>>> own shaded
>>>>>>>>> >>> >>                                         distributions of
>>>>>>>>> >>> >>                                         dependencies. I had
>>>>>>>>> a thought
>>>>>>>>> >>> >> about
>>>>>>>>> >>> >>                                         doing something
>>>>>>>>> similar that I
>>>>>>>>> >>> >> wanted
>>>>>>>>> >>> >>                                         to throw out there
>>>>>>>>> for people
>>>>>>>>> >>> >> to poke
>>>>>>>>> >>> >>                                         holes in.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         To set the scene,
>>>>>>>>> here is how
>>>>>>>>> >>> >> I view
>>>>>>>>> >>> >>                                         shading:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                           - A module has
>>>>>>>>> public
>>>>>>>>> >>> >> dependencies
>>>>>>>>> >>> >>                                         and private
>>>>>>>>> dependencies.
>>>>>>>>> >>> >>                                           - Public deps are
>>>>>>>>> used for
>>>>>>>>> >>> >> data
>>>>>>>>> >>> >>                                         interchange; users
>>>>>>>>> must share
>>>>>>>>> >>> >> these
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     deps.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                           - Private deps
>>>>>>>>> are just
>>>>>>>>> >>> >>                                         functionality and
>>>>>>>>> can be
>>>>>>>>> >>> >> hidden (in
>>>>>>>>> >>> >>                                         our case,
>>>>>>>>> >>> >>                                         relocated + bundled)
>>>>>>>>> >>> >>                                           - It isn't
>>>>>>>>> necessarily that
>>>>>>>>> >>> >> simple,
>>>>>>>>> >>> >>                                         because public and
>>>>>>>>> private
>>>>>>>>> >>> >> deps
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     might
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         interact in
>>>>>>>>> higher-order ways
>>>>>>>>> >>> >>                                         ("public" is
>>>>>>>>> contagious)
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Shading is an
>>>>>>>>> implementation
>>>>>>>>> >>> >> detail of
>>>>>>>>> >>> >>                                         expressing these
>>>>>>>>> >>> >> characteristics.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     We
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         use shading
>>>>>>>>> selectively
>>>>>>>>> >>> >> because of its
>>>>>>>>> >>> >>                                         downsides I
>>>>>>>>> mentioned above.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         But what about this
>>>>>>>>> idea:
>>>>>>>>> >>> >> Introduce
>>>>>>>>> >>> >>                                         shaded deps as a
>>>>>>>>> single
>>>>>>>>> >>> >> separate
>>>>>>>>> >>> >>                                         artifact.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                           -
>>>>>>>>> sdks/java/private-deps:
>>>>>>>>> >>> >> bundled
>>>>>>>>> >>> >>                                         uber jar with
>>>>>>>>> relocated
>>>>>>>>> >>> >> versions of
>>>>>>>>> >>> >>                                         everything we want
>>>>>>>>> to shade
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                           - sdks/java/core
>>>>>>>>> and
>>>>>>>>> >>> >>                                         sdks/java/harness:
>>>>>>>>> no
>>>>>>>>> >>> >> relocation or
>>>>>>>>> >>> >>                                         bundling -
>>>>>>>>> >>> >>                                         depends on
>>>>>>>>> >>> >>
>>>>>>>>>  `beam-sdks-java-private-deps`
>>>>>>>>> >>> >> and
>>>>>>>>> >>> >>                                         imports like
>>>>>>>>> >>> >>
>>>>>>>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>>>>>>>> >>> >>                                         directly (this is
>>>>>>>>> what
>>>>>>>>> >>> >>                                         they
>>>>>>>>> >>> >>                                         are rewritten to
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Some benefits
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                           - much faster
>>>>>>>>> builds of
>>>>>>>>> >>> >> other modules
>>>>>>>>> >>> >>                                           - only one shaded
>>>>>>>>> uber jar
>>>>>>>>> >>> >>                                           - rare/no
>>>>>>>>> rebuilds of the
>>>>>>>>> >>> >> uber jar
>>>>>>>>> >>> >>                                           - can use maven
>>>>>>>>> enforcer to
>>>>>>>>> >>> >> forbid
>>>>>>>>> >>> >>                                         imports like
>>>>>>>>> com.google.common
>>>>>>>>> >>> >>                                           - configuration
>>>>>>>>> all in one
>>>>>>>>> >>> >> place
>>>>>>>>> >>> >>                                           - no automated
>>>>>>>>> rewriting of
>>>>>>>>> >>> >> our real
>>>>>>>>> >>> >>                                         code, which has led
>>>>>>>>> to some
>>>>>>>>> >>> >> major
>>>>>>>>> >>> >>                                         confusion
>>>>>>>>> >>> >>                                           - easy to
>>>>>>>>> implement
>>>>>>>>> >>> >> incrementally
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Downsides:
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                           - plenty of
>>>>>>>>> effort work to
>>>>>>>>> >>> >> get there
>>>>>>>>> >>> >>                                           - unclear how
>>>>>>>>> many different
>>>>>>>>> >>> >> such
>>>>>>>>> >>> >>                                         deps modules we
>>>>>>>>> need; sharing
>>>>>>>>> >>> >> them
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                     could
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         get weird
>>>>>>>>> >>> >>                                           - if we hit a
>>>>>>>>> roadblock, we
>>>>>>>>> >>> >> will
>>>>>>>>> >>> >>                                         have committed a
>>>>>>>>> lot of time
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Just something I
>>>>>>>>> was musing as
>>>>>>>>> >>> >> I spent
>>>>>>>>> >>> >>                                         another evening
>>>>>>>>> waiting for
>>>>>>>>> >>> >> slow
>>>>>>>>> >>> >>                                         builds to try to
>>>>>>>>> confirm
>>>>>>>>> >>> >> changes to
>>>>>>>>> >>> >>                                         brittle poms.
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                                         Kenn
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>                         --
>>>>>>>>> >>> >>                         Jean-Baptiste Onofré
>>>>>>>>> >>> >>                         jbonofre@apache.org
>>>>>>>>> >>> >> <ma...@apache.org>
>>>>>>>>> >>> >>                         http://blog.nanthrax.net
>>>>>>>>> >>> >> <http://blog.nanthrax.net/>
>>>>>>>>> >>> >>                         Talend - http://www.talend.com
>>>>>>>>> >>> >> <http://www.talend.com/>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >>
>>>>>>>>> >>> >
>>>>>>>>> >>> >
>>>>>>>>> >>> >
>>>>>>>>> >>>
>>>>>>>>> >>> --
>>>>>>>>> >>> Jean-Baptiste Onofré
>>>>>>>>> >>> jbonofre@apache.org
>>>>>>>>> >>> http://blog.nanthrax.net
>>>>>>>>> >>> Talend - http://www.talend.com
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Reuven Lax <re...@google.com>.
TypeToken is not trivial. I've written code to do what TypeToken does
before (figuring out generic ancestor types). It's actually somewhat
tricky, and the code I wrote had subtle bugs in it; eventually we removed
this code in favor of Guava's implementation :)

On Fri, Feb 2, 2018 at 7:47 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Yep, note I never said to reinvent the wheel, we can copy it from guava,
> openwebbeans or any other impl. Point was more to avoid to depend on
> something we don't own for that which is after all not that much code. I
> also think we can limit it a lot to align it on what is supported by beam
> (I'm thinking to coders here) but this can be another topic.
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-02 16:33 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>
>> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <rmannibucau@gmail.com
>> > wrote:
>>
>>> Don't forget beam doesn't support much behind it (mainly only a few
>>> ParameterizedType due the usage code path) so it is mainly only about
>>> handling parameterized types and typevariables recursively. Not a lot of
>>> work. Main concern being it is in the API so using a shade as an API is
>>> never a good idea, in particular cause the shade can be broken and requires
>>> to setup clirr or things like that and when it breaks you are done and need
>>> to fork it anyway. Limiting the dependencies for an API - as beam is - is
>>> always saner even if it requires a small fork of code.
>>>
>>
>> The main thing that TypeToken is used for is capturing generics that are
>> lost by Java reflection. It is a bit tricky, actually.
>>
>> Kenn
>>
>>
>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>>
>>>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>>>>
>>>>>> Another couple:
>>>>>>
>>>>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>>>>> TypeToken
>>>>>>
>>>>>
>>>>> Technically reflect Type is enough
>>>>>
>>>>
>>>> If you want to try to remove TypeToken from underneath TypeDescriptor,
>>>> I have no objections as long as you expand the test suite to cover all the
>>>> functionality where we trust TypeToken's tests. Good luck :-)
>>>>
>>>> Kenn
>>>>
>>>>
>>>>>
>>>>>
>>>>>>  - ImmutableList and friends and their builders are very widely used
>>>>>> and IMO still add a lot for readability and preventing someone coming along
>>>>>> and adding mistakes to a codebase
>>>>>>
>>>>>
>>>>> Sugar but not required. When you compare the cost of a shade versus of
>>>>> duplicating the parts we need there is no real match IMHO.
>>>>>
>>>>>
>>>>>>
>>>>>> So considering it all, I would keep a vendored Guava (but also move
>>>>>> off where we can, and also have our own improvements). I hope it will be a
>>>>>> near-empty build file to generate it so not a maintenance burden.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Nice. This sounds like a great idea (or two?) and goes along with
>>>>>>> what I just started for futures.
>>>>>>>
>>>>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>>>>
>>>>>>> Specific things for our micro guava:
>>>>>>>
>>>>>>>  - checkArgumentNotNull can throw the right exception
>>>>>>>  - our own Optional because Java's is not Serializable
>>>>>>>  - futures combinators since many are missing, especially Java's
>>>>>>> don't do exceptions right
>>>>>>>
>>>>>>> Protobuf: didn't file an issue because I'm not sure
>>>>>>>
>>>>>>> I was wondering if pre-shading works. We really need to get rid of
>>>>>>> it from public APIs in a 100% reliable way. It is also a problem for
>>>>>>> Dataflow. I was wondering if one approach is to pre-shade
>>>>>>> gcpio-protobuf-java, gcpio-protobuf-java-util, gcpio-grpc-java, etc.
>>>>>>> Anything that needs to take a Message object. (and do the same for
>>>>>>> beam-model-protobuf-java since the model bits have to depend on each other
>>>>>>> but nothing else).
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Huge +1 to get rid of Guava!
>>>>>>>>
>>>>>>>> This solves annoying dependency issues for some IOs and allow us to
>>>>>>>> get rid of the shading that makes current jars bigger than they
>>>>>>>> should
>>>>>>>> be.
>>>>>>>>
>>>>>>>> We can create our own 'micro guava' package with some classes for
>>>>>>>> things that are hard to migrate, or that we  prefer to still have
>>>>>>>> like
>>>>>>>> the check* methods for example. Given the size of the task we should
>>>>>>>> probably divide it into subtasks, more important is to get rid of it
>>>>>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>>>>>
>>>>>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>>>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>>>>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>>>>>>> that is byting us for the Spark runner, but well that's worth a
>>>>>>>> different discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>>>>>> <rm...@gmail.com> wrote:
>>>>>>>> > a map of list is fine and not a challenge we'll face long I hope
>>>>>>>> ;)
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Romain Manni-Bucau
>>>>>>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>>>>>>> >
>>>>>>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>>> >>
>>>>>>>> >> Not sure we'll be able to replace them all. Things like guava
>>>>>>>> Table and
>>>>>>>> >> Multimap don't have great replacements in Java8.
>>>>>>>> >>
>>>>>>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>>>>>>> jb@nanthrax.net>
>>>>>>>> >> wrote:
>>>>>>>> >>>
>>>>>>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>>>>>>> >>>
>>>>>>>> >>> Regards
>>>>>>>> >>> JB
>>>>>>>> >>>
>>>>>>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>>>>>>> >>> > Why not dropping guava for all beam codebase? With java 8 it
>>>>>>>> is quite
>>>>>>>> >>> > easy to do
>>>>>>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with
>>>>>>>> quite a
>>>>>>>> >>> > good result.
>>>>>>>> >>> >
>>>>>>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>>>>>>> >>> > <ma...@google.com>> a écrit :
>>>>>>>> >>> >
>>>>>>>> >>> >     Make sure to include the guava version in the artifact
>>>>>>>> name so that
>>>>>>>> >>> > we can
>>>>>>>> >>> >     have multiple vendored versions.
>>>>>>>> >>> >
>>>>>>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>>>>>>> klk@google.com
>>>>>>>> >>> >     <ma...@google.com>> wrote:
>>>>>>>> >>> >
>>>>>>>> >>> >         I didn't have time for this, but it just bit me. We
>>>>>>>> definitely
>>>>>>>> >>> > have
>>>>>>>> >>> >         Guava on the API surface of runner support code in
>>>>>>>> ways that
>>>>>>>> >>> > get
>>>>>>>> >>> >         incompatibly shaded. I will probably start "1a" by
>>>>>>>> making a
>>>>>>>> >>> > shaded
>>>>>>>> >>> >         library org.apache.beam:vendored-guava and starting
>>>>>>>> to use it.
>>>>>>>> >>> > It sounds
>>>>>>>> >>> >         like there is generally unanimous support for that
>>>>>>>> much,
>>>>>>>> >>> > anyhow.
>>>>>>>> >>> >
>>>>>>>> >>> >         Kenn
>>>>>>>> >>> >
>>>>>>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>>>>>>> >>> > <aljoscha@apache.org
>>>>>>>> >>> >         <ma...@apache.org>> wrote:
>>>>>>>> >>> >
>>>>>>>> >>> >             Thanks Ismaël for bringing up this discussion
>>>>>>>> again!
>>>>>>>> >>> >
>>>>>>>> >>> >             I would be in favour of 1) and more specifically
>>>>>>>> of 1a)
>>>>>>>> >>> >
>>>>>>>> >>> >             Aljoscha
>>>>>>>> >>> >
>>>>>>>> >>> >
>>>>>>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>>>>>>> lcwik@google.com
>>>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>             You can always run tests on post shaded
>>>>>>>> artifacts instead
>>>>>>>> >>> >> of the
>>>>>>>> >>> >>             compiled classes, it just requires us to change
>>>>>>>> our maven
>>>>>>>> >>> >> surefire
>>>>>>>> >>> >>             / gradle test configurations but it is true that
>>>>>>>> most IDEs
>>>>>>>> >>> >> would
>>>>>>>> >>> >>             behave better with a dependency jar unless you
>>>>>>>> delegate
>>>>>>>> >>> >> all the
>>>>>>>> >>> >>             build/test actions to the build system and then
>>>>>>>> it won't
>>>>>>>> >>> >> matter.
>>>>>>>> >>> >>
>>>>>>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>>>>>>> >>> >> <klk@google.com
>>>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 There's also, with additional overhead,
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 1a) A relocated and shipped package for each
>>>>>>>> thing we
>>>>>>>> >>> >> want to
>>>>>>>> >>> >>                 relocate. I think this has also been tried
>>>>>>>> outside
>>>>>>>> >>> >> Beam...
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 Pros:
>>>>>>>> >>> >>                 * all the pros of 1) plus no bloat beyond
>>>>>>>> what is
>>>>>>>> >>> >> necessary
>>>>>>>> >>> >>                 Cons:
>>>>>>>> >>> >>                 * abandons whitelist approach for public
>>>>>>>> deps,
>>>>>>>> >>> >> reverting to
>>>>>>>> >>> >>                 blacklist approach for trouble things like
>>>>>>>> guava, so a
>>>>>>>> >>> >> bit
>>>>>>>> >>> >>                 less principled
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 For both 1) and 1a) I would add:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 Pros:
>>>>>>>> >>> >>                 * clearly readable dependency since code
>>>>>>>> will `import
>>>>>>>> >>> >>                 org.apache.beam.private.guava21` and IDEs
>>>>>>>> will
>>>>>>>> >>> >> understand it
>>>>>>>> >>> >>                 is a distinct lilbrary
>>>>>>>> >>> >>                 * can run tests on unpackaged classes, as
>>>>>>>> long as deps
>>>>>>>> >>> >> are
>>>>>>>> >>> >>                 shaded or provided as jars
>>>>>>>> >>> >>                 * no mysterious action at a distance from
>>>>>>>> inherited
>>>>>>>> >>> >> configuration
>>>>>>>> >>> >>                 Cons:
>>>>>>>> >>> >>                 * need to adjust imports
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 Kenn
>>>>>>>> >>> >>
>>>>>>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>>>>>>> >>> >> <lcwik@google.com
>>>>>>>> >>> >>                 <ma...@google.com>> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                     I would suggest that either we use:
>>>>>>>> >>> >>                     1) A common deps package containing
>>>>>>>> shaded
>>>>>>>> >>> >> dependencies
>>>>>>>> >>> >>                     allows for
>>>>>>>> >>> >>                     Pros
>>>>>>>> >>> >>                     * doesn't require the user to build an
>>>>>>>> uber jar
>>>>>>>> >>> >>                     Risks
>>>>>>>> >>> >>                     * dependencies package will keep growing
>>>>>>>> even if
>>>>>>>> >>> >> something
>>>>>>>> >>> >>                     is or isn't needed by all of Apache Beam
>>>>>>>> leading
>>>>>>>> >>> >> to a
>>>>>>>> >>> >>                     large jar anyways negating any space
>>>>>>>> savings
>>>>>>>> >>> >>
>>>>>>>> >>> >>                     2) Shade within each module to a common
>>>>>>>> location
>>>>>>>> >>> >> like
>>>>>>>> >>> >>                     org.apache.beam.relocated.guava....
>>>>>>>> >>> >>                     Pros
>>>>>>>> >>> >>                     * you only get the shaded dependencies
>>>>>>>> of the
>>>>>>>> >>> >> things that
>>>>>>>> >>> >>                     are required
>>>>>>>> >>> >>                     * its one less dependency for users to
>>>>>>>> manage
>>>>>>>> >>> >>                     Risks
>>>>>>>> >>> >>                     * requires an uber jar to be built to
>>>>>>>> get the
>>>>>>>> >>> >> space
>>>>>>>> >>> >>                     savings (either by a user or a
>>>>>>>> distribution of
>>>>>>>> >>> >> Apache
>>>>>>>> >>> >>                     Beam) otherwise we negate any space
>>>>>>>> savings.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                     If we either use a common relocation
>>>>>>>> scheme or a
>>>>>>>> >>> >>                     dependencies jar then each relocation
>>>>>>>> should
>>>>>>>> >>> >> specifically
>>>>>>>> >>> >>                     contain the version number of the
>>>>>>>> package because
>>>>>>>> >>> >> we would
>>>>>>>> >>> >>                     like to allow for us to be using two
>>>>>>>> different
>>>>>>>> >>> >> versions of
>>>>>>>> >>> >>                     the same library.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                     For the common deps package approach,
>>>>>>>> should we
>>>>>>>> >>> >> check in
>>>>>>>> >>> >>                     code where the imports contain the
>>>>>>>> relocated
>>>>>>>> >>> >> location
>>>>>>>> >>> >>                     (e.g. import
>>>>>>>> org.apache.beam.guava.20.0.com
>>>>>>>> >>> >>
>>>>>>>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>>>>>>>> ct.ImmutableList)?
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM,
>>>>>>>> Jean-Baptiste
>>>>>>>> >>> >> Onofré
>>>>>>>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>>>>>>>> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                         Thanks for bringing that back.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                         Indeed guava is shaded in different
>>>>>>>> uber-jar.
>>>>>>>> >>> >> Maybe we
>>>>>>>> >>> >>                         can have a common deps module that
>>>>>>>> we include
>>>>>>>> >>> >> once
>>>>>>>> >>> >>                         (but the user will have to
>>>>>>>> explicitly define
>>>>>>>> >>> >> the dep) ?
>>>>>>>> >>> >>
>>>>>>>> >>> >>                         Basically, what do you propose for
>>>>>>>> protobuf
>>>>>>>> >>> >>                         (unfortunately, I don't see an
>>>>>>>> obvious) ?
>>>>>>>> >>> >>
>>>>>>>> >>> >>                         Regards
>>>>>>>> >>> >>                         JB
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía
>>>>>>>> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             Hello, I wanted to bring back
>>>>>>>> this subject
>>>>>>>> >>> >> because
>>>>>>>> >>> >>                             I think we should
>>>>>>>> >>> >>                             take action on this and at least
>>>>>>>> first
>>>>>>>> >>> >> have a
>>>>>>>> >>> >>                             shaded version of guava.
>>>>>>>> >>> >>                             I was playing with a toy project
>>>>>>>> and I did
>>>>>>>> >>> >> the
>>>>>>>> >>> >>                             procedure we use to
>>>>>>>> >>> >>                             submit jars to a Hadoop cluster
>>>>>>>> via
>>>>>>>> >>> >> Flink/Spark
>>>>>>>> >>> >>                             which involves
>>>>>>>> >>> >>                             creating an uber jar and I
>>>>>>>> realized that
>>>>>>>> >>> >> the size
>>>>>>>> >>> >>                             of the jar was way
>>>>>>>> >>> >>                             bigger than I expected, and the
>>>>>>>> fact that
>>>>>>>> >>> >> we shade
>>>>>>>> >>> >>                             guava in every
>>>>>>>> >>> >>                             module contributes to this. I
>>>>>>>> found guava
>>>>>>>> >>> >> shaded on:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             sdks/java/core
>>>>>>>> >>> >>                             runners/core-construction-java
>>>>>>>> >>> >>                             runners/core-java
>>>>>>>> >>> >>                             model/job-management
>>>>>>>> >>> >>                             runners/spark
>>>>>>>> >>> >>                             sdks/java/io/hadoop-file-system
>>>>>>>> >>> >>                             sdks/java/io/kafka
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             This means at least 6 times more
>>>>>>>> of the
>>>>>>>> >>> >> size it
>>>>>>>> >>> >>                             should which counts in
>>>>>>>> >>> >>                             around 15MB more (2.4MB*6 deps)
>>>>>>>> of extra
>>>>>>>> >>> >> weight
>>>>>>>> >>> >>                             that we can simply
>>>>>>>> >>> >>                             reduce by using a shaded version
>>>>>>>> of the
>>>>>>>> >>> >> library.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             I add this point to the previous
>>>>>>>> ones
>>>>>>>> >>> >> mentioned
>>>>>>>> >>> >>                             during the discussion
>>>>>>>> >>> >>                             because this goes against the
>>>>>>>> end user
>>>>>>>> >>> >> experience
>>>>>>>> >>> >>                             and affects us all
>>>>>>>> >>> >>                             (devs/users).
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             Another question is if we should
>>>>>>>> shade
>>>>>>>> >>> >> (and how)
>>>>>>>> >>> >>                             protocol buffers
>>>>>>>> >>> >>                             because now with the portability
>>>>>>>> work we
>>>>>>>> >>> >> are
>>>>>>>> >>> >>                             exposing it closer to the
>>>>>>>> >>> >>                             end users. I say this because I
>>>>>>>> also found
>>>>>>>> >>> >> an
>>>>>>>> >>> >>                             issue while running a
>>>>>>>> >>> >>                             job on YARN with the spark
>>>>>>>> runner because
>>>>>>>> >>> >>                             hadoop-common includes
>>>>>>>> >>> >>                             protobuf-java 2 and I had to
>>>>>>>> explicitly
>>>>>>>> >>> >> provide
>>>>>>>> >>> >>                             protocol-buffers 3 as
>>>>>>>> >>> >>                             a dependency to be able to use
>>>>>>>> triggers
>>>>>>>> >>> >> (note the
>>>>>>>> >>> >>                             Spark runner
>>>>>>>> >>> >>                             translates them using some
>>>>>>>> method from
>>>>>>>> >>> >>                             runners/core-java). Since
>>>>>>>> >>> >>                             hadoop-common is provided in the
>>>>>>>> cluster
>>>>>>>> >>> >> with the
>>>>>>>> >>> >>                             older version of
>>>>>>>> >>> >>                             protobuf, I am afraid that this
>>>>>>>> will bite
>>>>>>>> >>> >> us in
>>>>>>>> >>> >>                             the future.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             Ismaël
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             ps. There is already a JIRA for
>>>>>>>> that
>>>>>>>> >>> >> shading for
>>>>>>>> >>> >>                             protobuf on
>>>>>>>> >>> >>                             hadoop-common but this is not
>>>>>>>> coming until
>>>>>>>> >>> >> version
>>>>>>>> >>> >>                             3 is out.
>>>>>>>> >>> >>
>>>>>>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>>>> >>> >>
>>>>>>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             ps2. Extra curious situation is
>>>>>>>> to see
>>>>>>>> >>> >> that the
>>>>>>>> >>> >>                             dataflow-runner ends
>>>>>>>> >>> >>                             up having guava shaded twice via
>>>>>>>> its
>>>>>>>> >>> >> shaded version on
>>>>>>>> >>> >>                             core-construction-java.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             ps3. Of course this message
>>>>>>>> means a
>>>>>>>> >>> >> de-facto +1 at
>>>>>>>> >>> >>                             least to do it for
>>>>>>>> >>> >>                             guava and evaluate it for other
>>>>>>>> libs.
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM,
>>>>>>>> Lukasz
>>>>>>>> >>> >> Cwik
>>>>>>>> >>> >>                             <lcwik@google.com.invalid
>>>>>>>> >>> >>                             <ma...@google.com.invalid>>
>>>>>>>> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                 An issue to call out is how
>>>>>>>> to deal
>>>>>>>> >>> >> with our
>>>>>>>> >>> >>                                 generated code (.avro and
>>>>>>>> >>> >>                                 .proto) as I don't believe
>>>>>>>> those
>>>>>>>> >>> >> plugins allow
>>>>>>>> >>> >>                                 you to generate code using a
>>>>>>>> >>> >>                                 shaded package prefix on
>>>>>>>> imports.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                 On Tue, Oct 17, 2017 at
>>>>>>>> 10:28 AM,
>>>>>>>> >>> >> Thomas Groh
>>>>>>>> >>> >>                                 <tgroh@google.com.invalid
>>>>>>>> >>> >>                                 <mailto:
>>>>>>>> tgroh@google.com.invalid>>
>>>>>>>> >>> >>                                 wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     +1 to the goal. I'm
>>>>>>>> hugely in
>>>>>>>> >>> >> favor of not
>>>>>>>> >>> >>                                     doing the same shading
>>>>>>>> work
>>>>>>>> >>> >>                                     every time for
>>>>>>>> dependencies we
>>>>>>>> >>> >> know we'll use.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     This also means that if
>>>>>>>> we end up
>>>>>>>> >>> >> pulling
>>>>>>>> >>> >>                                     in transitive
>>>>>>>> dependencies we
>>>>>>>> >>> >>                                     don't want in any
>>>>>>>> particular
>>>>>>>> >>> >> module we can
>>>>>>>> >>> >>                                     avoid having to adjust
>>>>>>>> our
>>>>>>>> >>> >>                                     repackaging strategy for
>>>>>>>> that
>>>>>>>> >>> >> module -
>>>>>>>> >>> >>                                     which I have run into
>>>>>>>> face-first
>>>>>>>> >>> >> in
>>>>>>>> >>> >>                                     the past.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     On Tue, Oct 17, 2017 at
>>>>>>>> 9:48 AM,
>>>>>>>> >>> >> Kenneth
>>>>>>>> >>> >>                                     Knowles
>>>>>>>> <klk@google.com.invalid
>>>>>>>> >>> >>                                     <mailto:
>>>>>>>> klk@google.com.invalid>>
>>>>>>>> >>> >>                                     wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Hi all,
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Shading is a big
>>>>>>>> part of how
>>>>>>>> >>> >> we keep
>>>>>>>> >>> >>                                         our dependencies
>>>>>>>> sane in Beam.
>>>>>>>> >>> >> But
>>>>>>>> >>> >>                                         downsides: shading
>>>>>>>> is super
>>>>>>>> >>> >> slow,
>>>>>>>> >>> >>                                         causes massive jar
>>>>>>>> bloat, and
>>>>>>>> >>> >> kind of
>>>>>>>> >>> >>                                         hard to get right
>>>>>>>> because
>>>>>>>> >>> >> artifacts
>>>>>>>> >>> >>                                         and namespaces are
>>>>>>>> not 1-to-1.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         I know that some
>>>>>>>> communities
>>>>>>>> >>> >>                                         distribute their own
>>>>>>>> shaded
>>>>>>>> >>> >>                                         distributions of
>>>>>>>> >>> >>                                         dependencies. I had
>>>>>>>> a thought
>>>>>>>> >>> >> about
>>>>>>>> >>> >>                                         doing something
>>>>>>>> similar that I
>>>>>>>> >>> >> wanted
>>>>>>>> >>> >>                                         to throw out there
>>>>>>>> for people
>>>>>>>> >>> >> to poke
>>>>>>>> >>> >>                                         holes in.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         To set the scene,
>>>>>>>> here is how
>>>>>>>> >>> >> I view
>>>>>>>> >>> >>                                         shading:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                           - A module has
>>>>>>>> public
>>>>>>>> >>> >> dependencies
>>>>>>>> >>> >>                                         and private
>>>>>>>> dependencies.
>>>>>>>> >>> >>                                           - Public deps are
>>>>>>>> used for
>>>>>>>> >>> >> data
>>>>>>>> >>> >>                                         interchange; users
>>>>>>>> must share
>>>>>>>> >>> >> these
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     deps.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                           - Private deps are
>>>>>>>> just
>>>>>>>> >>> >>                                         functionality and
>>>>>>>> can be
>>>>>>>> >>> >> hidden (in
>>>>>>>> >>> >>                                         our case,
>>>>>>>> >>> >>                                         relocated + bundled)
>>>>>>>> >>> >>                                           - It isn't
>>>>>>>> necessarily that
>>>>>>>> >>> >> simple,
>>>>>>>> >>> >>                                         because public and
>>>>>>>> private
>>>>>>>> >>> >> deps
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     might
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         interact in
>>>>>>>> higher-order ways
>>>>>>>> >>> >>                                         ("public" is
>>>>>>>> contagious)
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Shading is an
>>>>>>>> implementation
>>>>>>>> >>> >> detail of
>>>>>>>> >>> >>                                         expressing these
>>>>>>>> >>> >> characteristics.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     We
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         use shading
>>>>>>>> selectively
>>>>>>>> >>> >> because of its
>>>>>>>> >>> >>                                         downsides I
>>>>>>>> mentioned above.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         But what about this
>>>>>>>> idea:
>>>>>>>> >>> >> Introduce
>>>>>>>> >>> >>                                         shaded deps as a
>>>>>>>> single
>>>>>>>> >>> >> separate
>>>>>>>> >>> >>                                         artifact.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                           -
>>>>>>>> sdks/java/private-deps:
>>>>>>>> >>> >> bundled
>>>>>>>> >>> >>                                         uber jar with
>>>>>>>> relocated
>>>>>>>> >>> >> versions of
>>>>>>>> >>> >>                                         everything we want
>>>>>>>> to shade
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                           - sdks/java/core
>>>>>>>> and
>>>>>>>> >>> >>                                         sdks/java/harness: no
>>>>>>>> >>> >> relocation or
>>>>>>>> >>> >>                                         bundling -
>>>>>>>> >>> >>                                         depends on
>>>>>>>> >>> >>
>>>>>>>>  `beam-sdks-java-private-deps`
>>>>>>>> >>> >> and
>>>>>>>> >>> >>                                         imports like
>>>>>>>> >>> >>
>>>>>>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>>>>>>> >>> >>                                         directly (this is
>>>>>>>> what
>>>>>>>> >>> >>                                         they
>>>>>>>> >>> >>                                         are rewritten to
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Some benefits
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                           - much faster
>>>>>>>> builds of
>>>>>>>> >>> >> other modules
>>>>>>>> >>> >>                                           - only one shaded
>>>>>>>> uber jar
>>>>>>>> >>> >>                                           - rare/no rebuilds
>>>>>>>> of the
>>>>>>>> >>> >> uber jar
>>>>>>>> >>> >>                                           - can use maven
>>>>>>>> enforcer to
>>>>>>>> >>> >> forbid
>>>>>>>> >>> >>                                         imports like
>>>>>>>> com.google.common
>>>>>>>> >>> >>                                           - configuration
>>>>>>>> all in one
>>>>>>>> >>> >> place
>>>>>>>> >>> >>                                           - no automated
>>>>>>>> rewriting of
>>>>>>>> >>> >> our real
>>>>>>>> >>> >>                                         code, which has led
>>>>>>>> to some
>>>>>>>> >>> >> major
>>>>>>>> >>> >>                                         confusion
>>>>>>>> >>> >>                                           - easy to implement
>>>>>>>> >>> >> incrementally
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Downsides:
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                           - plenty of effort
>>>>>>>> work to
>>>>>>>> >>> >> get there
>>>>>>>> >>> >>                                           - unclear how many
>>>>>>>> different
>>>>>>>> >>> >> such
>>>>>>>> >>> >>                                         deps modules we
>>>>>>>> need; sharing
>>>>>>>> >>> >> them
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                     could
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         get weird
>>>>>>>> >>> >>                                           - if we hit a
>>>>>>>> roadblock, we
>>>>>>>> >>> >> will
>>>>>>>> >>> >>                                         have committed a lot
>>>>>>>> of time
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Just something I was
>>>>>>>> musing as
>>>>>>>> >>> >> I spent
>>>>>>>> >>> >>                                         another evening
>>>>>>>> waiting for
>>>>>>>> >>> >> slow
>>>>>>>> >>> >>                                         builds to try to
>>>>>>>> confirm
>>>>>>>> >>> >> changes to
>>>>>>>> >>> >>                                         brittle poms.
>>>>>>>> >>> >>
>>>>>>>> >>> >>                                         Kenn
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>                         --
>>>>>>>> >>> >>                         Jean-Baptiste Onofré
>>>>>>>> >>> >>                         jbonofre@apache.org
>>>>>>>> >>> >> <ma...@apache.org>
>>>>>>>> >>> >>                         http://blog.nanthrax.net
>>>>>>>> >>> >> <http://blog.nanthrax.net/>
>>>>>>>> >>> >>                         Talend - http://www.talend.com
>>>>>>>> >>> >> <http://www.talend.com/>
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >>
>>>>>>>> >>> >
>>>>>>>> >>> >
>>>>>>>> >>> >
>>>>>>>> >>>
>>>>>>>> >>> --
>>>>>>>> >>> Jean-Baptiste Onofré
>>>>>>>> >>> jbonofre@apache.org
>>>>>>>> >>> http://blog.nanthrax.net
>>>>>>>> >>> Talend - http://www.talend.com
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Yep, note I never said to reinvent the wheel, we can copy it from guava,
openwebbeans or any other impl. Point was more to avoid to depend on
something we don't own for that which is after all not that much code. I
also think we can limit it a lot to align it on what is supported by beam
(I'm thinking to coders here) but this can be another topic.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-02 16:33 GMT+01:00 Kenneth Knowles <kl...@google.com>:

> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Don't forget beam doesn't support much behind it (mainly only a few
>> ParameterizedType due the usage code path) so it is mainly only about
>> handling parameterized types and typevariables recursively. Not a lot of
>> work. Main concern being it is in the API so using a shade as an API is
>> never a good idea, in particular cause the shade can be broken and requires
>> to setup clirr or things like that and when it breaks you are done and need
>> to fork it anyway. Limiting the dependencies for an API - as beam is - is
>> always saner even if it requires a small fork of code.
>>
>
> The main thing that TypeToken is used for is capturing generics that are
> lost by Java reflection. It is a bit tricky, actually.
>
> Kenn
>
>
>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>
>>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>>>
>>>>> Another couple:
>>>>>
>>>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>>>> TypeToken
>>>>>
>>>>
>>>> Technically reflect Type is enough
>>>>
>>>
>>> If you want to try to remove TypeToken from underneath TypeDescriptor, I
>>> have no objections as long as you expand the test suite to cover all the
>>> functionality where we trust TypeToken's tests. Good luck :-)
>>>
>>> Kenn
>>>
>>>
>>>>
>>>>
>>>>>  - ImmutableList and friends and their builders are very widely used
>>>>> and IMO still add a lot for readability and preventing someone coming along
>>>>> and adding mistakes to a codebase
>>>>>
>>>>
>>>> Sugar but not required. When you compare the cost of a shade versus of
>>>> duplicating the parts we need there is no real match IMHO.
>>>>
>>>>
>>>>>
>>>>> So considering it all, I would keep a vendored Guava (but also move
>>>>> off where we can, and also have our own improvements). I hope it will be a
>>>>> near-empty build file to generate it so not a maintenance burden.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Nice. This sounds like a great idea (or two?) and goes along with
>>>>>> what I just started for futures.
>>>>>>
>>>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>>>
>>>>>> Specific things for our micro guava:
>>>>>>
>>>>>>  - checkArgumentNotNull can throw the right exception
>>>>>>  - our own Optional because Java's is not Serializable
>>>>>>  - futures combinators since many are missing, especially Java's
>>>>>> don't do exceptions right
>>>>>>
>>>>>> Protobuf: didn't file an issue because I'm not sure
>>>>>>
>>>>>> I was wondering if pre-shading works. We really need to get rid of it
>>>>>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>>>>>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>>>>>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>>>>>> a Message object. (and do the same for beam-model-protobuf-java since the
>>>>>> model bits have to depend on each other but nothing else).
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Huge +1 to get rid of Guava!
>>>>>>>
>>>>>>> This solves annoying dependency issues for some IOs and allow us to
>>>>>>> get rid of the shading that makes current jars bigger than they
>>>>>>> should
>>>>>>> be.
>>>>>>>
>>>>>>> We can create our own 'micro guava' package with some classes for
>>>>>>> things that are hard to migrate, or that we  prefer to still have
>>>>>>> like
>>>>>>> the check* methods for example. Given the size of the task we should
>>>>>>> probably divide it into subtasks, more important is to get rid of it
>>>>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>>>>
>>>>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>>>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>>>>>> that is byting us for the Spark runner, but well that's worth a
>>>>>>> different discussion.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>>>>> <rm...@gmail.com> wrote:
>>>>>>> > a map of list is fine and not a challenge we'll face long I hope ;)
>>>>>>> >
>>>>>>> >
>>>>>>> > Romain Manni-Bucau
>>>>>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>>>>>> >
>>>>>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>> >>
>>>>>>> >> Not sure we'll be able to replace them all. Things like guava
>>>>>>> Table and
>>>>>>> >> Multimap don't have great replacements in Java8.
>>>>>>> >>
>>>>>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>>>>>> jb@nanthrax.net>
>>>>>>> >> wrote:
>>>>>>> >>>
>>>>>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>>>>>> >>>
>>>>>>> >>> Regards
>>>>>>> >>> JB
>>>>>>> >>>
>>>>>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>>>>>> >>> > Why not dropping guava for all beam codebase? With java 8 it
>>>>>>> is quite
>>>>>>> >>> > easy to do
>>>>>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with
>>>>>>> quite a
>>>>>>> >>> > good result.
>>>>>>> >>> >
>>>>>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>>>>>> >>> > <ma...@google.com>> a écrit :
>>>>>>> >>> >
>>>>>>> >>> >     Make sure to include the guava version in the artifact
>>>>>>> name so that
>>>>>>> >>> > we can
>>>>>>> >>> >     have multiple vendored versions.
>>>>>>> >>> >
>>>>>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>>>>>> klk@google.com
>>>>>>> >>> >     <ma...@google.com>> wrote:
>>>>>>> >>> >
>>>>>>> >>> >         I didn't have time for this, but it just bit me. We
>>>>>>> definitely
>>>>>>> >>> > have
>>>>>>> >>> >         Guava on the API surface of runner support code in
>>>>>>> ways that
>>>>>>> >>> > get
>>>>>>> >>> >         incompatibly shaded. I will probably start "1a" by
>>>>>>> making a
>>>>>>> >>> > shaded
>>>>>>> >>> >         library org.apache.beam:vendored-guava and starting to
>>>>>>> use it.
>>>>>>> >>> > It sounds
>>>>>>> >>> >         like there is generally unanimous support for that
>>>>>>> much,
>>>>>>> >>> > anyhow.
>>>>>>> >>> >
>>>>>>> >>> >         Kenn
>>>>>>> >>> >
>>>>>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>>>>>> >>> > <aljoscha@apache.org
>>>>>>> >>> >         <ma...@apache.org>> wrote:
>>>>>>> >>> >
>>>>>>> >>> >             Thanks Ismaël for bringing up this discussion
>>>>>>> again!
>>>>>>> >>> >
>>>>>>> >>> >             I would be in favour of 1) and more specifically
>>>>>>> of 1a)
>>>>>>> >>> >
>>>>>>> >>> >             Aljoscha
>>>>>>> >>> >
>>>>>>> >>> >
>>>>>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>>>>>> lcwik@google.com
>>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>             You can always run tests on post shaded artifacts
>>>>>>> instead
>>>>>>> >>> >> of the
>>>>>>> >>> >>             compiled classes, it just requires us to change
>>>>>>> our maven
>>>>>>> >>> >> surefire
>>>>>>> >>> >>             / gradle test configurations but it is true that
>>>>>>> most IDEs
>>>>>>> >>> >> would
>>>>>>> >>> >>             behave better with a dependency jar unless you
>>>>>>> delegate
>>>>>>> >>> >> all the
>>>>>>> >>> >>             build/test actions to the build system and then
>>>>>>> it won't
>>>>>>> >>> >> matter.
>>>>>>> >>> >>
>>>>>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>>>>>> >>> >> <klk@google.com
>>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                 There's also, with additional overhead,
>>>>>>> >>> >>
>>>>>>> >>> >>                 1a) A relocated and shipped package for each
>>>>>>> thing we
>>>>>>> >>> >> want to
>>>>>>> >>> >>                 relocate. I think this has also been tried
>>>>>>> outside
>>>>>>> >>> >> Beam...
>>>>>>> >>> >>
>>>>>>> >>> >>                 Pros:
>>>>>>> >>> >>                 * all the pros of 1) plus no bloat beyond
>>>>>>> what is
>>>>>>> >>> >> necessary
>>>>>>> >>> >>                 Cons:
>>>>>>> >>> >>                 * abandons whitelist approach for public deps,
>>>>>>> >>> >> reverting to
>>>>>>> >>> >>                 blacklist approach for trouble things like
>>>>>>> guava, so a
>>>>>>> >>> >> bit
>>>>>>> >>> >>                 less principled
>>>>>>> >>> >>
>>>>>>> >>> >>                 For both 1) and 1a) I would add:
>>>>>>> >>> >>
>>>>>>> >>> >>                 Pros:
>>>>>>> >>> >>                 * clearly readable dependency since code will
>>>>>>> `import
>>>>>>> >>> >>                 org.apache.beam.private.guava21` and IDEs
>>>>>>> will
>>>>>>> >>> >> understand it
>>>>>>> >>> >>                 is a distinct lilbrary
>>>>>>> >>> >>                 * can run tests on unpackaged classes, as
>>>>>>> long as deps
>>>>>>> >>> >> are
>>>>>>> >>> >>                 shaded or provided as jars
>>>>>>> >>> >>                 * no mysterious action at a distance from
>>>>>>> inherited
>>>>>>> >>> >> configuration
>>>>>>> >>> >>                 Cons:
>>>>>>> >>> >>                 * need to adjust imports
>>>>>>> >>> >>
>>>>>>> >>> >>                 Kenn
>>>>>>> >>> >>
>>>>>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>>>>>> >>> >> <lcwik@google.com
>>>>>>> >>> >>                 <ma...@google.com>> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                     I would suggest that either we use:
>>>>>>> >>> >>                     1) A common deps package containing shaded
>>>>>>> >>> >> dependencies
>>>>>>> >>> >>                     allows for
>>>>>>> >>> >>                     Pros
>>>>>>> >>> >>                     * doesn't require the user to build an
>>>>>>> uber jar
>>>>>>> >>> >>                     Risks
>>>>>>> >>> >>                     * dependencies package will keep growing
>>>>>>> even if
>>>>>>> >>> >> something
>>>>>>> >>> >>                     is or isn't needed by all of Apache Beam
>>>>>>> leading
>>>>>>> >>> >> to a
>>>>>>> >>> >>                     large jar anyways negating any space
>>>>>>> savings
>>>>>>> >>> >>
>>>>>>> >>> >>                     2) Shade within each module to a common
>>>>>>> location
>>>>>>> >>> >> like
>>>>>>> >>> >>                     org.apache.beam.relocated.guava....
>>>>>>> >>> >>                     Pros
>>>>>>> >>> >>                     * you only get the shaded dependencies of
>>>>>>> the
>>>>>>> >>> >> things that
>>>>>>> >>> >>                     are required
>>>>>>> >>> >>                     * its one less dependency for users to
>>>>>>> manage
>>>>>>> >>> >>                     Risks
>>>>>>> >>> >>                     * requires an uber jar to be built to get
>>>>>>> the
>>>>>>> >>> >> space
>>>>>>> >>> >>                     savings (either by a user or a
>>>>>>> distribution of
>>>>>>> >>> >> Apache
>>>>>>> >>> >>                     Beam) otherwise we negate any space
>>>>>>> savings.
>>>>>>> >>> >>
>>>>>>> >>> >>                     If we either use a common relocation
>>>>>>> scheme or a
>>>>>>> >>> >>                     dependencies jar then each relocation
>>>>>>> should
>>>>>>> >>> >> specifically
>>>>>>> >>> >>                     contain the version number of the package
>>>>>>> because
>>>>>>> >>> >> we would
>>>>>>> >>> >>                     like to allow for us to be using two
>>>>>>> different
>>>>>>> >>> >> versions of
>>>>>>> >>> >>                     the same library.
>>>>>>> >>> >>
>>>>>>> >>> >>                     For the common deps package approach,
>>>>>>> should we
>>>>>>> >>> >> check in
>>>>>>> >>> >>                     code where the imports contain the
>>>>>>> relocated
>>>>>>> >>> >> location
>>>>>>> >>> >>                     (e.g. import
>>>>>>> org.apache.beam.guava.20.0.com
>>>>>>> >>> >>
>>>>>>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>>>>>>> ct.ImmutableList)?
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM,
>>>>>>> Jean-Baptiste
>>>>>>> >>> >> Onofré
>>>>>>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>>>>>>> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                         Thanks for bringing that back.
>>>>>>> >>> >>
>>>>>>> >>> >>                         Indeed guava is shaded in different
>>>>>>> uber-jar.
>>>>>>> >>> >> Maybe we
>>>>>>> >>> >>                         can have a common deps module that we
>>>>>>> include
>>>>>>> >>> >> once
>>>>>>> >>> >>                         (but the user will have to explicitly
>>>>>>> define
>>>>>>> >>> >> the dep) ?
>>>>>>> >>> >>
>>>>>>> >>> >>                         Basically, what do you propose for
>>>>>>> protobuf
>>>>>>> >>> >>                         (unfortunately, I don't see an
>>>>>>> obvious) ?
>>>>>>> >>> >>
>>>>>>> >>> >>                         Regards
>>>>>>> >>> >>                         JB
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía
>>>>>>> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                             Hello, I wanted to bring back
>>>>>>> this subject
>>>>>>> >>> >> because
>>>>>>> >>> >>                             I think we should
>>>>>>> >>> >>                             take action on this and at least
>>>>>>> first
>>>>>>> >>> >> have a
>>>>>>> >>> >>                             shaded version of guava.
>>>>>>> >>> >>                             I was playing with a toy project
>>>>>>> and I did
>>>>>>> >>> >> the
>>>>>>> >>> >>                             procedure we use to
>>>>>>> >>> >>                             submit jars to a Hadoop cluster
>>>>>>> via
>>>>>>> >>> >> Flink/Spark
>>>>>>> >>> >>                             which involves
>>>>>>> >>> >>                             creating an uber jar and I
>>>>>>> realized that
>>>>>>> >>> >> the size
>>>>>>> >>> >>                             of the jar was way
>>>>>>> >>> >>                             bigger than I expected, and the
>>>>>>> fact that
>>>>>>> >>> >> we shade
>>>>>>> >>> >>                             guava in every
>>>>>>> >>> >>                             module contributes to this. I
>>>>>>> found guava
>>>>>>> >>> >> shaded on:
>>>>>>> >>> >>
>>>>>>> >>> >>                             sdks/java/core
>>>>>>> >>> >>                             runners/core-construction-java
>>>>>>> >>> >>                             runners/core-java
>>>>>>> >>> >>                             model/job-management
>>>>>>> >>> >>                             runners/spark
>>>>>>> >>> >>                             sdks/java/io/hadoop-file-system
>>>>>>> >>> >>                             sdks/java/io/kafka
>>>>>>> >>> >>
>>>>>>> >>> >>                             This means at least 6 times more
>>>>>>> of the
>>>>>>> >>> >> size it
>>>>>>> >>> >>                             should which counts in
>>>>>>> >>> >>                             around 15MB more (2.4MB*6 deps)
>>>>>>> of extra
>>>>>>> >>> >> weight
>>>>>>> >>> >>                             that we can simply
>>>>>>> >>> >>                             reduce by using a shaded version
>>>>>>> of the
>>>>>>> >>> >> library.
>>>>>>> >>> >>
>>>>>>> >>> >>                             I add this point to the previous
>>>>>>> ones
>>>>>>> >>> >> mentioned
>>>>>>> >>> >>                             during the discussion
>>>>>>> >>> >>                             because this goes against the end
>>>>>>> user
>>>>>>> >>> >> experience
>>>>>>> >>> >>                             and affects us all
>>>>>>> >>> >>                             (devs/users).
>>>>>>> >>> >>
>>>>>>> >>> >>                             Another question is if we should
>>>>>>> shade
>>>>>>> >>> >> (and how)
>>>>>>> >>> >>                             protocol buffers
>>>>>>> >>> >>                             because now with the portability
>>>>>>> work we
>>>>>>> >>> >> are
>>>>>>> >>> >>                             exposing it closer to the
>>>>>>> >>> >>                             end users. I say this because I
>>>>>>> also found
>>>>>>> >>> >> an
>>>>>>> >>> >>                             issue while running a
>>>>>>> >>> >>                             job on YARN with the spark runner
>>>>>>> because
>>>>>>> >>> >>                             hadoop-common includes
>>>>>>> >>> >>                             protobuf-java 2 and I had to
>>>>>>> explicitly
>>>>>>> >>> >> provide
>>>>>>> >>> >>                             protocol-buffers 3 as
>>>>>>> >>> >>                             a dependency to be able to use
>>>>>>> triggers
>>>>>>> >>> >> (note the
>>>>>>> >>> >>                             Spark runner
>>>>>>> >>> >>                             translates them using some method
>>>>>>> from
>>>>>>> >>> >>                             runners/core-java). Since
>>>>>>> >>> >>                             hadoop-common is provided in the
>>>>>>> cluster
>>>>>>> >>> >> with the
>>>>>>> >>> >>                             older version of
>>>>>>> >>> >>                             protobuf, I am afraid that this
>>>>>>> will bite
>>>>>>> >>> >> us in
>>>>>>> >>> >>                             the future.
>>>>>>> >>> >>
>>>>>>> >>> >>                             Ismaël
>>>>>>> >>> >>
>>>>>>> >>> >>                             ps. There is already a JIRA for
>>>>>>> that
>>>>>>> >>> >> shading for
>>>>>>> >>> >>                             protobuf on
>>>>>>> >>> >>                             hadoop-common but this is not
>>>>>>> coming until
>>>>>>> >>> >> version
>>>>>>> >>> >>                             3 is out.
>>>>>>> >>> >>
>>>>>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>>> >>> >>
>>>>>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>>>>>> >>> >>
>>>>>>> >>> >>                             ps2. Extra curious situation is
>>>>>>> to see
>>>>>>> >>> >> that the
>>>>>>> >>> >>                             dataflow-runner ends
>>>>>>> >>> >>                             up having guava shaded twice via
>>>>>>> its
>>>>>>> >>> >> shaded version on
>>>>>>> >>> >>                             core-construction-java.
>>>>>>> >>> >>
>>>>>>> >>> >>                             ps3. Of course this message means
>>>>>>> a
>>>>>>> >>> >> de-facto +1 at
>>>>>>> >>> >>                             least to do it for
>>>>>>> >>> >>                             guava and evaluate it for other
>>>>>>> libs.
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM,
>>>>>>> Lukasz
>>>>>>> >>> >> Cwik
>>>>>>> >>> >>                             <lcwik@google.com.invalid
>>>>>>> >>> >>                             <ma...@google.com.invalid>>
>>>>>>> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                                 An issue to call out is how
>>>>>>> to deal
>>>>>>> >>> >> with our
>>>>>>> >>> >>                                 generated code (.avro and
>>>>>>> >>> >>                                 .proto) as I don't believe
>>>>>>> those
>>>>>>> >>> >> plugins allow
>>>>>>> >>> >>                                 you to generate code using a
>>>>>>> >>> >>                                 shaded package prefix on
>>>>>>> imports.
>>>>>>> >>> >>
>>>>>>> >>> >>                                 On Tue, Oct 17, 2017 at 10:28
>>>>>>> AM,
>>>>>>> >>> >> Thomas Groh
>>>>>>> >>> >>                                 <tgroh@google.com.invalid
>>>>>>> >>> >>                                 <mailto:tgroh@google.com.inva
>>>>>>> lid>>
>>>>>>> >>> >>                                 wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                                     +1 to the goal. I'm
>>>>>>> hugely in
>>>>>>> >>> >> favor of not
>>>>>>> >>> >>                                     doing the same shading
>>>>>>> work
>>>>>>> >>> >>                                     every time for
>>>>>>> dependencies we
>>>>>>> >>> >> know we'll use.
>>>>>>> >>> >>
>>>>>>> >>> >>                                     This also means that if
>>>>>>> we end up
>>>>>>> >>> >> pulling
>>>>>>> >>> >>                                     in transitive
>>>>>>> dependencies we
>>>>>>> >>> >>                                     don't want in any
>>>>>>> particular
>>>>>>> >>> >> module we can
>>>>>>> >>> >>                                     avoid having to adjust our
>>>>>>> >>> >>                                     repackaging strategy for
>>>>>>> that
>>>>>>> >>> >> module -
>>>>>>> >>> >>                                     which I have run into
>>>>>>> face-first
>>>>>>> >>> >> in
>>>>>>> >>> >>                                     the past.
>>>>>>> >>> >>
>>>>>>> >>> >>                                     On Tue, Oct 17, 2017 at
>>>>>>> 9:48 AM,
>>>>>>> >>> >> Kenneth
>>>>>>> >>> >>                                     Knowles
>>>>>>> <klk@google.com.invalid
>>>>>>> >>> >>                                     <mailto:
>>>>>>> klk@google.com.invalid>>
>>>>>>> >>> >>                                     wrote:
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Hi all,
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Shading is a big part
>>>>>>> of how
>>>>>>> >>> >> we keep
>>>>>>> >>> >>                                         our dependencies sane
>>>>>>> in Beam.
>>>>>>> >>> >> But
>>>>>>> >>> >>                                         downsides: shading is
>>>>>>> super
>>>>>>> >>> >> slow,
>>>>>>> >>> >>                                         causes massive jar
>>>>>>> bloat, and
>>>>>>> >>> >> kind of
>>>>>>> >>> >>                                         hard to get right
>>>>>>> because
>>>>>>> >>> >> artifacts
>>>>>>> >>> >>                                         and namespaces are
>>>>>>> not 1-to-1.
>>>>>>> >>> >>
>>>>>>> >>> >>                                         I know that some
>>>>>>> communities
>>>>>>> >>> >>                                         distribute their own
>>>>>>> shaded
>>>>>>> >>> >>                                         distributions of
>>>>>>> >>> >>                                         dependencies. I had a
>>>>>>> thought
>>>>>>> >>> >> about
>>>>>>> >>> >>                                         doing something
>>>>>>> similar that I
>>>>>>> >>> >> wanted
>>>>>>> >>> >>                                         to throw out there
>>>>>>> for people
>>>>>>> >>> >> to poke
>>>>>>> >>> >>                                         holes in.
>>>>>>> >>> >>
>>>>>>> >>> >>                                         To set the scene,
>>>>>>> here is how
>>>>>>> >>> >> I view
>>>>>>> >>> >>                                         shading:
>>>>>>> >>> >>
>>>>>>> >>> >>                                           - A module has
>>>>>>> public
>>>>>>> >>> >> dependencies
>>>>>>> >>> >>                                         and private
>>>>>>> dependencies.
>>>>>>> >>> >>                                           - Public deps are
>>>>>>> used for
>>>>>>> >>> >> data
>>>>>>> >>> >>                                         interchange; users
>>>>>>> must share
>>>>>>> >>> >> these
>>>>>>> >>> >>
>>>>>>> >>> >>                                     deps.
>>>>>>> >>> >>
>>>>>>> >>> >>                                           - Private deps are
>>>>>>> just
>>>>>>> >>> >>                                         functionality and can
>>>>>>> be
>>>>>>> >>> >> hidden (in
>>>>>>> >>> >>                                         our case,
>>>>>>> >>> >>                                         relocated + bundled)
>>>>>>> >>> >>                                           - It isn't
>>>>>>> necessarily that
>>>>>>> >>> >> simple,
>>>>>>> >>> >>                                         because public and
>>>>>>> private
>>>>>>> >>> >> deps
>>>>>>> >>> >>
>>>>>>> >>> >>                                     might
>>>>>>> >>> >>
>>>>>>> >>> >>                                         interact in
>>>>>>> higher-order ways
>>>>>>> >>> >>                                         ("public" is
>>>>>>> contagious)
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Shading is an
>>>>>>> implementation
>>>>>>> >>> >> detail of
>>>>>>> >>> >>                                         expressing these
>>>>>>> >>> >> characteristics.
>>>>>>> >>> >>
>>>>>>> >>> >>                                     We
>>>>>>> >>> >>
>>>>>>> >>> >>                                         use shading
>>>>>>> selectively
>>>>>>> >>> >> because of its
>>>>>>> >>> >>                                         downsides I mentioned
>>>>>>> above.
>>>>>>> >>> >>
>>>>>>> >>> >>                                         But what about this
>>>>>>> idea:
>>>>>>> >>> >> Introduce
>>>>>>> >>> >>                                         shaded deps as a
>>>>>>> single
>>>>>>> >>> >> separate
>>>>>>> >>> >>                                         artifact.
>>>>>>> >>> >>
>>>>>>> >>> >>                                           -
>>>>>>> sdks/java/private-deps:
>>>>>>> >>> >> bundled
>>>>>>> >>> >>                                         uber jar with
>>>>>>> relocated
>>>>>>> >>> >> versions of
>>>>>>> >>> >>                                         everything we want to
>>>>>>> shade
>>>>>>> >>> >>
>>>>>>> >>> >>                                           - sdks/java/core and
>>>>>>> >>> >>                                         sdks/java/harness: no
>>>>>>> >>> >> relocation or
>>>>>>> >>> >>                                         bundling -
>>>>>>> >>> >>                                         depends on
>>>>>>> >>> >>
>>>>>>>  `beam-sdks-java-private-deps`
>>>>>>> >>> >> and
>>>>>>> >>> >>                                         imports like
>>>>>>> >>> >>
>>>>>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>>>>>> >>> >>                                         directly (this is what
>>>>>>> >>> >>                                         they
>>>>>>> >>> >>                                         are rewritten to
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Some benefits
>>>>>>> >>> >>
>>>>>>> >>> >>                                           - much faster
>>>>>>> builds of
>>>>>>> >>> >> other modules
>>>>>>> >>> >>                                           - only one shaded
>>>>>>> uber jar
>>>>>>> >>> >>                                           - rare/no rebuilds
>>>>>>> of the
>>>>>>> >>> >> uber jar
>>>>>>> >>> >>                                           - can use maven
>>>>>>> enforcer to
>>>>>>> >>> >> forbid
>>>>>>> >>> >>                                         imports like
>>>>>>> com.google.common
>>>>>>> >>> >>                                           - configuration all
>>>>>>> in one
>>>>>>> >>> >> place
>>>>>>> >>> >>                                           - no automated
>>>>>>> rewriting of
>>>>>>> >>> >> our real
>>>>>>> >>> >>                                         code, which has led
>>>>>>> to some
>>>>>>> >>> >> major
>>>>>>> >>> >>                                         confusion
>>>>>>> >>> >>                                           - easy to implement
>>>>>>> >>> >> incrementally
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Downsides:
>>>>>>> >>> >>
>>>>>>> >>> >>                                           - plenty of effort
>>>>>>> work to
>>>>>>> >>> >> get there
>>>>>>> >>> >>                                           - unclear how many
>>>>>>> different
>>>>>>> >>> >> such
>>>>>>> >>> >>                                         deps modules we need;
>>>>>>> sharing
>>>>>>> >>> >> them
>>>>>>> >>> >>
>>>>>>> >>> >>                                     could
>>>>>>> >>> >>
>>>>>>> >>> >>                                         get weird
>>>>>>> >>> >>                                           - if we hit a
>>>>>>> roadblock, we
>>>>>>> >>> >> will
>>>>>>> >>> >>                                         have committed a lot
>>>>>>> of time
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Just something I was
>>>>>>> musing as
>>>>>>> >>> >> I spent
>>>>>>> >>> >>                                         another evening
>>>>>>> waiting for
>>>>>>> >>> >> slow
>>>>>>> >>> >>                                         builds to try to
>>>>>>> confirm
>>>>>>> >>> >> changes to
>>>>>>> >>> >>                                         brittle poms.
>>>>>>> >>> >>
>>>>>>> >>> >>                                         Kenn
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>                         --
>>>>>>> >>> >>                         Jean-Baptiste Onofré
>>>>>>> >>> >>                         jbonofre@apache.org
>>>>>>> >>> >> <ma...@apache.org>
>>>>>>> >>> >>                         http://blog.nanthrax.net
>>>>>>> >>> >> <http://blog.nanthrax.net/>
>>>>>>> >>> >>                         Talend - http://www.talend.com
>>>>>>> >>> >> <http://www.talend.com/>
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >>
>>>>>>> >>> >
>>>>>>> >>> >
>>>>>>> >>> >
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> Jean-Baptiste Onofré
>>>>>>> >>> jbonofre@apache.org
>>>>>>> >>> http://blog.nanthrax.net
>>>>>>> >>> Talend - http://www.talend.com
>>>>>>> >>
>>>>>>> >>
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Kenneth Knowles <kl...@google.com>.
On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Don't forget beam doesn't support much behind it (mainly only a few
> ParameterizedType due the usage code path) so it is mainly only about
> handling parameterized types and typevariables recursively. Not a lot of
> work. Main concern being it is in the API so using a shade as an API is
> never a good idea, in particular cause the shade can be broken and requires
> to setup clirr or things like that and when it breaks you are done and need
> to fork it anyway. Limiting the dependencies for an API - as beam is - is
> always saner even if it requires a small fork of code.
>

The main thing that TypeToken is used for is capturing generics that are
lost by Java reflection. It is a bit tricky, actually.

Kenn



>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>
>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <rmannibucau@gmail.com
>> > wrote:
>>
>>>
>>>
>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>>
>>>> Another couple:
>>>>
>>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>>> TypeToken
>>>>
>>>
>>> Technically reflect Type is enough
>>>
>>
>> If you want to try to remove TypeToken from underneath TypeDescriptor, I
>> have no objections as long as you expand the test suite to cover all the
>> functionality where we trust TypeToken's tests. Good luck :-)
>>
>> Kenn
>>
>>
>>>
>>>
>>>>  - ImmutableList and friends and their builders are very widely used
>>>> and IMO still add a lot for readability and preventing someone coming along
>>>> and adding mistakes to a codebase
>>>>
>>>
>>> Sugar but not required. When you compare the cost of a shade versus of
>>> duplicating the parts we need there is no real match IMHO.
>>>
>>>
>>>>
>>>> So considering it all, I would keep a vendored Guava (but also move off
>>>> where we can, and also have our own improvements). I hope it will be a
>>>> near-empty build file to generate it so not a maintenance burden.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com> wrote:
>>>>
>>>>> Nice. This sounds like a great idea (or two?) and goes along with what
>>>>> I just started for futures.
>>>>>
>>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>>
>>>>> Specific things for our micro guava:
>>>>>
>>>>>  - checkArgumentNotNull can throw the right exception
>>>>>  - our own Optional because Java's is not Serializable
>>>>>  - futures combinators since many are missing, especially Java's don't
>>>>> do exceptions right
>>>>>
>>>>> Protobuf: didn't file an issue because I'm not sure
>>>>>
>>>>> I was wondering if pre-shading works. We really need to get rid of it
>>>>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>>>>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>>>>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>>>>> a Message object. (and do the same for beam-model-protobuf-java since the
>>>>> model bits have to depend on each other but nothing else).
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Huge +1 to get rid of Guava!
>>>>>>
>>>>>> This solves annoying dependency issues for some IOs and allow us to
>>>>>> get rid of the shading that makes current jars bigger than they should
>>>>>> be.
>>>>>>
>>>>>> We can create our own 'micro guava' package with some classes for
>>>>>> things that are hard to migrate, or that we  prefer to still have like
>>>>>> the check* methods for example. Given the size of the task we should
>>>>>> probably divide it into subtasks, more important is to get rid of it
>>>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>>>
>>>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>>>>> that is byting us for the Spark runner, but well that's worth a
>>>>>> different discussion.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>>>> <rm...@gmail.com> wrote:
>>>>>> > a map of list is fine and not a challenge we'll face long I hope ;)
>>>>>> >
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>>>>> >
>>>>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>> >>
>>>>>> >> Not sure we'll be able to replace them all. Things like guava
>>>>>> Table and
>>>>>> >> Multimap don't have great replacements in Java8.
>>>>>> >>
>>>>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>>>>> jb@nanthrax.net>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>>>>> >>>
>>>>>> >>> Regards
>>>>>> >>> JB
>>>>>> >>>
>>>>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>>>>> >>> > Why not dropping guava for all beam codebase? With java 8 it is
>>>>>> quite
>>>>>> >>> > easy to do
>>>>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with
>>>>>> quite a
>>>>>> >>> > good result.
>>>>>> >>> >
>>>>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>>>>> >>> > <ma...@google.com>> a écrit :
>>>>>> >>> >
>>>>>> >>> >     Make sure to include the guava version in the artifact name
>>>>>> so that
>>>>>> >>> > we can
>>>>>> >>> >     have multiple vendored versions.
>>>>>> >>> >
>>>>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>>>>> klk@google.com
>>>>>> >>> >     <ma...@google.com>> wrote:
>>>>>> >>> >
>>>>>> >>> >         I didn't have time for this, but it just bit me. We
>>>>>> definitely
>>>>>> >>> > have
>>>>>> >>> >         Guava on the API surface of runner support code in ways
>>>>>> that
>>>>>> >>> > get
>>>>>> >>> >         incompatibly shaded. I will probably start "1a" by
>>>>>> making a
>>>>>> >>> > shaded
>>>>>> >>> >         library org.apache.beam:vendored-guava and starting to
>>>>>> use it.
>>>>>> >>> > It sounds
>>>>>> >>> >         like there is generally unanimous support for that much,
>>>>>> >>> > anyhow.
>>>>>> >>> >
>>>>>> >>> >         Kenn
>>>>>> >>> >
>>>>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>>>>> >>> > <aljoscha@apache.org
>>>>>> >>> >         <ma...@apache.org>> wrote:
>>>>>> >>> >
>>>>>> >>> >             Thanks Ismaël for bringing up this discussion again!
>>>>>> >>> >
>>>>>> >>> >             I would be in favour of 1) and more specifically of
>>>>>> 1a)
>>>>>> >>> >
>>>>>> >>> >             Aljoscha
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>>>>> lcwik@google.com
>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>> >>> >>
>>>>>> >>> >>             You can always run tests on post shaded artifacts
>>>>>> instead
>>>>>> >>> >> of the
>>>>>> >>> >>             compiled classes, it just requires us to change
>>>>>> our maven
>>>>>> >>> >> surefire
>>>>>> >>> >>             / gradle test configurations but it is true that
>>>>>> most IDEs
>>>>>> >>> >> would
>>>>>> >>> >>             behave better with a dependency jar unless you
>>>>>> delegate
>>>>>> >>> >> all the
>>>>>> >>> >>             build/test actions to the build system and then it
>>>>>> won't
>>>>>> >>> >> matter.
>>>>>> >>> >>
>>>>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>>>>> >>> >> <klk@google.com
>>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>>> >>> >>
>>>>>> >>> >>                 There's also, with additional overhead,
>>>>>> >>> >>
>>>>>> >>> >>                 1a) A relocated and shipped package for each
>>>>>> thing we
>>>>>> >>> >> want to
>>>>>> >>> >>                 relocate. I think this has also been tried
>>>>>> outside
>>>>>> >>> >> Beam...
>>>>>> >>> >>
>>>>>> >>> >>                 Pros:
>>>>>> >>> >>                 * all the pros of 1) plus no bloat beyond what
>>>>>> is
>>>>>> >>> >> necessary
>>>>>> >>> >>                 Cons:
>>>>>> >>> >>                 * abandons whitelist approach for public deps,
>>>>>> >>> >> reverting to
>>>>>> >>> >>                 blacklist approach for trouble things like
>>>>>> guava, so a
>>>>>> >>> >> bit
>>>>>> >>> >>                 less principled
>>>>>> >>> >>
>>>>>> >>> >>                 For both 1) and 1a) I would add:
>>>>>> >>> >>
>>>>>> >>> >>                 Pros:
>>>>>> >>> >>                 * clearly readable dependency since code will
>>>>>> `import
>>>>>> >>> >>                 org.apache.beam.private.guava21` and IDEs will
>>>>>> >>> >> understand it
>>>>>> >>> >>                 is a distinct lilbrary
>>>>>> >>> >>                 * can run tests on unpackaged classes, as long
>>>>>> as deps
>>>>>> >>> >> are
>>>>>> >>> >>                 shaded or provided as jars
>>>>>> >>> >>                 * no mysterious action at a distance from
>>>>>> inherited
>>>>>> >>> >> configuration
>>>>>> >>> >>                 Cons:
>>>>>> >>> >>                 * need to adjust imports
>>>>>> >>> >>
>>>>>> >>> >>                 Kenn
>>>>>> >>> >>
>>>>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>>>>> >>> >> <lcwik@google.com
>>>>>> >>> >>                 <ma...@google.com>> wrote:
>>>>>> >>> >>
>>>>>> >>> >>                     I would suggest that either we use:
>>>>>> >>> >>                     1) A common deps package containing shaded
>>>>>> >>> >> dependencies
>>>>>> >>> >>                     allows for
>>>>>> >>> >>                     Pros
>>>>>> >>> >>                     * doesn't require the user to build an
>>>>>> uber jar
>>>>>> >>> >>                     Risks
>>>>>> >>> >>                     * dependencies package will keep growing
>>>>>> even if
>>>>>> >>> >> something
>>>>>> >>> >>                     is or isn't needed by all of Apache Beam
>>>>>> leading
>>>>>> >>> >> to a
>>>>>> >>> >>                     large jar anyways negating any space
>>>>>> savings
>>>>>> >>> >>
>>>>>> >>> >>                     2) Shade within each module to a common
>>>>>> location
>>>>>> >>> >> like
>>>>>> >>> >>                     org.apache.beam.relocated.guava....
>>>>>> >>> >>                     Pros
>>>>>> >>> >>                     * you only get the shaded dependencies of
>>>>>> the
>>>>>> >>> >> things that
>>>>>> >>> >>                     are required
>>>>>> >>> >>                     * its one less dependency for users to
>>>>>> manage
>>>>>> >>> >>                     Risks
>>>>>> >>> >>                     * requires an uber jar to be built to get
>>>>>> the
>>>>>> >>> >> space
>>>>>> >>> >>                     savings (either by a user or a
>>>>>> distribution of
>>>>>> >>> >> Apache
>>>>>> >>> >>                     Beam) otherwise we negate any space
>>>>>> savings.
>>>>>> >>> >>
>>>>>> >>> >>                     If we either use a common relocation
>>>>>> scheme or a
>>>>>> >>> >>                     dependencies jar then each relocation
>>>>>> should
>>>>>> >>> >> specifically
>>>>>> >>> >>                     contain the version number of the package
>>>>>> because
>>>>>> >>> >> we would
>>>>>> >>> >>                     like to allow for us to be using two
>>>>>> different
>>>>>> >>> >> versions of
>>>>>> >>> >>                     the same library.
>>>>>> >>> >>
>>>>>> >>> >>                     For the common deps package approach,
>>>>>> should we
>>>>>> >>> >> check in
>>>>>> >>> >>                     code where the imports contain the
>>>>>> relocated
>>>>>> >>> >> location
>>>>>> >>> >>                     (e.g. import
>>>>>> org.apache.beam.guava.20.0.com
>>>>>> >>> >>
>>>>>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>>>>>> ct.ImmutableList)?
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM,
>>>>>> Jean-Baptiste
>>>>>> >>> >> Onofré
>>>>>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>>>>>> wrote:
>>>>>> >>> >>
>>>>>> >>> >>                         Thanks for bringing that back.
>>>>>> >>> >>
>>>>>> >>> >>                         Indeed guava is shaded in different
>>>>>> uber-jar.
>>>>>> >>> >> Maybe we
>>>>>> >>> >>                         can have a common deps module that we
>>>>>> include
>>>>>> >>> >> once
>>>>>> >>> >>                         (but the user will have to explicitly
>>>>>> define
>>>>>> >>> >> the dep) ?
>>>>>> >>> >>
>>>>>> >>> >>                         Basically, what do you propose for
>>>>>> protobuf
>>>>>> >>> >>                         (unfortunately, I don't see an
>>>>>> obvious) ?
>>>>>> >>> >>
>>>>>> >>> >>                         Regards
>>>>>> >>> >>                         JB
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía
>>>>>> wrote:
>>>>>> >>> >>
>>>>>> >>> >>                             Hello, I wanted to bring back this
>>>>>> subject
>>>>>> >>> >> because
>>>>>> >>> >>                             I think we should
>>>>>> >>> >>                             take action on this and at least
>>>>>> first
>>>>>> >>> >> have a
>>>>>> >>> >>                             shaded version of guava.
>>>>>> >>> >>                             I was playing with a toy project
>>>>>> and I did
>>>>>> >>> >> the
>>>>>> >>> >>                             procedure we use to
>>>>>> >>> >>                             submit jars to a Hadoop cluster via
>>>>>> >>> >> Flink/Spark
>>>>>> >>> >>                             which involves
>>>>>> >>> >>                             creating an uber jar and I
>>>>>> realized that
>>>>>> >>> >> the size
>>>>>> >>> >>                             of the jar was way
>>>>>> >>> >>                             bigger than I expected, and the
>>>>>> fact that
>>>>>> >>> >> we shade
>>>>>> >>> >>                             guava in every
>>>>>> >>> >>                             module contributes to this. I
>>>>>> found guava
>>>>>> >>> >> shaded on:
>>>>>> >>> >>
>>>>>> >>> >>                             sdks/java/core
>>>>>> >>> >>                             runners/core-construction-java
>>>>>> >>> >>                             runners/core-java
>>>>>> >>> >>                             model/job-management
>>>>>> >>> >>                             runners/spark
>>>>>> >>> >>                             sdks/java/io/hadoop-file-system
>>>>>> >>> >>                             sdks/java/io/kafka
>>>>>> >>> >>
>>>>>> >>> >>                             This means at least 6 times more
>>>>>> of the
>>>>>> >>> >> size it
>>>>>> >>> >>                             should which counts in
>>>>>> >>> >>                             around 15MB more (2.4MB*6 deps) of
>>>>>> extra
>>>>>> >>> >> weight
>>>>>> >>> >>                             that we can simply
>>>>>> >>> >>                             reduce by using a shaded version
>>>>>> of the
>>>>>> >>> >> library.
>>>>>> >>> >>
>>>>>> >>> >>                             I add this point to the previous
>>>>>> ones
>>>>>> >>> >> mentioned
>>>>>> >>> >>                             during the discussion
>>>>>> >>> >>                             because this goes against the end
>>>>>> user
>>>>>> >>> >> experience
>>>>>> >>> >>                             and affects us all
>>>>>> >>> >>                             (devs/users).
>>>>>> >>> >>
>>>>>> >>> >>                             Another question is if we should
>>>>>> shade
>>>>>> >>> >> (and how)
>>>>>> >>> >>                             protocol buffers
>>>>>> >>> >>                             because now with the portability
>>>>>> work we
>>>>>> >>> >> are
>>>>>> >>> >>                             exposing it closer to the
>>>>>> >>> >>                             end users. I say this because I
>>>>>> also found
>>>>>> >>> >> an
>>>>>> >>> >>                             issue while running a
>>>>>> >>> >>                             job on YARN with the spark runner
>>>>>> because
>>>>>> >>> >>                             hadoop-common includes
>>>>>> >>> >>                             protobuf-java 2 and I had to
>>>>>> explicitly
>>>>>> >>> >> provide
>>>>>> >>> >>                             protocol-buffers 3 as
>>>>>> >>> >>                             a dependency to be able to use
>>>>>> triggers
>>>>>> >>> >> (note the
>>>>>> >>> >>                             Spark runner
>>>>>> >>> >>                             translates them using some method
>>>>>> from
>>>>>> >>> >>                             runners/core-java). Since
>>>>>> >>> >>                             hadoop-common is provided in the
>>>>>> cluster
>>>>>> >>> >> with the
>>>>>> >>> >>                             older version of
>>>>>> >>> >>                             protobuf, I am afraid that this
>>>>>> will bite
>>>>>> >>> >> us in
>>>>>> >>> >>                             the future.
>>>>>> >>> >>
>>>>>> >>> >>                             Ismaël
>>>>>> >>> >>
>>>>>> >>> >>                             ps. There is already a JIRA for
>>>>>> that
>>>>>> >>> >> shading for
>>>>>> >>> >>                             protobuf on
>>>>>> >>> >>                             hadoop-common but this is not
>>>>>> coming until
>>>>>> >>> >> version
>>>>>> >>> >>                             3 is out.
>>>>>> >>> >>
>>>>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>> >>> >>
>>>>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>>>>> >>> >>
>>>>>> >>> >>                             ps2. Extra curious situation is to
>>>>>> see
>>>>>> >>> >> that the
>>>>>> >>> >>                             dataflow-runner ends
>>>>>> >>> >>                             up having guava shaded twice via
>>>>>> its
>>>>>> >>> >> shaded version on
>>>>>> >>> >>                             core-construction-java.
>>>>>> >>> >>
>>>>>> >>> >>                             ps3. Of course this message means a
>>>>>> >>> >> de-facto +1 at
>>>>>> >>> >>                             least to do it for
>>>>>> >>> >>                             guava and evaluate it for other
>>>>>> libs.
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM,
>>>>>> Lukasz
>>>>>> >>> >> Cwik
>>>>>> >>> >>                             <lcwik@google.com.invalid
>>>>>> >>> >>                             <ma...@google.com.invalid>>
>>>>>> wrote:
>>>>>> >>> >>
>>>>>> >>> >>                                 An issue to call out is how to
>>>>>> deal
>>>>>> >>> >> with our
>>>>>> >>> >>                                 generated code (.avro and
>>>>>> >>> >>                                 .proto) as I don't believe
>>>>>> those
>>>>>> >>> >> plugins allow
>>>>>> >>> >>                                 you to generate code using a
>>>>>> >>> >>                                 shaded package prefix on
>>>>>> imports.
>>>>>> >>> >>
>>>>>> >>> >>                                 On Tue, Oct 17, 2017 at 10:28
>>>>>> AM,
>>>>>> >>> >> Thomas Groh
>>>>>> >>> >>                                 <tgroh@google.com.invalid
>>>>>> >>> >>                                 <mailto:tgroh@google.com.inva
>>>>>> lid>>
>>>>>> >>> >>                                 wrote:
>>>>>> >>> >>
>>>>>> >>> >>                                     +1 to the goal. I'm hugely
>>>>>> in
>>>>>> >>> >> favor of not
>>>>>> >>> >>                                     doing the same shading work
>>>>>> >>> >>                                     every time for
>>>>>> dependencies we
>>>>>> >>> >> know we'll use.
>>>>>> >>> >>
>>>>>> >>> >>                                     This also means that if we
>>>>>> end up
>>>>>> >>> >> pulling
>>>>>> >>> >>                                     in transitive dependencies
>>>>>> we
>>>>>> >>> >>                                     don't want in any
>>>>>> particular
>>>>>> >>> >> module we can
>>>>>> >>> >>                                     avoid having to adjust our
>>>>>> >>> >>                                     repackaging strategy for
>>>>>> that
>>>>>> >>> >> module -
>>>>>> >>> >>                                     which I have run into
>>>>>> face-first
>>>>>> >>> >> in
>>>>>> >>> >>                                     the past.
>>>>>> >>> >>
>>>>>> >>> >>                                     On Tue, Oct 17, 2017 at
>>>>>> 9:48 AM,
>>>>>> >>> >> Kenneth
>>>>>> >>> >>                                     Knowles
>>>>>> <klk@google.com.invalid
>>>>>> >>> >>                                     <mailto:
>>>>>> klk@google.com.invalid>>
>>>>>> >>> >>                                     wrote:
>>>>>> >>> >>
>>>>>> >>> >>                                         Hi all,
>>>>>> >>> >>
>>>>>> >>> >>                                         Shading is a big part
>>>>>> of how
>>>>>> >>> >> we keep
>>>>>> >>> >>                                         our dependencies sane
>>>>>> in Beam.
>>>>>> >>> >> But
>>>>>> >>> >>                                         downsides: shading is
>>>>>> super
>>>>>> >>> >> slow,
>>>>>> >>> >>                                         causes massive jar
>>>>>> bloat, and
>>>>>> >>> >> kind of
>>>>>> >>> >>                                         hard to get right
>>>>>> because
>>>>>> >>> >> artifacts
>>>>>> >>> >>                                         and namespaces are not
>>>>>> 1-to-1.
>>>>>> >>> >>
>>>>>> >>> >>                                         I know that some
>>>>>> communities
>>>>>> >>> >>                                         distribute their own
>>>>>> shaded
>>>>>> >>> >>                                         distributions of
>>>>>> >>> >>                                         dependencies. I had a
>>>>>> thought
>>>>>> >>> >> about
>>>>>> >>> >>                                         doing something
>>>>>> similar that I
>>>>>> >>> >> wanted
>>>>>> >>> >>                                         to throw out there for
>>>>>> people
>>>>>> >>> >> to poke
>>>>>> >>> >>                                         holes in.
>>>>>> >>> >>
>>>>>> >>> >>                                         To set the scene, here
>>>>>> is how
>>>>>> >>> >> I view
>>>>>> >>> >>                                         shading:
>>>>>> >>> >>
>>>>>> >>> >>                                           - A module has public
>>>>>> >>> >> dependencies
>>>>>> >>> >>                                         and private
>>>>>> dependencies.
>>>>>> >>> >>                                           - Public deps are
>>>>>> used for
>>>>>> >>> >> data
>>>>>> >>> >>                                         interchange; users
>>>>>> must share
>>>>>> >>> >> these
>>>>>> >>> >>
>>>>>> >>> >>                                     deps.
>>>>>> >>> >>
>>>>>> >>> >>                                           - Private deps are
>>>>>> just
>>>>>> >>> >>                                         functionality and can
>>>>>> be
>>>>>> >>> >> hidden (in
>>>>>> >>> >>                                         our case,
>>>>>> >>> >>                                         relocated + bundled)
>>>>>> >>> >>                                           - It isn't
>>>>>> necessarily that
>>>>>> >>> >> simple,
>>>>>> >>> >>                                         because public and
>>>>>> private
>>>>>> >>> >> deps
>>>>>> >>> >>
>>>>>> >>> >>                                     might
>>>>>> >>> >>
>>>>>> >>> >>                                         interact in
>>>>>> higher-order ways
>>>>>> >>> >>                                         ("public" is
>>>>>> contagious)
>>>>>> >>> >>
>>>>>> >>> >>                                         Shading is an
>>>>>> implementation
>>>>>> >>> >> detail of
>>>>>> >>> >>                                         expressing these
>>>>>> >>> >> characteristics.
>>>>>> >>> >>
>>>>>> >>> >>                                     We
>>>>>> >>> >>
>>>>>> >>> >>                                         use shading selectively
>>>>>> >>> >> because of its
>>>>>> >>> >>                                         downsides I mentioned
>>>>>> above.
>>>>>> >>> >>
>>>>>> >>> >>                                         But what about this
>>>>>> idea:
>>>>>> >>> >> Introduce
>>>>>> >>> >>                                         shaded deps as a single
>>>>>> >>> >> separate
>>>>>> >>> >>                                         artifact.
>>>>>> >>> >>
>>>>>> >>> >>                                           -
>>>>>> sdks/java/private-deps:
>>>>>> >>> >> bundled
>>>>>> >>> >>                                         uber jar with relocated
>>>>>> >>> >> versions of
>>>>>> >>> >>                                         everything we want to
>>>>>> shade
>>>>>> >>> >>
>>>>>> >>> >>                                           - sdks/java/core and
>>>>>> >>> >>                                         sdks/java/harness: no
>>>>>> >>> >> relocation or
>>>>>> >>> >>                                         bundling -
>>>>>> >>> >>                                         depends on
>>>>>> >>> >>
>>>>>>  `beam-sdks-java-private-deps`
>>>>>> >>> >> and
>>>>>> >>> >>                                         imports like
>>>>>> >>> >>
>>>>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>>>>> >>> >>                                         directly (this is what
>>>>>> >>> >>                                         they
>>>>>> >>> >>                                         are rewritten to
>>>>>> >>> >>
>>>>>> >>> >>                                         Some benefits
>>>>>> >>> >>
>>>>>> >>> >>                                           - much faster builds
>>>>>> of
>>>>>> >>> >> other modules
>>>>>> >>> >>                                           - only one shaded
>>>>>> uber jar
>>>>>> >>> >>                                           - rare/no rebuilds
>>>>>> of the
>>>>>> >>> >> uber jar
>>>>>> >>> >>                                           - can use maven
>>>>>> enforcer to
>>>>>> >>> >> forbid
>>>>>> >>> >>                                         imports like
>>>>>> com.google.common
>>>>>> >>> >>                                           - configuration all
>>>>>> in one
>>>>>> >>> >> place
>>>>>> >>> >>                                           - no automated
>>>>>> rewriting of
>>>>>> >>> >> our real
>>>>>> >>> >>                                         code, which has led to
>>>>>> some
>>>>>> >>> >> major
>>>>>> >>> >>                                         confusion
>>>>>> >>> >>                                           - easy to implement
>>>>>> >>> >> incrementally
>>>>>> >>> >>
>>>>>> >>> >>                                         Downsides:
>>>>>> >>> >>
>>>>>> >>> >>                                           - plenty of effort
>>>>>> work to
>>>>>> >>> >> get there
>>>>>> >>> >>                                           - unclear how many
>>>>>> different
>>>>>> >>> >> such
>>>>>> >>> >>                                         deps modules we need;
>>>>>> sharing
>>>>>> >>> >> them
>>>>>> >>> >>
>>>>>> >>> >>                                     could
>>>>>> >>> >>
>>>>>> >>> >>                                         get weird
>>>>>> >>> >>                                           - if we hit a
>>>>>> roadblock, we
>>>>>> >>> >> will
>>>>>> >>> >>                                         have committed a lot
>>>>>> of time
>>>>>> >>> >>
>>>>>> >>> >>                                         Just something I was
>>>>>> musing as
>>>>>> >>> >> I spent
>>>>>> >>> >>                                         another evening
>>>>>> waiting for
>>>>>> >>> >> slow
>>>>>> >>> >>                                         builds to try to
>>>>>> confirm
>>>>>> >>> >> changes to
>>>>>> >>> >>                                         brittle poms.
>>>>>> >>> >>
>>>>>> >>> >>                                         Kenn
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>                         --
>>>>>> >>> >>                         Jean-Baptiste Onofré
>>>>>> >>> >>                         jbonofre@apache.org
>>>>>> >>> >> <ma...@apache.org>
>>>>>> >>> >>                         http://blog.nanthrax.net
>>>>>> >>> >> <http://blog.nanthrax.net/>
>>>>>> >>> >>                         Talend - http://www.talend.com
>>>>>> >>> >> <http://www.talend.com/>
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Jean-Baptiste Onofré
>>>>>> >>> jbonofre@apache.org
>>>>>> >>> http://blog.nanthrax.net
>>>>>> >>> Talend - http://www.talend.com
>>>>>> >>
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Don't forget beam doesn't support much behind it (mainly only a few
ParameterizedType due the usage code path) so it is mainly only about
handling parameterized types and typevariables recursively. Not a lot of
work. Main concern being it is in the API so using a shade as an API is
never a good idea, in particular cause the shade can be broken and requires
to setup clirr or things like that and when it breaks you are done and need
to fork it anyway. Limiting the dependencies for an API - as beam is - is
always saner even if it requires a small fork of code.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-02 15:49 GMT+01:00 Kenneth Knowles <kl...@google.com>:

> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>>
>>
>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>>
>>> Another couple:
>>>
>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>>>
>>
>> Technically reflect Type is enough
>>
>
> If you want to try to remove TypeToken from underneath TypeDescriptor, I
> have no objections as long as you expand the test suite to cover all the
> functionality where we trust TypeToken's tests. Good luck :-)
>
> Kenn
>
>
>>
>>
>>>  - ImmutableList and friends and their builders are very widely used and
>>> IMO still add a lot for readability and preventing someone coming along and
>>> adding mistakes to a codebase
>>>
>>
>> Sugar but not required. When you compare the cost of a shade versus of
>> duplicating the parts we need there is no real match IMHO.
>>
>>
>>>
>>> So considering it all, I would keep a vendored Guava (but also move off
>>> where we can, and also have our own improvements). I hope it will be a
>>> near-empty build file to generate it so not a maintenance burden.
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> Nice. This sounds like a great idea (or two?) and goes along with what
>>>> I just started for futures.
>>>>
>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>
>>>> Specific things for our micro guava:
>>>>
>>>>  - checkArgumentNotNull can throw the right exception
>>>>  - our own Optional because Java's is not Serializable
>>>>  - futures combinators since many are missing, especially Java's don't
>>>> do exceptions right
>>>>
>>>> Protobuf: didn't file an issue because I'm not sure
>>>>
>>>> I was wondering if pre-shading works. We really need to get rid of it
>>>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>>>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>>>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>>>> a Message object. (and do the same for beam-model-protobuf-java since the
>>>> model bits have to depend on each other but nothing else).
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>>> Huge +1 to get rid of Guava!
>>>>>
>>>>> This solves annoying dependency issues for some IOs and allow us to
>>>>> get rid of the shading that makes current jars bigger than they should
>>>>> be.
>>>>>
>>>>> We can create our own 'micro guava' package with some classes for
>>>>> things that are hard to migrate, or that we  prefer to still have like
>>>>> the check* methods for example. Given the size of the task we should
>>>>> probably divide it into subtasks, more important is to get rid of it
>>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>>
>>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>>>> that is byting us for the Spark runner, but well that's worth a
>>>>> different discussion.
>>>>>
>>>>>
>>>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>>> <rm...@gmail.com> wrote:
>>>>> > a map of list is fine and not a challenge we'll face long I hope ;)
>>>>> >
>>>>> >
>>>>> > Romain Manni-Bucau
>>>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>>>> >
>>>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>> >>
>>>>> >> Not sure we'll be able to replace them all. Things like guava Table
>>>>> and
>>>>> >> Multimap don't have great replacements in Java8.
>>>>> >>
>>>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>>>> jb@nanthrax.net>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>>>> >>>
>>>>> >>> Regards
>>>>> >>> JB
>>>>> >>>
>>>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>>>> >>> > Why not dropping guava for all beam codebase? With java 8 it is
>>>>> quite
>>>>> >>> > easy to do
>>>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with
>>>>> quite a
>>>>> >>> > good result.
>>>>> >>> >
>>>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>>>> >>> > <ma...@google.com>> a écrit :
>>>>> >>> >
>>>>> >>> >     Make sure to include the guava version in the artifact name
>>>>> so that
>>>>> >>> > we can
>>>>> >>> >     have multiple vendored versions.
>>>>> >>> >
>>>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>>>> klk@google.com
>>>>> >>> >     <ma...@google.com>> wrote:
>>>>> >>> >
>>>>> >>> >         I didn't have time for this, but it just bit me. We
>>>>> definitely
>>>>> >>> > have
>>>>> >>> >         Guava on the API surface of runner support code in ways
>>>>> that
>>>>> >>> > get
>>>>> >>> >         incompatibly shaded. I will probably start "1a" by
>>>>> making a
>>>>> >>> > shaded
>>>>> >>> >         library org.apache.beam:vendored-guava and starting to
>>>>> use it.
>>>>> >>> > It sounds
>>>>> >>> >         like there is generally unanimous support for that much,
>>>>> >>> > anyhow.
>>>>> >>> >
>>>>> >>> >         Kenn
>>>>> >>> >
>>>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>>>> >>> > <aljoscha@apache.org
>>>>> >>> >         <ma...@apache.org>> wrote:
>>>>> >>> >
>>>>> >>> >             Thanks Ismaël for bringing up this discussion again!
>>>>> >>> >
>>>>> >>> >             I would be in favour of 1) and more specifically of
>>>>> 1a)
>>>>> >>> >
>>>>> >>> >             Aljoscha
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>>>> lcwik@google.com
>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>> >>> >>
>>>>> >>> >>             You can always run tests on post shaded artifacts
>>>>> instead
>>>>> >>> >> of the
>>>>> >>> >>             compiled classes, it just requires us to change our
>>>>> maven
>>>>> >>> >> surefire
>>>>> >>> >>             / gradle test configurations but it is true that
>>>>> most IDEs
>>>>> >>> >> would
>>>>> >>> >>             behave better with a dependency jar unless you
>>>>> delegate
>>>>> >>> >> all the
>>>>> >>> >>             build/test actions to the build system and then it
>>>>> won't
>>>>> >>> >> matter.
>>>>> >>> >>
>>>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>>>> >>> >> <klk@google.com
>>>>> >>> >>             <ma...@google.com>> wrote:
>>>>> >>> >>
>>>>> >>> >>                 There's also, with additional overhead,
>>>>> >>> >>
>>>>> >>> >>                 1a) A relocated and shipped package for each
>>>>> thing we
>>>>> >>> >> want to
>>>>> >>> >>                 relocate. I think this has also been tried
>>>>> outside
>>>>> >>> >> Beam...
>>>>> >>> >>
>>>>> >>> >>                 Pros:
>>>>> >>> >>                 * all the pros of 1) plus no bloat beyond what
>>>>> is
>>>>> >>> >> necessary
>>>>> >>> >>                 Cons:
>>>>> >>> >>                 * abandons whitelist approach for public deps,
>>>>> >>> >> reverting to
>>>>> >>> >>                 blacklist approach for trouble things like
>>>>> guava, so a
>>>>> >>> >> bit
>>>>> >>> >>                 less principled
>>>>> >>> >>
>>>>> >>> >>                 For both 1) and 1a) I would add:
>>>>> >>> >>
>>>>> >>> >>                 Pros:
>>>>> >>> >>                 * clearly readable dependency since code will
>>>>> `import
>>>>> >>> >>                 org.apache.beam.private.guava21` and IDEs will
>>>>> >>> >> understand it
>>>>> >>> >>                 is a distinct lilbrary
>>>>> >>> >>                 * can run tests on unpackaged classes, as long
>>>>> as deps
>>>>> >>> >> are
>>>>> >>> >>                 shaded or provided as jars
>>>>> >>> >>                 * no mysterious action at a distance from
>>>>> inherited
>>>>> >>> >> configuration
>>>>> >>> >>                 Cons:
>>>>> >>> >>                 * need to adjust imports
>>>>> >>> >>
>>>>> >>> >>                 Kenn
>>>>> >>> >>
>>>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>>>> >>> >> <lcwik@google.com
>>>>> >>> >>                 <ma...@google.com>> wrote:
>>>>> >>> >>
>>>>> >>> >>                     I would suggest that either we use:
>>>>> >>> >>                     1) A common deps package containing shaded
>>>>> >>> >> dependencies
>>>>> >>> >>                     allows for
>>>>> >>> >>                     Pros
>>>>> >>> >>                     * doesn't require the user to build an uber
>>>>> jar
>>>>> >>> >>                     Risks
>>>>> >>> >>                     * dependencies package will keep growing
>>>>> even if
>>>>> >>> >> something
>>>>> >>> >>                     is or isn't needed by all of Apache Beam
>>>>> leading
>>>>> >>> >> to a
>>>>> >>> >>                     large jar anyways negating any space savings
>>>>> >>> >>
>>>>> >>> >>                     2) Shade within each module to a common
>>>>> location
>>>>> >>> >> like
>>>>> >>> >>                     org.apache.beam.relocated.guava....
>>>>> >>> >>                     Pros
>>>>> >>> >>                     * you only get the shaded dependencies of
>>>>> the
>>>>> >>> >> things that
>>>>> >>> >>                     are required
>>>>> >>> >>                     * its one less dependency for users to
>>>>> manage
>>>>> >>> >>                     Risks
>>>>> >>> >>                     * requires an uber jar to be built to get
>>>>> the
>>>>> >>> >> space
>>>>> >>> >>                     savings (either by a user or a distribution
>>>>> of
>>>>> >>> >> Apache
>>>>> >>> >>                     Beam) otherwise we negate any space savings.
>>>>> >>> >>
>>>>> >>> >>                     If we either use a common relocation scheme
>>>>> or a
>>>>> >>> >>                     dependencies jar then each relocation should
>>>>> >>> >> specifically
>>>>> >>> >>                     contain the version number of the package
>>>>> because
>>>>> >>> >> we would
>>>>> >>> >>                     like to allow for us to be using two
>>>>> different
>>>>> >>> >> versions of
>>>>> >>> >>                     the same library.
>>>>> >>> >>
>>>>> >>> >>                     For the common deps package approach,
>>>>> should we
>>>>> >>> >> check in
>>>>> >>> >>                     code where the imports contain the relocated
>>>>> >>> >> location
>>>>> >>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>>>>> >>> >>
>>>>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>>>>> ct.ImmutableList)?
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM,
>>>>> Jean-Baptiste
>>>>> >>> >> Onofré
>>>>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>>>>> wrote:
>>>>> >>> >>
>>>>> >>> >>                         Thanks for bringing that back.
>>>>> >>> >>
>>>>> >>> >>                         Indeed guava is shaded in different
>>>>> uber-jar.
>>>>> >>> >> Maybe we
>>>>> >>> >>                         can have a common deps module that we
>>>>> include
>>>>> >>> >> once
>>>>> >>> >>                         (but the user will have to explicitly
>>>>> define
>>>>> >>> >> the dep) ?
>>>>> >>> >>
>>>>> >>> >>                         Basically, what do you propose for
>>>>> protobuf
>>>>> >>> >>                         (unfortunately, I don't see an obvious)
>>>>> ?
>>>>> >>> >>
>>>>> >>> >>                         Regards
>>>>> >>> >>                         JB
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía
>>>>> wrote:
>>>>> >>> >>
>>>>> >>> >>                             Hello, I wanted to bring back this
>>>>> subject
>>>>> >>> >> because
>>>>> >>> >>                             I think we should
>>>>> >>> >>                             take action on this and at least
>>>>> first
>>>>> >>> >> have a
>>>>> >>> >>                             shaded version of guava.
>>>>> >>> >>                             I was playing with a toy project
>>>>> and I did
>>>>> >>> >> the
>>>>> >>> >>                             procedure we use to
>>>>> >>> >>                             submit jars to a Hadoop cluster via
>>>>> >>> >> Flink/Spark
>>>>> >>> >>                             which involves
>>>>> >>> >>                             creating an uber jar and I realized
>>>>> that
>>>>> >>> >> the size
>>>>> >>> >>                             of the jar was way
>>>>> >>> >>                             bigger than I expected, and the
>>>>> fact that
>>>>> >>> >> we shade
>>>>> >>> >>                             guava in every
>>>>> >>> >>                             module contributes to this. I found
>>>>> guava
>>>>> >>> >> shaded on:
>>>>> >>> >>
>>>>> >>> >>                             sdks/java/core
>>>>> >>> >>                             runners/core-construction-java
>>>>> >>> >>                             runners/core-java
>>>>> >>> >>                             model/job-management
>>>>> >>> >>                             runners/spark
>>>>> >>> >>                             sdks/java/io/hadoop-file-system
>>>>> >>> >>                             sdks/java/io/kafka
>>>>> >>> >>
>>>>> >>> >>                             This means at least 6 times more of
>>>>> the
>>>>> >>> >> size it
>>>>> >>> >>                             should which counts in
>>>>> >>> >>                             around 15MB more (2.4MB*6 deps) of
>>>>> extra
>>>>> >>> >> weight
>>>>> >>> >>                             that we can simply
>>>>> >>> >>                             reduce by using a shaded version of
>>>>> the
>>>>> >>> >> library.
>>>>> >>> >>
>>>>> >>> >>                             I add this point to the previous
>>>>> ones
>>>>> >>> >> mentioned
>>>>> >>> >>                             during the discussion
>>>>> >>> >>                             because this goes against the end
>>>>> user
>>>>> >>> >> experience
>>>>> >>> >>                             and affects us all
>>>>> >>> >>                             (devs/users).
>>>>> >>> >>
>>>>> >>> >>                             Another question is if we should
>>>>> shade
>>>>> >>> >> (and how)
>>>>> >>> >>                             protocol buffers
>>>>> >>> >>                             because now with the portability
>>>>> work we
>>>>> >>> >> are
>>>>> >>> >>                             exposing it closer to the
>>>>> >>> >>                             end users. I say this because I
>>>>> also found
>>>>> >>> >> an
>>>>> >>> >>                             issue while running a
>>>>> >>> >>                             job on YARN with the spark runner
>>>>> because
>>>>> >>> >>                             hadoop-common includes
>>>>> >>> >>                             protobuf-java 2 and I had to
>>>>> explicitly
>>>>> >>> >> provide
>>>>> >>> >>                             protocol-buffers 3 as
>>>>> >>> >>                             a dependency to be able to use
>>>>> triggers
>>>>> >>> >> (note the
>>>>> >>> >>                             Spark runner
>>>>> >>> >>                             translates them using some method
>>>>> from
>>>>> >>> >>                             runners/core-java). Since
>>>>> >>> >>                             hadoop-common is provided in the
>>>>> cluster
>>>>> >>> >> with the
>>>>> >>> >>                             older version of
>>>>> >>> >>                             protobuf, I am afraid that this
>>>>> will bite
>>>>> >>> >> us in
>>>>> >>> >>                             the future.
>>>>> >>> >>
>>>>> >>> >>                             Ismaël
>>>>> >>> >>
>>>>> >>> >>                             ps. There is already a JIRA for that
>>>>> >>> >> shading for
>>>>> >>> >>                             protobuf on
>>>>> >>> >>                             hadoop-common but this is not
>>>>> coming until
>>>>> >>> >> version
>>>>> >>> >>                             3 is out.
>>>>> >>> >>
>>>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>> >>> >>
>>>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>>>> >>> >>
>>>>> >>> >>                             ps2. Extra curious situation is to
>>>>> see
>>>>> >>> >> that the
>>>>> >>> >>                             dataflow-runner ends
>>>>> >>> >>                             up having guava shaded twice via its
>>>>> >>> >> shaded version on
>>>>> >>> >>                             core-construction-java.
>>>>> >>> >>
>>>>> >>> >>                             ps3. Of course this message means a
>>>>> >>> >> de-facto +1 at
>>>>> >>> >>                             least to do it for
>>>>> >>> >>                             guava and evaluate it for other
>>>>> libs.
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM,
>>>>> Lukasz
>>>>> >>> >> Cwik
>>>>> >>> >>                             <lcwik@google.com.invalid
>>>>> >>> >>                             <ma...@google.com.invalid>>
>>>>> wrote:
>>>>> >>> >>
>>>>> >>> >>                                 An issue to call out is how to
>>>>> deal
>>>>> >>> >> with our
>>>>> >>> >>                                 generated code (.avro and
>>>>> >>> >>                                 .proto) as I don't believe those
>>>>> >>> >> plugins allow
>>>>> >>> >>                                 you to generate code using a
>>>>> >>> >>                                 shaded package prefix on
>>>>> imports.
>>>>> >>> >>
>>>>> >>> >>                                 On Tue, Oct 17, 2017 at 10:28
>>>>> AM,
>>>>> >>> >> Thomas Groh
>>>>> >>> >>                                 <tgroh@google.com.invalid
>>>>> >>> >>                                 <mailto:tgroh@google.com.inva
>>>>> lid>>
>>>>> >>> >>                                 wrote:
>>>>> >>> >>
>>>>> >>> >>                                     +1 to the goal. I'm hugely
>>>>> in
>>>>> >>> >> favor of not
>>>>> >>> >>                                     doing the same shading work
>>>>> >>> >>                                     every time for dependencies
>>>>> we
>>>>> >>> >> know we'll use.
>>>>> >>> >>
>>>>> >>> >>                                     This also means that if we
>>>>> end up
>>>>> >>> >> pulling
>>>>> >>> >>                                     in transitive dependencies
>>>>> we
>>>>> >>> >>                                     don't want in any particular
>>>>> >>> >> module we can
>>>>> >>> >>                                     avoid having to adjust our
>>>>> >>> >>                                     repackaging strategy for
>>>>> that
>>>>> >>> >> module -
>>>>> >>> >>                                     which I have run into
>>>>> face-first
>>>>> >>> >> in
>>>>> >>> >>                                     the past.
>>>>> >>> >>
>>>>> >>> >>                                     On Tue, Oct 17, 2017 at
>>>>> 9:48 AM,
>>>>> >>> >> Kenneth
>>>>> >>> >>                                     Knowles
>>>>> <klk@google.com.invalid
>>>>> >>> >>                                     <mailto:
>>>>> klk@google.com.invalid>>
>>>>> >>> >>                                     wrote:
>>>>> >>> >>
>>>>> >>> >>                                         Hi all,
>>>>> >>> >>
>>>>> >>> >>                                         Shading is a big part
>>>>> of how
>>>>> >>> >> we keep
>>>>> >>> >>                                         our dependencies sane
>>>>> in Beam.
>>>>> >>> >> But
>>>>> >>> >>                                         downsides: shading is
>>>>> super
>>>>> >>> >> slow,
>>>>> >>> >>                                         causes massive jar
>>>>> bloat, and
>>>>> >>> >> kind of
>>>>> >>> >>                                         hard to get right
>>>>> because
>>>>> >>> >> artifacts
>>>>> >>> >>                                         and namespaces are not
>>>>> 1-to-1.
>>>>> >>> >>
>>>>> >>> >>                                         I know that some
>>>>> communities
>>>>> >>> >>                                         distribute their own
>>>>> shaded
>>>>> >>> >>                                         distributions of
>>>>> >>> >>                                         dependencies. I had a
>>>>> thought
>>>>> >>> >> about
>>>>> >>> >>                                         doing something similar
>>>>> that I
>>>>> >>> >> wanted
>>>>> >>> >>                                         to throw out there for
>>>>> people
>>>>> >>> >> to poke
>>>>> >>> >>                                         holes in.
>>>>> >>> >>
>>>>> >>> >>                                         To set the scene, here
>>>>> is how
>>>>> >>> >> I view
>>>>> >>> >>                                         shading:
>>>>> >>> >>
>>>>> >>> >>                                           - A module has public
>>>>> >>> >> dependencies
>>>>> >>> >>                                         and private
>>>>> dependencies.
>>>>> >>> >>                                           - Public deps are
>>>>> used for
>>>>> >>> >> data
>>>>> >>> >>                                         interchange; users must
>>>>> share
>>>>> >>> >> these
>>>>> >>> >>
>>>>> >>> >>                                     deps.
>>>>> >>> >>
>>>>> >>> >>                                           - Private deps are
>>>>> just
>>>>> >>> >>                                         functionality and can be
>>>>> >>> >> hidden (in
>>>>> >>> >>                                         our case,
>>>>> >>> >>                                         relocated + bundled)
>>>>> >>> >>                                           - It isn't
>>>>> necessarily that
>>>>> >>> >> simple,
>>>>> >>> >>                                         because public and
>>>>> private
>>>>> >>> >> deps
>>>>> >>> >>
>>>>> >>> >>                                     might
>>>>> >>> >>
>>>>> >>> >>                                         interact in
>>>>> higher-order ways
>>>>> >>> >>                                         ("public" is contagious)
>>>>> >>> >>
>>>>> >>> >>                                         Shading is an
>>>>> implementation
>>>>> >>> >> detail of
>>>>> >>> >>                                         expressing these
>>>>> >>> >> characteristics.
>>>>> >>> >>
>>>>> >>> >>                                     We
>>>>> >>> >>
>>>>> >>> >>                                         use shading selectively
>>>>> >>> >> because of its
>>>>> >>> >>                                         downsides I mentioned
>>>>> above.
>>>>> >>> >>
>>>>> >>> >>                                         But what about this
>>>>> idea:
>>>>> >>> >> Introduce
>>>>> >>> >>                                         shaded deps as a single
>>>>> >>> >> separate
>>>>> >>> >>                                         artifact.
>>>>> >>> >>
>>>>> >>> >>                                           -
>>>>> sdks/java/private-deps:
>>>>> >>> >> bundled
>>>>> >>> >>                                         uber jar with relocated
>>>>> >>> >> versions of
>>>>> >>> >>                                         everything we want to
>>>>> shade
>>>>> >>> >>
>>>>> >>> >>                                           - sdks/java/core and
>>>>> >>> >>                                         sdks/java/harness: no
>>>>> >>> >> relocation or
>>>>> >>> >>                                         bundling -
>>>>> >>> >>                                         depends on
>>>>> >>> >>
>>>>>  `beam-sdks-java-private-deps`
>>>>> >>> >> and
>>>>> >>> >>                                         imports like
>>>>> >>> >>
>>>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>>>> >>> >>                                         directly (this is what
>>>>> >>> >>                                         they
>>>>> >>> >>                                         are rewritten to
>>>>> >>> >>
>>>>> >>> >>                                         Some benefits
>>>>> >>> >>
>>>>> >>> >>                                           - much faster builds
>>>>> of
>>>>> >>> >> other modules
>>>>> >>> >>                                           - only one shaded
>>>>> uber jar
>>>>> >>> >>                                           - rare/no rebuilds of
>>>>> the
>>>>> >>> >> uber jar
>>>>> >>> >>                                           - can use maven
>>>>> enforcer to
>>>>> >>> >> forbid
>>>>> >>> >>                                         imports like
>>>>> com.google.common
>>>>> >>> >>                                           - configuration all
>>>>> in one
>>>>> >>> >> place
>>>>> >>> >>                                           - no automated
>>>>> rewriting of
>>>>> >>> >> our real
>>>>> >>> >>                                         code, which has led to
>>>>> some
>>>>> >>> >> major
>>>>> >>> >>                                         confusion
>>>>> >>> >>                                           - easy to implement
>>>>> >>> >> incrementally
>>>>> >>> >>
>>>>> >>> >>                                         Downsides:
>>>>> >>> >>
>>>>> >>> >>                                           - plenty of effort
>>>>> work to
>>>>> >>> >> get there
>>>>> >>> >>                                           - unclear how many
>>>>> different
>>>>> >>> >> such
>>>>> >>> >>                                         deps modules we need;
>>>>> sharing
>>>>> >>> >> them
>>>>> >>> >>
>>>>> >>> >>                                     could
>>>>> >>> >>
>>>>> >>> >>                                         get weird
>>>>> >>> >>                                           - if we hit a
>>>>> roadblock, we
>>>>> >>> >> will
>>>>> >>> >>                                         have committed a lot of
>>>>> time
>>>>> >>> >>
>>>>> >>> >>                                         Just something I was
>>>>> musing as
>>>>> >>> >> I spent
>>>>> >>> >>                                         another evening waiting
>>>>> for
>>>>> >>> >> slow
>>>>> >>> >>                                         builds to try to confirm
>>>>> >>> >> changes to
>>>>> >>> >>                                         brittle poms.
>>>>> >>> >>
>>>>> >>> >>                                         Kenn
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>                         --
>>>>> >>> >>                         Jean-Baptiste Onofré
>>>>> >>> >>                         jbonofre@apache.org
>>>>> >>> >> <ma...@apache.org>
>>>>> >>> >>                         http://blog.nanthrax.net
>>>>> >>> >> <http://blog.nanthrax.net/>
>>>>> >>> >>                         Talend - http://www.talend.com
>>>>> >>> >> <http://www.talend.com/>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >
>>>>> >>>
>>>>> >>> --
>>>>> >>> Jean-Baptiste Onofré
>>>>> >>> jbonofre@apache.org
>>>>> >>> http://blog.nanthrax.net
>>>>> >>> Talend - http://www.talend.com
>>>>> >>
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Kenneth Knowles <kl...@google.com>.
On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
>
> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:
>
>> Another couple:
>>
>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>>
>
> Technically reflect Type is enough
>

If you want to try to remove TypeToken from underneath TypeDescriptor, I
have no objections as long as you expand the test suite to cover all the
functionality where we trust TypeToken's tests. Good luck :-)

Kenn


>
>
>>  - ImmutableList and friends and their builders are very widely used and
>> IMO still add a lot for readability and preventing someone coming along and
>> adding mistakes to a codebase
>>
>
> Sugar but not required. When you compare the cost of a shade versus of
> duplicating the parts we need there is no real match IMHO.
>
>
>>
>> So considering it all, I would keep a vendored Guava (but also move off
>> where we can, and also have our own improvements). I hope it will be a
>> near-empty build file to generate it so not a maintenance burden.
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com> wrote:
>>
>>> Nice. This sounds like a great idea (or two?) and goes along with what I
>>> just started for futures.
>>>
>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>
>>> Specific things for our micro guava:
>>>
>>>  - checkArgumentNotNull can throw the right exception
>>>  - our own Optional because Java's is not Serializable
>>>  - futures combinators since many are missing, especially Java's don't
>>> do exceptions right
>>>
>>> Protobuf: didn't file an issue because I'm not sure
>>>
>>> I was wondering if pre-shading works. We really need to get rid of it
>>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>>> a Message object. (and do the same for beam-model-protobuf-java since the
>>> model bits have to depend on each other but nothing else).
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com> wrote:
>>>
>>>> Huge +1 to get rid of Guava!
>>>>
>>>> This solves annoying dependency issues for some IOs and allow us to
>>>> get rid of the shading that makes current jars bigger than they should
>>>> be.
>>>>
>>>> We can create our own 'micro guava' package with some classes for
>>>> things that are hard to migrate, or that we  prefer to still have like
>>>> the check* methods for example. Given the size of the task we should
>>>> probably divide it into subtasks, more important is to get rid of it
>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>
>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>>> that is byting us for the Spark runner, but well that's worth a
>>>> different discussion.
>>>>
>>>>
>>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>> <rm...@gmail.com> wrote:
>>>> > a map of list is fine and not a challenge we'll face long I hope ;)
>>>> >
>>>> >
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>>> >
>>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>>> >>
>>>> >> Not sure we'll be able to replace them all. Things like guava Table
>>>> and
>>>> >> Multimap don't have great replacements in Java8.
>>>> >>
>>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>>> jb@nanthrax.net>
>>>> >> wrote:
>>>> >>>
>>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>>> >>>
>>>> >>> Regards
>>>> >>> JB
>>>> >>>
>>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>>> >>> > Why not dropping guava for all beam codebase? With java 8 it is
>>>> quite
>>>> >>> > easy to do
>>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with
>>>> quite a
>>>> >>> > good result.
>>>> >>> >
>>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>>> >>> > <ma...@google.com>> a écrit :
>>>> >>> >
>>>> >>> >     Make sure to include the guava version in the artifact name
>>>> so that
>>>> >>> > we can
>>>> >>> >     have multiple vendored versions.
>>>> >>> >
>>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>>> klk@google.com
>>>> >>> >     <ma...@google.com>> wrote:
>>>> >>> >
>>>> >>> >         I didn't have time for this, but it just bit me. We
>>>> definitely
>>>> >>> > have
>>>> >>> >         Guava on the API surface of runner support code in ways
>>>> that
>>>> >>> > get
>>>> >>> >         incompatibly shaded. I will probably start "1a" by making
>>>> a
>>>> >>> > shaded
>>>> >>> >         library org.apache.beam:vendored-guava and starting to
>>>> use it.
>>>> >>> > It sounds
>>>> >>> >         like there is generally unanimous support for that much,
>>>> >>> > anyhow.
>>>> >>> >
>>>> >>> >         Kenn
>>>> >>> >
>>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>>> >>> > <aljoscha@apache.org
>>>> >>> >         <ma...@apache.org>> wrote:
>>>> >>> >
>>>> >>> >             Thanks Ismaël for bringing up this discussion again!
>>>> >>> >
>>>> >>> >             I would be in favour of 1) and more specifically of
>>>> 1a)
>>>> >>> >
>>>> >>> >             Aljoscha
>>>> >>> >
>>>> >>> >
>>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>>> lcwik@google.com
>>>> >>> >>             <ma...@google.com>> wrote:
>>>> >>> >>
>>>> >>> >>             You can always run tests on post shaded artifacts
>>>> instead
>>>> >>> >> of the
>>>> >>> >>             compiled classes, it just requires us to change our
>>>> maven
>>>> >>> >> surefire
>>>> >>> >>             / gradle test configurations but it is true that
>>>> most IDEs
>>>> >>> >> would
>>>> >>> >>             behave better with a dependency jar unless you
>>>> delegate
>>>> >>> >> all the
>>>> >>> >>             build/test actions to the build system and then it
>>>> won't
>>>> >>> >> matter.
>>>> >>> >>
>>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>>> >>> >> <klk@google.com
>>>> >>> >>             <ma...@google.com>> wrote:
>>>> >>> >>
>>>> >>> >>                 There's also, with additional overhead,
>>>> >>> >>
>>>> >>> >>                 1a) A relocated and shipped package for each
>>>> thing we
>>>> >>> >> want to
>>>> >>> >>                 relocate. I think this has also been tried
>>>> outside
>>>> >>> >> Beam...
>>>> >>> >>
>>>> >>> >>                 Pros:
>>>> >>> >>                 * all the pros of 1) plus no bloat beyond what is
>>>> >>> >> necessary
>>>> >>> >>                 Cons:
>>>> >>> >>                 * abandons whitelist approach for public deps,
>>>> >>> >> reverting to
>>>> >>> >>                 blacklist approach for trouble things like
>>>> guava, so a
>>>> >>> >> bit
>>>> >>> >>                 less principled
>>>> >>> >>
>>>> >>> >>                 For both 1) and 1a) I would add:
>>>> >>> >>
>>>> >>> >>                 Pros:
>>>> >>> >>                 * clearly readable dependency since code will
>>>> `import
>>>> >>> >>                 org.apache.beam.private.guava21` and IDEs will
>>>> >>> >> understand it
>>>> >>> >>                 is a distinct lilbrary
>>>> >>> >>                 * can run tests on unpackaged classes, as long
>>>> as deps
>>>> >>> >> are
>>>> >>> >>                 shaded or provided as jars
>>>> >>> >>                 * no mysterious action at a distance from
>>>> inherited
>>>> >>> >> configuration
>>>> >>> >>                 Cons:
>>>> >>> >>                 * need to adjust imports
>>>> >>> >>
>>>> >>> >>                 Kenn
>>>> >>> >>
>>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>>> >>> >> <lcwik@google.com
>>>> >>> >>                 <ma...@google.com>> wrote:
>>>> >>> >>
>>>> >>> >>                     I would suggest that either we use:
>>>> >>> >>                     1) A common deps package containing shaded
>>>> >>> >> dependencies
>>>> >>> >>                     allows for
>>>> >>> >>                     Pros
>>>> >>> >>                     * doesn't require the user to build an uber
>>>> jar
>>>> >>> >>                     Risks
>>>> >>> >>                     * dependencies package will keep growing
>>>> even if
>>>> >>> >> something
>>>> >>> >>                     is or isn't needed by all of Apache Beam
>>>> leading
>>>> >>> >> to a
>>>> >>> >>                     large jar anyways negating any space savings
>>>> >>> >>
>>>> >>> >>                     2) Shade within each module to a common
>>>> location
>>>> >>> >> like
>>>> >>> >>                     org.apache.beam.relocated.guava....
>>>> >>> >>                     Pros
>>>> >>> >>                     * you only get the shaded dependencies of the
>>>> >>> >> things that
>>>> >>> >>                     are required
>>>> >>> >>                     * its one less dependency for users to manage
>>>> >>> >>                     Risks
>>>> >>> >>                     * requires an uber jar to be built to get the
>>>> >>> >> space
>>>> >>> >>                     savings (either by a user or a distribution
>>>> of
>>>> >>> >> Apache
>>>> >>> >>                     Beam) otherwise we negate any space savings.
>>>> >>> >>
>>>> >>> >>                     If we either use a common relocation scheme
>>>> or a
>>>> >>> >>                     dependencies jar then each relocation should
>>>> >>> >> specifically
>>>> >>> >>                     contain the version number of the package
>>>> because
>>>> >>> >> we would
>>>> >>> >>                     like to allow for us to be using two
>>>> different
>>>> >>> >> versions of
>>>> >>> >>                     the same library.
>>>> >>> >>
>>>> >>> >>                     For the common deps package approach, should
>>>> we
>>>> >>> >> check in
>>>> >>> >>                     code where the imports contain the relocated
>>>> >>> >> location
>>>> >>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>>>> >>> >>
>>>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>>>> ct.ImmutableList)?
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM,
>>>> Jean-Baptiste
>>>> >>> >> Onofré
>>>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>>>> wrote:
>>>> >>> >>
>>>> >>> >>                         Thanks for bringing that back.
>>>> >>> >>
>>>> >>> >>                         Indeed guava is shaded in different
>>>> uber-jar.
>>>> >>> >> Maybe we
>>>> >>> >>                         can have a common deps module that we
>>>> include
>>>> >>> >> once
>>>> >>> >>                         (but the user will have to explicitly
>>>> define
>>>> >>> >> the dep) ?
>>>> >>> >>
>>>> >>> >>                         Basically, what do you propose for
>>>> protobuf
>>>> >>> >>                         (unfortunately, I don't see an obvious) ?
>>>> >>> >>
>>>> >>> >>                         Regards
>>>> >>> >>                         JB
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía
>>>> wrote:
>>>> >>> >>
>>>> >>> >>                             Hello, I wanted to bring back this
>>>> subject
>>>> >>> >> because
>>>> >>> >>                             I think we should
>>>> >>> >>                             take action on this and at least
>>>> first
>>>> >>> >> have a
>>>> >>> >>                             shaded version of guava.
>>>> >>> >>                             I was playing with a toy project and
>>>> I did
>>>> >>> >> the
>>>> >>> >>                             procedure we use to
>>>> >>> >>                             submit jars to a Hadoop cluster via
>>>> >>> >> Flink/Spark
>>>> >>> >>                             which involves
>>>> >>> >>                             creating an uber jar and I realized
>>>> that
>>>> >>> >> the size
>>>> >>> >>                             of the jar was way
>>>> >>> >>                             bigger than I expected, and the fact
>>>> that
>>>> >>> >> we shade
>>>> >>> >>                             guava in every
>>>> >>> >>                             module contributes to this. I found
>>>> guava
>>>> >>> >> shaded on:
>>>> >>> >>
>>>> >>> >>                             sdks/java/core
>>>> >>> >>                             runners/core-construction-java
>>>> >>> >>                             runners/core-java
>>>> >>> >>                             model/job-management
>>>> >>> >>                             runners/spark
>>>> >>> >>                             sdks/java/io/hadoop-file-system
>>>> >>> >>                             sdks/java/io/kafka
>>>> >>> >>
>>>> >>> >>                             This means at least 6 times more of
>>>> the
>>>> >>> >> size it
>>>> >>> >>                             should which counts in
>>>> >>> >>                             around 15MB more (2.4MB*6 deps) of
>>>> extra
>>>> >>> >> weight
>>>> >>> >>                             that we can simply
>>>> >>> >>                             reduce by using a shaded version of
>>>> the
>>>> >>> >> library.
>>>> >>> >>
>>>> >>> >>                             I add this point to the previous ones
>>>> >>> >> mentioned
>>>> >>> >>                             during the discussion
>>>> >>> >>                             because this goes against the end
>>>> user
>>>> >>> >> experience
>>>> >>> >>                             and affects us all
>>>> >>> >>                             (devs/users).
>>>> >>> >>
>>>> >>> >>                             Another question is if we should
>>>> shade
>>>> >>> >> (and how)
>>>> >>> >>                             protocol buffers
>>>> >>> >>                             because now with the portability
>>>> work we
>>>> >>> >> are
>>>> >>> >>                             exposing it closer to the
>>>> >>> >>                             end users. I say this because I also
>>>> found
>>>> >>> >> an
>>>> >>> >>                             issue while running a
>>>> >>> >>                             job on YARN with the spark runner
>>>> because
>>>> >>> >>                             hadoop-common includes
>>>> >>> >>                             protobuf-java 2 and I had to
>>>> explicitly
>>>> >>> >> provide
>>>> >>> >>                             protocol-buffers 3 as
>>>> >>> >>                             a dependency to be able to use
>>>> triggers
>>>> >>> >> (note the
>>>> >>> >>                             Spark runner
>>>> >>> >>                             translates them using some method
>>>> from
>>>> >>> >>                             runners/core-java). Since
>>>> >>> >>                             hadoop-common is provided in the
>>>> cluster
>>>> >>> >> with the
>>>> >>> >>                             older version of
>>>> >>> >>                             protobuf, I am afraid that this will
>>>> bite
>>>> >>> >> us in
>>>> >>> >>                             the future.
>>>> >>> >>
>>>> >>> >>                             Ismaël
>>>> >>> >>
>>>> >>> >>                             ps. There is already a JIRA for that
>>>> >>> >> shading for
>>>> >>> >>                             protobuf on
>>>> >>> >>                             hadoop-common but this is not coming
>>>> until
>>>> >>> >> version
>>>> >>> >>                             3 is out.
>>>> >>> >>
>>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>>> >>> >>
>>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>>> >>> >>
>>>> >>> >>                             ps2. Extra curious situation is to
>>>> see
>>>> >>> >> that the
>>>> >>> >>                             dataflow-runner ends
>>>> >>> >>                             up having guava shaded twice via its
>>>> >>> >> shaded version on
>>>> >>> >>                             core-construction-java.
>>>> >>> >>
>>>> >>> >>                             ps3. Of course this message means a
>>>> >>> >> de-facto +1 at
>>>> >>> >>                             least to do it for
>>>> >>> >>                             guava and evaluate it for other libs.
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM,
>>>> Lukasz
>>>> >>> >> Cwik
>>>> >>> >>                             <lcwik@google.com.invalid
>>>> >>> >>                             <ma...@google.com.invalid>>
>>>> wrote:
>>>> >>> >>
>>>> >>> >>                                 An issue to call out is how to
>>>> deal
>>>> >>> >> with our
>>>> >>> >>                                 generated code (.avro and
>>>> >>> >>                                 .proto) as I don't believe those
>>>> >>> >> plugins allow
>>>> >>> >>                                 you to generate code using a
>>>> >>> >>                                 shaded package prefix on imports.
>>>> >>> >>
>>>> >>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
>>>> >>> >> Thomas Groh
>>>> >>> >>                                 <tgroh@google.com.invalid
>>>> >>> >>                                 <mailto:tgroh@google.com.invalid
>>>> >>
>>>> >>> >>                                 wrote:
>>>> >>> >>
>>>> >>> >>                                     +1 to the goal. I'm hugely in
>>>> >>> >> favor of not
>>>> >>> >>                                     doing the same shading work
>>>> >>> >>                                     every time for dependencies
>>>> we
>>>> >>> >> know we'll use.
>>>> >>> >>
>>>> >>> >>                                     This also means that if we
>>>> end up
>>>> >>> >> pulling
>>>> >>> >>                                     in transitive dependencies we
>>>> >>> >>                                     don't want in any particular
>>>> >>> >> module we can
>>>> >>> >>                                     avoid having to adjust our
>>>> >>> >>                                     repackaging strategy for that
>>>> >>> >> module -
>>>> >>> >>                                     which I have run into
>>>> face-first
>>>> >>> >> in
>>>> >>> >>                                     the past.
>>>> >>> >>
>>>> >>> >>                                     On Tue, Oct 17, 2017 at 9:48
>>>> AM,
>>>> >>> >> Kenneth
>>>> >>> >>                                     Knowles
>>>> <klk@google.com.invalid
>>>> >>> >>                                     <mailto:
>>>> klk@google.com.invalid>>
>>>> >>> >>                                     wrote:
>>>> >>> >>
>>>> >>> >>                                         Hi all,
>>>> >>> >>
>>>> >>> >>                                         Shading is a big part of
>>>> how
>>>> >>> >> we keep
>>>> >>> >>                                         our dependencies sane in
>>>> Beam.
>>>> >>> >> But
>>>> >>> >>                                         downsides: shading is
>>>> super
>>>> >>> >> slow,
>>>> >>> >>                                         causes massive jar
>>>> bloat, and
>>>> >>> >> kind of
>>>> >>> >>                                         hard to get right because
>>>> >>> >> artifacts
>>>> >>> >>                                         and namespaces are not
>>>> 1-to-1.
>>>> >>> >>
>>>> >>> >>                                         I know that some
>>>> communities
>>>> >>> >>                                         distribute their own
>>>> shaded
>>>> >>> >>                                         distributions of
>>>> >>> >>                                         dependencies. I had a
>>>> thought
>>>> >>> >> about
>>>> >>> >>                                         doing something similar
>>>> that I
>>>> >>> >> wanted
>>>> >>> >>                                         to throw out there for
>>>> people
>>>> >>> >> to poke
>>>> >>> >>                                         holes in.
>>>> >>> >>
>>>> >>> >>                                         To set the scene, here
>>>> is how
>>>> >>> >> I view
>>>> >>> >>                                         shading:
>>>> >>> >>
>>>> >>> >>                                           - A module has public
>>>> >>> >> dependencies
>>>> >>> >>                                         and private dependencies.
>>>> >>> >>                                           - Public deps are used
>>>> for
>>>> >>> >> data
>>>> >>> >>                                         interchange; users must
>>>> share
>>>> >>> >> these
>>>> >>> >>
>>>> >>> >>                                     deps.
>>>> >>> >>
>>>> >>> >>                                           - Private deps are just
>>>> >>> >>                                         functionality and can be
>>>> >>> >> hidden (in
>>>> >>> >>                                         our case,
>>>> >>> >>                                         relocated + bundled)
>>>> >>> >>                                           - It isn't necessarily
>>>> that
>>>> >>> >> simple,
>>>> >>> >>                                         because public and
>>>> private
>>>> >>> >> deps
>>>> >>> >>
>>>> >>> >>                                     might
>>>> >>> >>
>>>> >>> >>                                         interact in higher-order
>>>> ways
>>>> >>> >>                                         ("public" is contagious)
>>>> >>> >>
>>>> >>> >>                                         Shading is an
>>>> implementation
>>>> >>> >> detail of
>>>> >>> >>                                         expressing these
>>>> >>> >> characteristics.
>>>> >>> >>
>>>> >>> >>                                     We
>>>> >>> >>
>>>> >>> >>                                         use shading selectively
>>>> >>> >> because of its
>>>> >>> >>                                         downsides I mentioned
>>>> above.
>>>> >>> >>
>>>> >>> >>                                         But what about this idea:
>>>> >>> >> Introduce
>>>> >>> >>                                         shaded deps as a single
>>>> >>> >> separate
>>>> >>> >>                                         artifact.
>>>> >>> >>
>>>> >>> >>                                           -
>>>> sdks/java/private-deps:
>>>> >>> >> bundled
>>>> >>> >>                                         uber jar with relocated
>>>> >>> >> versions of
>>>> >>> >>                                         everything we want to
>>>> shade
>>>> >>> >>
>>>> >>> >>                                           - sdks/java/core and
>>>> >>> >>                                         sdks/java/harness: no
>>>> >>> >> relocation or
>>>> >>> >>                                         bundling -
>>>> >>> >>                                         depends on
>>>> >>> >>
>>>>  `beam-sdks-java-private-deps`
>>>> >>> >> and
>>>> >>> >>                                         imports like
>>>> >>> >>
>>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>>> >>> >>                                         directly (this is what
>>>> >>> >>                                         they
>>>> >>> >>                                         are rewritten to
>>>> >>> >>
>>>> >>> >>                                         Some benefits
>>>> >>> >>
>>>> >>> >>                                           - much faster builds of
>>>> >>> >> other modules
>>>> >>> >>                                           - only one shaded uber
>>>> jar
>>>> >>> >>                                           - rare/no rebuilds of
>>>> the
>>>> >>> >> uber jar
>>>> >>> >>                                           - can use maven
>>>> enforcer to
>>>> >>> >> forbid
>>>> >>> >>                                         imports like
>>>> com.google.common
>>>> >>> >>                                           - configuration all in
>>>> one
>>>> >>> >> place
>>>> >>> >>                                           - no automated
>>>> rewriting of
>>>> >>> >> our real
>>>> >>> >>                                         code, which has led to
>>>> some
>>>> >>> >> major
>>>> >>> >>                                         confusion
>>>> >>> >>                                           - easy to implement
>>>> >>> >> incrementally
>>>> >>> >>
>>>> >>> >>                                         Downsides:
>>>> >>> >>
>>>> >>> >>                                           - plenty of effort
>>>> work to
>>>> >>> >> get there
>>>> >>> >>                                           - unclear how many
>>>> different
>>>> >>> >> such
>>>> >>> >>                                         deps modules we need;
>>>> sharing
>>>> >>> >> them
>>>> >>> >>
>>>> >>> >>                                     could
>>>> >>> >>
>>>> >>> >>                                         get weird
>>>> >>> >>                                           - if we hit a
>>>> roadblock, we
>>>> >>> >> will
>>>> >>> >>                                         have committed a lot of
>>>> time
>>>> >>> >>
>>>> >>> >>                                         Just something I was
>>>> musing as
>>>> >>> >> I spent
>>>> >>> >>                                         another evening waiting
>>>> for
>>>> >>> >> slow
>>>> >>> >>                                         builds to try to confirm
>>>> >>> >> changes to
>>>> >>> >>                                         brittle poms.
>>>> >>> >>
>>>> >>> >>                                         Kenn
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>                         --
>>>> >>> >>                         Jean-Baptiste Onofré
>>>> >>> >>                         jbonofre@apache.org
>>>> >>> >> <ma...@apache.org>
>>>> >>> >>                         http://blog.nanthrax.net
>>>> >>> >> <http://blog.nanthrax.net/>
>>>> >>> >>                         Talend - http://www.talend.com
>>>> >>> >> <http://www.talend.com/>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>>
>>>> >>> --
>>>> >>> Jean-Baptiste Onofré
>>>> >>> jbonofre@apache.org
>>>> >>> http://blog.nanthrax.net
>>>> >>> Talend - http://www.talend.com
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Romain Manni-Bucau <rm...@gmail.com>.
2018-02-02 15:37 GMT+01:00 Kenneth Knowles <kl...@google.com>:

> Another couple:
>
>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>

Technically reflect Type is enough


>  - ImmutableList and friends and their builders are very widely used and
> IMO still add a lot for readability and preventing someone coming along and
> adding mistakes to a codebase
>

Sugar but not required. When you compare the cost of a shade versus of
duplicating the parts we need there is no real match IMHO.


>
> So considering it all, I would keep a vendored Guava (but also move off
> where we can, and also have our own improvements). I hope it will be a
> near-empty build file to generate it so not a maintenance burden.
>
> Kenn
>
> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com> wrote:
>
>> Nice. This sounds like a great idea (or two?) and goes along with what I
>> just started for futures.
>>
>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>
>> Specific things for our micro guava:
>>
>>  - checkArgumentNotNull can throw the right exception
>>  - our own Optional because Java's is not Serializable
>>  - futures combinators since many are missing, especially Java's don't do
>> exceptions right
>>
>> Protobuf: didn't file an issue because I'm not sure
>>
>> I was wondering if pre-shading works. We really need to get rid of it
>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>> a Message object. (and do the same for beam-model-protobuf-java since the
>> model bits have to depend on each other but nothing else).
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> Huge +1 to get rid of Guava!
>>>
>>> This solves annoying dependency issues for some IOs and allow us to
>>> get rid of the shading that makes current jars bigger than they should
>>> be.
>>>
>>> We can create our own 'micro guava' package with some classes for
>>> things that are hard to migrate, or that we  prefer to still have like
>>> the check* methods for example. Given the size of the task we should
>>> probably divide it into subtasks, more important is to get rid of it
>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>
>>> Other important idea would be to get rid of Protobuf in public APIs
>>> like GCPIO and to better shade it from leaking into the runners. An
>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>> that is byting us for the Spark runner, but well that's worth a
>>> different discussion.
>>>
>>>
>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>> <rm...@gmail.com> wrote:
>>> > a map of list is fine and not a challenge we'll face long I hope ;)
>>> >
>>> >
>>> > Romain Manni-Bucau
>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>> >
>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>> >>
>>> >> Not sure we'll be able to replace them all. Things like guava Table
>>> and
>>> >> Multimap don't have great replacements in Java8.
>>> >>
>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>> jb@nanthrax.net>
>>> >> wrote:
>>> >>>
>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>> >>>
>>> >>> Regards
>>> >>> JB
>>> >>>
>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>> >>> > Why not dropping guava for all beam codebase? With java 8 it is
>>> quite
>>> >>> > easy to do
>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite
>>> a
>>> >>> > good result.
>>> >>> >
>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>> >>> > <ma...@google.com>> a écrit :
>>> >>> >
>>> >>> >     Make sure to include the guava version in the artifact name so
>>> that
>>> >>> > we can
>>> >>> >     have multiple vendored versions.
>>> >>> >
>>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>> klk@google.com
>>> >>> >     <ma...@google.com>> wrote:
>>> >>> >
>>> >>> >         I didn't have time for this, but it just bit me. We
>>> definitely
>>> >>> > have
>>> >>> >         Guava on the API surface of runner support code in ways
>>> that
>>> >>> > get
>>> >>> >         incompatibly shaded. I will probably start "1a" by making a
>>> >>> > shaded
>>> >>> >         library org.apache.beam:vendored-guava and starting to use
>>> it.
>>> >>> > It sounds
>>> >>> >         like there is generally unanimous support for that much,
>>> >>> > anyhow.
>>> >>> >
>>> >>> >         Kenn
>>> >>> >
>>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>> >>> > <aljoscha@apache.org
>>> >>> >         <ma...@apache.org>> wrote:
>>> >>> >
>>> >>> >             Thanks Ismaël for bringing up this discussion again!
>>> >>> >
>>> >>> >             I would be in favour of 1) and more specifically of 1a)
>>> >>> >
>>> >>> >             Aljoscha
>>> >>> >
>>> >>> >
>>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>>> lcwik@google.com
>>> >>> >>             <ma...@google.com>> wrote:
>>> >>> >>
>>> >>> >>             You can always run tests on post shaded artifacts
>>> instead
>>> >>> >> of the
>>> >>> >>             compiled classes, it just requires us to change our
>>> maven
>>> >>> >> surefire
>>> >>> >>             / gradle test configurations but it is true that most
>>> IDEs
>>> >>> >> would
>>> >>> >>             behave better with a dependency jar unless you
>>> delegate
>>> >>> >> all the
>>> >>> >>             build/test actions to the build system and then it
>>> won't
>>> >>> >> matter.
>>> >>> >>
>>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>> >>> >> <klk@google.com
>>> >>> >>             <ma...@google.com>> wrote:
>>> >>> >>
>>> >>> >>                 There's also, with additional overhead,
>>> >>> >>
>>> >>> >>                 1a) A relocated and shipped package for each
>>> thing we
>>> >>> >> want to
>>> >>> >>                 relocate. I think this has also been tried outside
>>> >>> >> Beam...
>>> >>> >>
>>> >>> >>                 Pros:
>>> >>> >>                 * all the pros of 1) plus no bloat beyond what is
>>> >>> >> necessary
>>> >>> >>                 Cons:
>>> >>> >>                 * abandons whitelist approach for public deps,
>>> >>> >> reverting to
>>> >>> >>                 blacklist approach for trouble things like guava,
>>> so a
>>> >>> >> bit
>>> >>> >>                 less principled
>>> >>> >>
>>> >>> >>                 For both 1) and 1a) I would add:
>>> >>> >>
>>> >>> >>                 Pros:
>>> >>> >>                 * clearly readable dependency since code will
>>> `import
>>> >>> >>                 org.apache.beam.private.guava21` and IDEs will
>>> >>> >> understand it
>>> >>> >>                 is a distinct lilbrary
>>> >>> >>                 * can run tests on unpackaged classes, as long as
>>> deps
>>> >>> >> are
>>> >>> >>                 shaded or provided as jars
>>> >>> >>                 * no mysterious action at a distance from
>>> inherited
>>> >>> >> configuration
>>> >>> >>                 Cons:
>>> >>> >>                 * need to adjust imports
>>> >>> >>
>>> >>> >>                 Kenn
>>> >>> >>
>>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>> >>> >> <lcwik@google.com
>>> >>> >>                 <ma...@google.com>> wrote:
>>> >>> >>
>>> >>> >>                     I would suggest that either we use:
>>> >>> >>                     1) A common deps package containing shaded
>>> >>> >> dependencies
>>> >>> >>                     allows for
>>> >>> >>                     Pros
>>> >>> >>                     * doesn't require the user to build an uber
>>> jar
>>> >>> >>                     Risks
>>> >>> >>                     * dependencies package will keep growing even
>>> if
>>> >>> >> something
>>> >>> >>                     is or isn't needed by all of Apache Beam
>>> leading
>>> >>> >> to a
>>> >>> >>                     large jar anyways negating any space savings
>>> >>> >>
>>> >>> >>                     2) Shade within each module to a common
>>> location
>>> >>> >> like
>>> >>> >>                     org.apache.beam.relocated.guava....
>>> >>> >>                     Pros
>>> >>> >>                     * you only get the shaded dependencies of the
>>> >>> >> things that
>>> >>> >>                     are required
>>> >>> >>                     * its one less dependency for users to manage
>>> >>> >>                     Risks
>>> >>> >>                     * requires an uber jar to be built to get the
>>> >>> >> space
>>> >>> >>                     savings (either by a user or a distribution of
>>> >>> >> Apache
>>> >>> >>                     Beam) otherwise we negate any space savings.
>>> >>> >>
>>> >>> >>                     If we either use a common relocation scheme
>>> or a
>>> >>> >>                     dependencies jar then each relocation should
>>> >>> >> specifically
>>> >>> >>                     contain the version number of the package
>>> because
>>> >>> >> we would
>>> >>> >>                     like to allow for us to be using two different
>>> >>> >> versions of
>>> >>> >>                     the same library.
>>> >>> >>
>>> >>> >>                     For the common deps package approach, should
>>> we
>>> >>> >> check in
>>> >>> >>                     code where the imports contain the relocated
>>> >>> >> location
>>> >>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>>> >>> >>
>>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>>> ct.ImmutableList)?
>>> >>> >>
>>> >>> >>
>>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
>>> >>> >> Onofré
>>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>>> wrote:
>>> >>> >>
>>> >>> >>                         Thanks for bringing that back.
>>> >>> >>
>>> >>> >>                         Indeed guava is shaded in different
>>> uber-jar.
>>> >>> >> Maybe we
>>> >>> >>                         can have a common deps module that we
>>> include
>>> >>> >> once
>>> >>> >>                         (but the user will have to explicitly
>>> define
>>> >>> >> the dep) ?
>>> >>> >>
>>> >>> >>                         Basically, what do you propose for
>>> protobuf
>>> >>> >>                         (unfortunately, I don't see an obvious) ?
>>> >>> >>
>>> >>> >>                         Regards
>>> >>> >>                         JB
>>> >>> >>
>>> >>> >>
>>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía
>>> wrote:
>>> >>> >>
>>> >>> >>                             Hello, I wanted to bring back this
>>> subject
>>> >>> >> because
>>> >>> >>                             I think we should
>>> >>> >>                             take action on this and at least first
>>> >>> >> have a
>>> >>> >>                             shaded version of guava.
>>> >>> >>                             I was playing with a toy project and
>>> I did
>>> >>> >> the
>>> >>> >>                             procedure we use to
>>> >>> >>                             submit jars to a Hadoop cluster via
>>> >>> >> Flink/Spark
>>> >>> >>                             which involves
>>> >>> >>                             creating an uber jar and I realized
>>> that
>>> >>> >> the size
>>> >>> >>                             of the jar was way
>>> >>> >>                             bigger than I expected, and the fact
>>> that
>>> >>> >> we shade
>>> >>> >>                             guava in every
>>> >>> >>                             module contributes to this. I found
>>> guava
>>> >>> >> shaded on:
>>> >>> >>
>>> >>> >>                             sdks/java/core
>>> >>> >>                             runners/core-construction-java
>>> >>> >>                             runners/core-java
>>> >>> >>                             model/job-management
>>> >>> >>                             runners/spark
>>> >>> >>                             sdks/java/io/hadoop-file-system
>>> >>> >>                             sdks/java/io/kafka
>>> >>> >>
>>> >>> >>                             This means at least 6 times more of
>>> the
>>> >>> >> size it
>>> >>> >>                             should which counts in
>>> >>> >>                             around 15MB more (2.4MB*6 deps) of
>>> extra
>>> >>> >> weight
>>> >>> >>                             that we can simply
>>> >>> >>                             reduce by using a shaded version of
>>> the
>>> >>> >> library.
>>> >>> >>
>>> >>> >>                             I add this point to the previous ones
>>> >>> >> mentioned
>>> >>> >>                             during the discussion
>>> >>> >>                             because this goes against the end user
>>> >>> >> experience
>>> >>> >>                             and affects us all
>>> >>> >>                             (devs/users).
>>> >>> >>
>>> >>> >>                             Another question is if we should shade
>>> >>> >> (and how)
>>> >>> >>                             protocol buffers
>>> >>> >>                             because now with the portability work
>>> we
>>> >>> >> are
>>> >>> >>                             exposing it closer to the
>>> >>> >>                             end users. I say this because I also
>>> found
>>> >>> >> an
>>> >>> >>                             issue while running a
>>> >>> >>                             job on YARN with the spark runner
>>> because
>>> >>> >>                             hadoop-common includes
>>> >>> >>                             protobuf-java 2 and I had to
>>> explicitly
>>> >>> >> provide
>>> >>> >>                             protocol-buffers 3 as
>>> >>> >>                             a dependency to be able to use
>>> triggers
>>> >>> >> (note the
>>> >>> >>                             Spark runner
>>> >>> >>                             translates them using some method from
>>> >>> >>                             runners/core-java). Since
>>> >>> >>                             hadoop-common is provided in the
>>> cluster
>>> >>> >> with the
>>> >>> >>                             older version of
>>> >>> >>                             protobuf, I am afraid that this will
>>> bite
>>> >>> >> us in
>>> >>> >>                             the future.
>>> >>> >>
>>> >>> >>                             Ismaël
>>> >>> >>
>>> >>> >>                             ps. There is already a JIRA for that
>>> >>> >> shading for
>>> >>> >>                             protobuf on
>>> >>> >>                             hadoop-common but this is not coming
>>> until
>>> >>> >> version
>>> >>> >>                             3 is out.
>>> >>> >>
>>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>> >>> >>
>>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>> >>> >>
>>> >>> >>                             ps2. Extra curious situation is to see
>>> >>> >> that the
>>> >>> >>                             dataflow-runner ends
>>> >>> >>                             up having guava shaded twice via its
>>> >>> >> shaded version on
>>> >>> >>                             core-construction-java.
>>> >>> >>
>>> >>> >>                             ps3. Of course this message means a
>>> >>> >> de-facto +1 at
>>> >>> >>                             least to do it for
>>> >>> >>                             guava and evaluate it for other libs.
>>> >>> >>
>>> >>> >>
>>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM,
>>> Lukasz
>>> >>> >> Cwik
>>> >>> >>                             <lcwik@google.com.invalid
>>> >>> >>                             <ma...@google.com.invalid>>
>>> wrote:
>>> >>> >>
>>> >>> >>                                 An issue to call out is how to
>>> deal
>>> >>> >> with our
>>> >>> >>                                 generated code (.avro and
>>> >>> >>                                 .proto) as I don't believe those
>>> >>> >> plugins allow
>>> >>> >>                                 you to generate code using a
>>> >>> >>                                 shaded package prefix on imports.
>>> >>> >>
>>> >>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
>>> >>> >> Thomas Groh
>>> >>> >>                                 <tgroh@google.com.invalid
>>> >>> >>                                 <mailto:tgroh@google.com.invalid
>>> >>
>>> >>> >>                                 wrote:
>>> >>> >>
>>> >>> >>                                     +1 to the goal. I'm hugely in
>>> >>> >> favor of not
>>> >>> >>                                     doing the same shading work
>>> >>> >>                                     every time for dependencies we
>>> >>> >> know we'll use.
>>> >>> >>
>>> >>> >>                                     This also means that if we
>>> end up
>>> >>> >> pulling
>>> >>> >>                                     in transitive dependencies we
>>> >>> >>                                     don't want in any particular
>>> >>> >> module we can
>>> >>> >>                                     avoid having to adjust our
>>> >>> >>                                     repackaging strategy for that
>>> >>> >> module -
>>> >>> >>                                     which I have run into
>>> face-first
>>> >>> >> in
>>> >>> >>                                     the past.
>>> >>> >>
>>> >>> >>                                     On Tue, Oct 17, 2017 at 9:48
>>> AM,
>>> >>> >> Kenneth
>>> >>> >>                                     Knowles
>>> <klk@google.com.invalid
>>> >>> >>                                     <mailto:klk@google.com.invali
>>> d>>
>>> >>> >>                                     wrote:
>>> >>> >>
>>> >>> >>                                         Hi all,
>>> >>> >>
>>> >>> >>                                         Shading is a big part of
>>> how
>>> >>> >> we keep
>>> >>> >>                                         our dependencies sane in
>>> Beam.
>>> >>> >> But
>>> >>> >>                                         downsides: shading is
>>> super
>>> >>> >> slow,
>>> >>> >>                                         causes massive jar bloat,
>>> and
>>> >>> >> kind of
>>> >>> >>                                         hard to get right because
>>> >>> >> artifacts
>>> >>> >>                                         and namespaces are not
>>> 1-to-1.
>>> >>> >>
>>> >>> >>                                         I know that some
>>> communities
>>> >>> >>                                         distribute their own
>>> shaded
>>> >>> >>                                         distributions of
>>> >>> >>                                         dependencies. I had a
>>> thought
>>> >>> >> about
>>> >>> >>                                         doing something similar
>>> that I
>>> >>> >> wanted
>>> >>> >>                                         to throw out there for
>>> people
>>> >>> >> to poke
>>> >>> >>                                         holes in.
>>> >>> >>
>>> >>> >>                                         To set the scene, here is
>>> how
>>> >>> >> I view
>>> >>> >>                                         shading:
>>> >>> >>
>>> >>> >>                                           - A module has public
>>> >>> >> dependencies
>>> >>> >>                                         and private dependencies.
>>> >>> >>                                           - Public deps are used
>>> for
>>> >>> >> data
>>> >>> >>                                         interchange; users must
>>> share
>>> >>> >> these
>>> >>> >>
>>> >>> >>                                     deps.
>>> >>> >>
>>> >>> >>                                           - Private deps are just
>>> >>> >>                                         functionality and can be
>>> >>> >> hidden (in
>>> >>> >>                                         our case,
>>> >>> >>                                         relocated + bundled)
>>> >>> >>                                           - It isn't necessarily
>>> that
>>> >>> >> simple,
>>> >>> >>                                         because public and private
>>> >>> >> deps
>>> >>> >>
>>> >>> >>                                     might
>>> >>> >>
>>> >>> >>                                         interact in higher-order
>>> ways
>>> >>> >>                                         ("public" is contagious)
>>> >>> >>
>>> >>> >>                                         Shading is an
>>> implementation
>>> >>> >> detail of
>>> >>> >>                                         expressing these
>>> >>> >> characteristics.
>>> >>> >>
>>> >>> >>                                     We
>>> >>> >>
>>> >>> >>                                         use shading selectively
>>> >>> >> because of its
>>> >>> >>                                         downsides I mentioned
>>> above.
>>> >>> >>
>>> >>> >>                                         But what about this idea:
>>> >>> >> Introduce
>>> >>> >>                                         shaded deps as a single
>>> >>> >> separate
>>> >>> >>                                         artifact.
>>> >>> >>
>>> >>> >>                                           -
>>> sdks/java/private-deps:
>>> >>> >> bundled
>>> >>> >>                                         uber jar with relocated
>>> >>> >> versions of
>>> >>> >>                                         everything we want to
>>> shade
>>> >>> >>
>>> >>> >>                                           - sdks/java/core and
>>> >>> >>                                         sdks/java/harness: no
>>> >>> >> relocation or
>>> >>> >>                                         bundling -
>>> >>> >>                                         depends on
>>> >>> >>
>>>  `beam-sdks-java-private-deps`
>>> >>> >> and
>>> >>> >>                                         imports like
>>> >>> >>
>>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>>> >>> >>                                         directly (this is what
>>> >>> >>                                         they
>>> >>> >>                                         are rewritten to
>>> >>> >>
>>> >>> >>                                         Some benefits
>>> >>> >>
>>> >>> >>                                           - much faster builds of
>>> >>> >> other modules
>>> >>> >>                                           - only one shaded uber
>>> jar
>>> >>> >>                                           - rare/no rebuilds of
>>> the
>>> >>> >> uber jar
>>> >>> >>                                           - can use maven
>>> enforcer to
>>> >>> >> forbid
>>> >>> >>                                         imports like
>>> com.google.common
>>> >>> >>                                           - configuration all in
>>> one
>>> >>> >> place
>>> >>> >>                                           - no automated
>>> rewriting of
>>> >>> >> our real
>>> >>> >>                                         code, which has led to
>>> some
>>> >>> >> major
>>> >>> >>                                         confusion
>>> >>> >>                                           - easy to implement
>>> >>> >> incrementally
>>> >>> >>
>>> >>> >>                                         Downsides:
>>> >>> >>
>>> >>> >>                                           - plenty of effort work
>>> to
>>> >>> >> get there
>>> >>> >>                                           - unclear how many
>>> different
>>> >>> >> such
>>> >>> >>                                         deps modules we need;
>>> sharing
>>> >>> >> them
>>> >>> >>
>>> >>> >>                                     could
>>> >>> >>
>>> >>> >>                                         get weird
>>> >>> >>                                           - if we hit a
>>> roadblock, we
>>> >>> >> will
>>> >>> >>                                         have committed a lot of
>>> time
>>> >>> >>
>>> >>> >>                                         Just something I was
>>> musing as
>>> >>> >> I spent
>>> >>> >>                                         another evening waiting
>>> for
>>> >>> >> slow
>>> >>> >>                                         builds to try to confirm
>>> >>> >> changes to
>>> >>> >>                                         brittle poms.
>>> >>> >>
>>> >>> >>                                         Kenn
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>                         --
>>> >>> >>                         Jean-Baptiste Onofré
>>> >>> >>                         jbonofre@apache.org
>>> >>> >> <ma...@apache.org>
>>> >>> >>                         http://blog.nanthrax.net
>>> >>> >> <http://blog.nanthrax.net/>
>>> >>> >>                         Talend - http://www.talend.com
>>> >>> >> <http://www.talend.com/>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>>
>>> >>> --
>>> >>> Jean-Baptiste Onofré
>>> >>> jbonofre@apache.org
>>> >>> http://blog.nanthrax.net
>>> >>> Talend - http://www.talend.com
>>> >>
>>> >>
>>> >
>>>
>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Kenneth Knowles <kl...@google.com>.
Another couple:

 - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
 - ImmutableList and friends and their builders are very widely used and
IMO still add a lot for readability and preventing someone coming along and
adding mistakes to a codebase

So considering it all, I would keep a vendored Guava (but also move off
where we can, and also have our own improvements). I hope it will be a
near-empty build file to generate it so not a maintenance burden.

Kenn

On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <kl...@google.com> wrote:

> Nice. This sounds like a great idea (or two?) and goes along with what I
> just started for futures.
>
> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and assigned
> to Ismaël :-) and converted my futures thing to a subtask.
>
> Specific things for our micro guava:
>
>  - checkArgumentNotNull can throw the right exception
>  - our own Optional because Java's is not Serializable
>  - futures combinators since many are missing, especially Java's don't do
> exceptions right
>
> Protobuf: didn't file an issue because I'm not sure
>
> I was wondering if pre-shading works. We really need to get rid of it from
> public APIs in a 100% reliable way. It is also a problem for Dataflow. I
> was wondering if one approach is to pre-shade gcpio-protobuf-java,
> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
> a Message object. (and do the same for beam-model-protobuf-java since the
> model bits have to depend on each other but nothing else).
>
> Kenn
>
> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com> wrote:
>
>> Huge +1 to get rid of Guava!
>>
>> This solves annoying dependency issues for some IOs and allow us to
>> get rid of the shading that makes current jars bigger than they should
>> be.
>>
>> We can create our own 'micro guava' package with some classes for
>> things that are hard to migrate, or that we  prefer to still have like
>> the check* methods for example. Given the size of the task we should
>> probably divide it into subtasks, more important is to get rid of it
>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>
>> Other important idea would be to get rid of Protobuf in public APIs
>> like GCPIO and to better shade it from leaking into the runners. An
>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>> that is byting us for the Spark runner, but well that's worth a
>> different discussion.
>>
>>
>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>> <rm...@gmail.com> wrote:
>> > a map of list is fine and not a challenge we'll face long I hope ;)
>> >
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>> >
>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>> >>
>> >> Not sure we'll be able to replace them all. Things like guava Table and
>> >> Multimap don't have great replacements in Java8.
>> >>
>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>> jb@nanthrax.net>
>> >> wrote:
>> >>>
>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>> >>> > Why not dropping guava for all beam codebase? With java 8 it is
>> quite
>> >>> > easy to do
>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>> >>> > good result.
>> >>> >
>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>> >>> > <ma...@google.com>> a écrit :
>> >>> >
>> >>> >     Make sure to include the guava version in the artifact name so
>> that
>> >>> > we can
>> >>> >     have multiple vendored versions.
>> >>> >
>> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>> klk@google.com
>> >>> >     <ma...@google.com>> wrote:
>> >>> >
>> >>> >         I didn't have time for this, but it just bit me. We
>> definitely
>> >>> > have
>> >>> >         Guava on the API surface of runner support code in ways that
>> >>> > get
>> >>> >         incompatibly shaded. I will probably start "1a" by making a
>> >>> > shaded
>> >>> >         library org.apache.beam:vendored-guava and starting to use
>> it.
>> >>> > It sounds
>> >>> >         like there is generally unanimous support for that much,
>> >>> > anyhow.
>> >>> >
>> >>> >         Kenn
>> >>> >
>> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>> >>> > <aljoscha@apache.org
>> >>> >         <ma...@apache.org>> wrote:
>> >>> >
>> >>> >             Thanks Ismaël for bringing up this discussion again!
>> >>> >
>> >>> >             I would be in favour of 1) and more specifically of 1a)
>> >>> >
>> >>> >             Aljoscha
>> >>> >
>> >>> >
>> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
>> lcwik@google.com
>> >>> >>             <ma...@google.com>> wrote:
>> >>> >>
>> >>> >>             You can always run tests on post shaded artifacts
>> instead
>> >>> >> of the
>> >>> >>             compiled classes, it just requires us to change our
>> maven
>> >>> >> surefire
>> >>> >>             / gradle test configurations but it is true that most
>> IDEs
>> >>> >> would
>> >>> >>             behave better with a dependency jar unless you delegate
>> >>> >> all the
>> >>> >>             build/test actions to the build system and then it
>> won't
>> >>> >> matter.
>> >>> >>
>> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>> >>> >> <klk@google.com
>> >>> >>             <ma...@google.com>> wrote:
>> >>> >>
>> >>> >>                 There's also, with additional overhead,
>> >>> >>
>> >>> >>                 1a) A relocated and shipped package for each thing
>> we
>> >>> >> want to
>> >>> >>                 relocate. I think this has also been tried outside
>> >>> >> Beam...
>> >>> >>
>> >>> >>                 Pros:
>> >>> >>                 * all the pros of 1) plus no bloat beyond what is
>> >>> >> necessary
>> >>> >>                 Cons:
>> >>> >>                 * abandons whitelist approach for public deps,
>> >>> >> reverting to
>> >>> >>                 blacklist approach for trouble things like guava,
>> so a
>> >>> >> bit
>> >>> >>                 less principled
>> >>> >>
>> >>> >>                 For both 1) and 1a) I would add:
>> >>> >>
>> >>> >>                 Pros:
>> >>> >>                 * clearly readable dependency since code will
>> `import
>> >>> >>                 org.apache.beam.private.guava21` and IDEs will
>> >>> >> understand it
>> >>> >>                 is a distinct lilbrary
>> >>> >>                 * can run tests on unpackaged classes, as long as
>> deps
>> >>> >> are
>> >>> >>                 shaded or provided as jars
>> >>> >>                 * no mysterious action at a distance from inherited
>> >>> >> configuration
>> >>> >>                 Cons:
>> >>> >>                 * need to adjust imports
>> >>> >>
>> >>> >>                 Kenn
>> >>> >>
>> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>> >>> >> <lcwik@google.com
>> >>> >>                 <ma...@google.com>> wrote:
>> >>> >>
>> >>> >>                     I would suggest that either we use:
>> >>> >>                     1) A common deps package containing shaded
>> >>> >> dependencies
>> >>> >>                     allows for
>> >>> >>                     Pros
>> >>> >>                     * doesn't require the user to build an uber jar
>> >>> >>                     Risks
>> >>> >>                     * dependencies package will keep growing even
>> if
>> >>> >> something
>> >>> >>                     is or isn't needed by all of Apache Beam
>> leading
>> >>> >> to a
>> >>> >>                     large jar anyways negating any space savings
>> >>> >>
>> >>> >>                     2) Shade within each module to a common
>> location
>> >>> >> like
>> >>> >>                     org.apache.beam.relocated.guava....
>> >>> >>                     Pros
>> >>> >>                     * you only get the shaded dependencies of the
>> >>> >> things that
>> >>> >>                     are required
>> >>> >>                     * its one less dependency for users to manage
>> >>> >>                     Risks
>> >>> >>                     * requires an uber jar to be built to get the
>> >>> >> space
>> >>> >>                     savings (either by a user or a distribution of
>> >>> >> Apache
>> >>> >>                     Beam) otherwise we negate any space savings.
>> >>> >>
>> >>> >>                     If we either use a common relocation scheme or
>> a
>> >>> >>                     dependencies jar then each relocation should
>> >>> >> specifically
>> >>> >>                     contain the version number of the package
>> because
>> >>> >> we would
>> >>> >>                     like to allow for us to be using two different
>> >>> >> versions of
>> >>> >>                     the same library.
>> >>> >>
>> >>> >>                     For the common deps package approach, should we
>> >>> >> check in
>> >>> >>                     code where the imports contain the relocated
>> >>> >> location
>> >>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>> >>> >>
>> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.colle
>> ct.ImmutableList)?
>> >>> >>
>> >>> >>
>> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
>> >>> >> Onofré
>> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
>> wrote:
>> >>> >>
>> >>> >>                         Thanks for bringing that back.
>> >>> >>
>> >>> >>                         Indeed guava is shaded in different
>> uber-jar.
>> >>> >> Maybe we
>> >>> >>                         can have a common deps module that we
>> include
>> >>> >> once
>> >>> >>                         (but the user will have to explicitly
>> define
>> >>> >> the dep) ?
>> >>> >>
>> >>> >>                         Basically, what do you propose for protobuf
>> >>> >>                         (unfortunately, I don't see an obvious) ?
>> >>> >>
>> >>> >>                         Regards
>> >>> >>                         JB
>> >>> >>
>> >>> >>
>> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>> >>> >>
>> >>> >>                             Hello, I wanted to bring back this
>> subject
>> >>> >> because
>> >>> >>                             I think we should
>> >>> >>                             take action on this and at least first
>> >>> >> have a
>> >>> >>                             shaded version of guava.
>> >>> >>                             I was playing with a toy project and I
>> did
>> >>> >> the
>> >>> >>                             procedure we use to
>> >>> >>                             submit jars to a Hadoop cluster via
>> >>> >> Flink/Spark
>> >>> >>                             which involves
>> >>> >>                             creating an uber jar and I realized
>> that
>> >>> >> the size
>> >>> >>                             of the jar was way
>> >>> >>                             bigger than I expected, and the fact
>> that
>> >>> >> we shade
>> >>> >>                             guava in every
>> >>> >>                             module contributes to this. I found
>> guava
>> >>> >> shaded on:
>> >>> >>
>> >>> >>                             sdks/java/core
>> >>> >>                             runners/core-construction-java
>> >>> >>                             runners/core-java
>> >>> >>                             model/job-management
>> >>> >>                             runners/spark
>> >>> >>                             sdks/java/io/hadoop-file-system
>> >>> >>                             sdks/java/io/kafka
>> >>> >>
>> >>> >>                             This means at least 6 times more of the
>> >>> >> size it
>> >>> >>                             should which counts in
>> >>> >>                             around 15MB more (2.4MB*6 deps) of
>> extra
>> >>> >> weight
>> >>> >>                             that we can simply
>> >>> >>                             reduce by using a shaded version of the
>> >>> >> library.
>> >>> >>
>> >>> >>                             I add this point to the previous ones
>> >>> >> mentioned
>> >>> >>                             during the discussion
>> >>> >>                             because this goes against the end user
>> >>> >> experience
>> >>> >>                             and affects us all
>> >>> >>                             (devs/users).
>> >>> >>
>> >>> >>                             Another question is if we should shade
>> >>> >> (and how)
>> >>> >>                             protocol buffers
>> >>> >>                             because now with the portability work
>> we
>> >>> >> are
>> >>> >>                             exposing it closer to the
>> >>> >>                             end users. I say this because I also
>> found
>> >>> >> an
>> >>> >>                             issue while running a
>> >>> >>                             job on YARN with the spark runner
>> because
>> >>> >>                             hadoop-common includes
>> >>> >>                             protobuf-java 2 and I had to explicitly
>> >>> >> provide
>> >>> >>                             protocol-buffers 3 as
>> >>> >>                             a dependency to be able to use triggers
>> >>> >> (note the
>> >>> >>                             Spark runner
>> >>> >>                             translates them using some method from
>> >>> >>                             runners/core-java). Since
>> >>> >>                             hadoop-common is provided in the
>> cluster
>> >>> >> with the
>> >>> >>                             older version of
>> >>> >>                             protobuf, I am afraid that this will
>> bite
>> >>> >> us in
>> >>> >>                             the future.
>> >>> >>
>> >>> >>                             Ismaël
>> >>> >>
>> >>> >>                             ps. There is already a JIRA for that
>> >>> >> shading for
>> >>> >>                             protobuf on
>> >>> >>                             hadoop-common but this is not coming
>> until
>> >>> >> version
>> >>> >>                             3 is out.
>> >>> >>
>> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>> >>> >>
>> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>> >>> >>
>> >>> >>                             ps2. Extra curious situation is to see
>> >>> >> that the
>> >>> >>                             dataflow-runner ends
>> >>> >>                             up having guava shaded twice via its
>> >>> >> shaded version on
>> >>> >>                             core-construction-java.
>> >>> >>
>> >>> >>                             ps3. Of course this message means a
>> >>> >> de-facto +1 at
>> >>> >>                             least to do it for
>> >>> >>                             guava and evaluate it for other libs.
>> >>> >>
>> >>> >>
>> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz
>> >>> >> Cwik
>> >>> >>                             <lcwik@google.com.invalid
>> >>> >>                             <ma...@google.com.invalid>>
>> wrote:
>> >>> >>
>> >>> >>                                 An issue to call out is how to deal
>> >>> >> with our
>> >>> >>                                 generated code (.avro and
>> >>> >>                                 .proto) as I don't believe those
>> >>> >> plugins allow
>> >>> >>                                 you to generate code using a
>> >>> >>                                 shaded package prefix on imports.
>> >>> >>
>> >>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
>> >>> >> Thomas Groh
>> >>> >>                                 <tgroh@google.com.invalid
>> >>> >>                                 <ma...@google.com.invalid>>
>> >>> >>                                 wrote:
>> >>> >>
>> >>> >>                                     +1 to the goal. I'm hugely in
>> >>> >> favor of not
>> >>> >>                                     doing the same shading work
>> >>> >>                                     every time for dependencies we
>> >>> >> know we'll use.
>> >>> >>
>> >>> >>                                     This also means that if we end
>> up
>> >>> >> pulling
>> >>> >>                                     in transitive dependencies we
>> >>> >>                                     don't want in any particular
>> >>> >> module we can
>> >>> >>                                     avoid having to adjust our
>> >>> >>                                     repackaging strategy for that
>> >>> >> module -
>> >>> >>                                     which I have run into
>> face-first
>> >>> >> in
>> >>> >>                                     the past.
>> >>> >>
>> >>> >>                                     On Tue, Oct 17, 2017 at 9:48
>> AM,
>> >>> >> Kenneth
>> >>> >>                                     Knowles <klk@google.com.invalid
>> >>> >>                                     <mailto:klk@google.com.invalid
>> >>
>> >>> >>                                     wrote:
>> >>> >>
>> >>> >>                                         Hi all,
>> >>> >>
>> >>> >>                                         Shading is a big part of
>> how
>> >>> >> we keep
>> >>> >>                                         our dependencies sane in
>> Beam.
>> >>> >> But
>> >>> >>                                         downsides: shading is super
>> >>> >> slow,
>> >>> >>                                         causes massive jar bloat,
>> and
>> >>> >> kind of
>> >>> >>                                         hard to get right because
>> >>> >> artifacts
>> >>> >>                                         and namespaces are not
>> 1-to-1.
>> >>> >>
>> >>> >>                                         I know that some
>> communities
>> >>> >>                                         distribute their own shaded
>> >>> >>                                         distributions of
>> >>> >>                                         dependencies. I had a
>> thought
>> >>> >> about
>> >>> >>                                         doing something similar
>> that I
>> >>> >> wanted
>> >>> >>                                         to throw out there for
>> people
>> >>> >> to poke
>> >>> >>                                         holes in.
>> >>> >>
>> >>> >>                                         To set the scene, here is
>> how
>> >>> >> I view
>> >>> >>                                         shading:
>> >>> >>
>> >>> >>                                           - A module has public
>> >>> >> dependencies
>> >>> >>                                         and private dependencies.
>> >>> >>                                           - Public deps are used
>> for
>> >>> >> data
>> >>> >>                                         interchange; users must
>> share
>> >>> >> these
>> >>> >>
>> >>> >>                                     deps.
>> >>> >>
>> >>> >>                                           - Private deps are just
>> >>> >>                                         functionality and can be
>> >>> >> hidden (in
>> >>> >>                                         our case,
>> >>> >>                                         relocated + bundled)
>> >>> >>                                           - It isn't necessarily
>> that
>> >>> >> simple,
>> >>> >>                                         because public and private
>> >>> >> deps
>> >>> >>
>> >>> >>                                     might
>> >>> >>
>> >>> >>                                         interact in higher-order
>> ways
>> >>> >>                                         ("public" is contagious)
>> >>> >>
>> >>> >>                                         Shading is an
>> implementation
>> >>> >> detail of
>> >>> >>                                         expressing these
>> >>> >> characteristics.
>> >>> >>
>> >>> >>                                     We
>> >>> >>
>> >>> >>                                         use shading selectively
>> >>> >> because of its
>> >>> >>                                         downsides I mentioned
>> above.
>> >>> >>
>> >>> >>                                         But what about this idea:
>> >>> >> Introduce
>> >>> >>                                         shaded deps as a single
>> >>> >> separate
>> >>> >>                                         artifact.
>> >>> >>
>> >>> >>                                           - sdks/java/private-deps:
>> >>> >> bundled
>> >>> >>                                         uber jar with relocated
>> >>> >> versions of
>> >>> >>                                         everything we want to shade
>> >>> >>
>> >>> >>                                           - sdks/java/core and
>> >>> >>                                         sdks/java/harness: no
>> >>> >> relocation or
>> >>> >>                                         bundling -
>> >>> >>                                         depends on
>> >>> >>
>>  `beam-sdks-java-private-deps`
>> >>> >> and
>> >>> >>                                         imports like
>> >>> >>
>> >>> >> `org.apache.beam.sdk.private.com.google.common`
>> >>> >>                                         directly (this is what
>> >>> >>                                         they
>> >>> >>                                         are rewritten to
>> >>> >>
>> >>> >>                                         Some benefits
>> >>> >>
>> >>> >>                                           - much faster builds of
>> >>> >> other modules
>> >>> >>                                           - only one shaded uber
>> jar
>> >>> >>                                           - rare/no rebuilds of the
>> >>> >> uber jar
>> >>> >>                                           - can use maven enforcer
>> to
>> >>> >> forbid
>> >>> >>                                         imports like
>> com.google.common
>> >>> >>                                           - configuration all in
>> one
>> >>> >> place
>> >>> >>                                           - no automated rewriting
>> of
>> >>> >> our real
>> >>> >>                                         code, which has led to some
>> >>> >> major
>> >>> >>                                         confusion
>> >>> >>                                           - easy to implement
>> >>> >> incrementally
>> >>> >>
>> >>> >>                                         Downsides:
>> >>> >>
>> >>> >>                                           - plenty of effort work
>> to
>> >>> >> get there
>> >>> >>                                           - unclear how many
>> different
>> >>> >> such
>> >>> >>                                         deps modules we need;
>> sharing
>> >>> >> them
>> >>> >>
>> >>> >>                                     could
>> >>> >>
>> >>> >>                                         get weird
>> >>> >>                                           - if we hit a roadblock,
>> we
>> >>> >> will
>> >>> >>                                         have committed a lot of
>> time
>> >>> >>
>> >>> >>                                         Just something I was
>> musing as
>> >>> >> I spent
>> >>> >>                                         another evening waiting for
>> >>> >> slow
>> >>> >>                                         builds to try to confirm
>> >>> >> changes to
>> >>> >>                                         brittle poms.
>> >>> >>
>> >>> >>                                         Kenn
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>                         --
>> >>> >>                         Jean-Baptiste Onofré
>> >>> >>                         jbonofre@apache.org
>> >>> >> <ma...@apache.org>
>> >>> >>                         http://blog.nanthrax.net
>> >>> >> <http://blog.nanthrax.net/>
>> >>> >>                         Talend - http://www.talend.com
>> >>> >> <http://www.talend.com/>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>>
>> >>> --
>> >>> Jean-Baptiste Onofré
>> >>> jbonofre@apache.org
>> >>> http://blog.nanthrax.net
>> >>> Talend - http://www.talend.com
>> >>
>> >>
>> >
>>
>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Kenneth Knowles <kl...@google.com>.
Nice. This sounds like a great idea (or two?) and goes along with what I
just started for futures.

Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and assigned
to Ismaël :-) and converted my futures thing to a subtask.

Specific things for our micro guava:

 - checkArgumentNotNull can throw the right exception
 - our own Optional because Java's is not Serializable
 - futures combinators since many are missing, especially Java's don't do
exceptions right

Protobuf: didn't file an issue because I'm not sure

I was wondering if pre-shading works. We really need to get rid of it from
public APIs in a 100% reliable way. It is also a problem for Dataflow. I
was wondering if one approach is to pre-shade gcpio-protobuf-java,
gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
a Message object. (and do the same for beam-model-protobuf-java since the
model bits have to depend on each other but nothing else).

Kenn

On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ie...@gmail.com> wrote:

> Huge +1 to get rid of Guava!
>
> This solves annoying dependency issues for some IOs and allow us to
> get rid of the shading that makes current jars bigger than they should
> be.
>
> We can create our own 'micro guava' package with some classes for
> things that are hard to migrate, or that we  prefer to still have like
> the check* methods for example. Given the size of the task we should
> probably divide it into subtasks, more important is to get rid of it
> for 'sdks/java/core'. We can then attack other areas afterwards.
>
> Other important idea would be to get rid of Protobuf in public APIs
> like GCPIO and to better shade it from leaking into the runners. An
> unexpected side effect of this is a leak of netty via gRPC/protobuf
> that is byting us for the Spark runner, but well that's worth a
> different discussion.
>
>
> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
> <rm...@gmail.com> wrote:
> > a map of list is fine and not a challenge we'll face long I hope ;)
> >
> >
> > Romain Manni-Bucau
> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
> >
> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
> >>
> >> Not sure we'll be able to replace them all. Things like guava Table and
> >> Multimap don't have great replacements in Java8.
> >>
> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <jb@nanthrax.net
> >
> >> wrote:
> >>>
> >>> +1, it was on my TODO for a while waiting the Java8 update.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
> >>> > Why not dropping guava for all beam codebase? With java 8 it is quite
> >>> > easy to do
> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
> >>> > good result.
> >>> >
> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
> >>> > <ma...@google.com>> a écrit :
> >>> >
> >>> >     Make sure to include the guava version in the artifact name so
> that
> >>> > we can
> >>> >     have multiple vendored versions.
> >>> >
> >>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <klk@google.com
> >>> >     <ma...@google.com>> wrote:
> >>> >
> >>> >         I didn't have time for this, but it just bit me. We
> definitely
> >>> > have
> >>> >         Guava on the API surface of runner support code in ways that
> >>> > get
> >>> >         incompatibly shaded. I will probably start "1a" by making a
> >>> > shaded
> >>> >         library org.apache.beam:vendored-guava and starting to use
> it.
> >>> > It sounds
> >>> >         like there is generally unanimous support for that much,
> >>> > anyhow.
> >>> >
> >>> >         Kenn
> >>> >
> >>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
> >>> > <aljoscha@apache.org
> >>> >         <ma...@apache.org>> wrote:
> >>> >
> >>> >             Thanks Ismaël for bringing up this discussion again!
> >>> >
> >>> >             I would be in favour of 1) and more specifically of 1a)
> >>> >
> >>> >             Aljoscha
> >>> >
> >>> >
> >>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <
> lcwik@google.com
> >>> >>             <ma...@google.com>> wrote:
> >>> >>
> >>> >>             You can always run tests on post shaded artifacts
> instead
> >>> >> of the
> >>> >>             compiled classes, it just requires us to change our
> maven
> >>> >> surefire
> >>> >>             / gradle test configurations but it is true that most
> IDEs
> >>> >> would
> >>> >>             behave better with a dependency jar unless you delegate
> >>> >> all the
> >>> >>             build/test actions to the build system and then it won't
> >>> >> matter.
> >>> >>
> >>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
> >>> >> <klk@google.com
> >>> >>             <ma...@google.com>> wrote:
> >>> >>
> >>> >>                 There's also, with additional overhead,
> >>> >>
> >>> >>                 1a) A relocated and shipped package for each thing
> we
> >>> >> want to
> >>> >>                 relocate. I think this has also been tried outside
> >>> >> Beam...
> >>> >>
> >>> >>                 Pros:
> >>> >>                 * all the pros of 1) plus no bloat beyond what is
> >>> >> necessary
> >>> >>                 Cons:
> >>> >>                 * abandons whitelist approach for public deps,
> >>> >> reverting to
> >>> >>                 blacklist approach for trouble things like guava,
> so a
> >>> >> bit
> >>> >>                 less principled
> >>> >>
> >>> >>                 For both 1) and 1a) I would add:
> >>> >>
> >>> >>                 Pros:
> >>> >>                 * clearly readable dependency since code will
> `import
> >>> >>                 org.apache.beam.private.guava21` and IDEs will
> >>> >> understand it
> >>> >>                 is a distinct lilbrary
> >>> >>                 * can run tests on unpackaged classes, as long as
> deps
> >>> >> are
> >>> >>                 shaded or provided as jars
> >>> >>                 * no mysterious action at a distance from inherited
> >>> >> configuration
> >>> >>                 Cons:
> >>> >>                 * need to adjust imports
> >>> >>
> >>> >>                 Kenn
> >>> >>
> >>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
> >>> >> <lcwik@google.com
> >>> >>                 <ma...@google.com>> wrote:
> >>> >>
> >>> >>                     I would suggest that either we use:
> >>> >>                     1) A common deps package containing shaded
> >>> >> dependencies
> >>> >>                     allows for
> >>> >>                     Pros
> >>> >>                     * doesn't require the user to build an uber jar
> >>> >>                     Risks
> >>> >>                     * dependencies package will keep growing even if
> >>> >> something
> >>> >>                     is or isn't needed by all of Apache Beam leading
> >>> >> to a
> >>> >>                     large jar anyways negating any space savings
> >>> >>
> >>> >>                     2) Shade within each module to a common location
> >>> >> like
> >>> >>                     org.apache.beam.relocated.guava....
> >>> >>                     Pros
> >>> >>                     * you only get the shaded dependencies of the
> >>> >> things that
> >>> >>                     are required
> >>> >>                     * its one less dependency for users to manage
> >>> >>                     Risks
> >>> >>                     * requires an uber jar to be built to get the
> >>> >> space
> >>> >>                     savings (either by a user or a distribution of
> >>> >> Apache
> >>> >>                     Beam) otherwise we negate any space savings.
> >>> >>
> >>> >>                     If we either use a common relocation scheme or a
> >>> >>                     dependencies jar then each relocation should
> >>> >> specifically
> >>> >>                     contain the version number of the package
> because
> >>> >> we would
> >>> >>                     like to allow for us to be using two different
> >>> >> versions of
> >>> >>                     the same library.
> >>> >>
> >>> >>                     For the common deps package approach, should we
> >>> >> check in
> >>> >>                     code where the imports contain the relocated
> >>> >> location
> >>> >>                     (e.g. import org.apache.beam.guava.20.0.com
> >>> >>
> >>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.
> collect.ImmutableList)?
> >>> >>
> >>> >>
> >>> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
> >>> >> Onofré
> >>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>>
> wrote:
> >>> >>
> >>> >>                         Thanks for bringing that back.
> >>> >>
> >>> >>                         Indeed guava is shaded in different
> uber-jar.
> >>> >> Maybe we
> >>> >>                         can have a common deps module that we
> include
> >>> >> once
> >>> >>                         (but the user will have to explicitly define
> >>> >> the dep) ?
> >>> >>
> >>> >>                         Basically, what do you propose for protobuf
> >>> >>                         (unfortunately, I don't see an obvious) ?
> >>> >>
> >>> >>                         Regards
> >>> >>                         JB
> >>> >>
> >>> >>
> >>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
> >>> >>
> >>> >>                             Hello, I wanted to bring back this
> subject
> >>> >> because
> >>> >>                             I think we should
> >>> >>                             take action on this and at least first
> >>> >> have a
> >>> >>                             shaded version of guava.
> >>> >>                             I was playing with a toy project and I
> did
> >>> >> the
> >>> >>                             procedure we use to
> >>> >>                             submit jars to a Hadoop cluster via
> >>> >> Flink/Spark
> >>> >>                             which involves
> >>> >>                             creating an uber jar and I realized that
> >>> >> the size
> >>> >>                             of the jar was way
> >>> >>                             bigger than I expected, and the fact
> that
> >>> >> we shade
> >>> >>                             guava in every
> >>> >>                             module contributes to this. I found
> guava
> >>> >> shaded on:
> >>> >>
> >>> >>                             sdks/java/core
> >>> >>                             runners/core-construction-java
> >>> >>                             runners/core-java
> >>> >>                             model/job-management
> >>> >>                             runners/spark
> >>> >>                             sdks/java/io/hadoop-file-system
> >>> >>                             sdks/java/io/kafka
> >>> >>
> >>> >>                             This means at least 6 times more of the
> >>> >> size it
> >>> >>                             should which counts in
> >>> >>                             around 15MB more (2.4MB*6 deps) of extra
> >>> >> weight
> >>> >>                             that we can simply
> >>> >>                             reduce by using a shaded version of the
> >>> >> library.
> >>> >>
> >>> >>                             I add this point to the previous ones
> >>> >> mentioned
> >>> >>                             during the discussion
> >>> >>                             because this goes against the end user
> >>> >> experience
> >>> >>                             and affects us all
> >>> >>                             (devs/users).
> >>> >>
> >>> >>                             Another question is if we should shade
> >>> >> (and how)
> >>> >>                             protocol buffers
> >>> >>                             because now with the portability work we
> >>> >> are
> >>> >>                             exposing it closer to the
> >>> >>                             end users. I say this because I also
> found
> >>> >> an
> >>> >>                             issue while running a
> >>> >>                             job on YARN with the spark runner
> because
> >>> >>                             hadoop-common includes
> >>> >>                             protobuf-java 2 and I had to explicitly
> >>> >> provide
> >>> >>                             protocol-buffers 3 as
> >>> >>                             a dependency to be able to use triggers
> >>> >> (note the
> >>> >>                             Spark runner
> >>> >>                             translates them using some method from
> >>> >>                             runners/core-java). Since
> >>> >>                             hadoop-common is provided in the cluster
> >>> >> with the
> >>> >>                             older version of
> >>> >>                             protobuf, I am afraid that this will
> bite
> >>> >> us in
> >>> >>                             the future.
> >>> >>
> >>> >>                             Ismaël
> >>> >>
> >>> >>                             ps. There is already a JIRA for that
> >>> >> shading for
> >>> >>                             protobuf on
> >>> >>                             hadoop-common but this is not coming
> until
> >>> >> version
> >>> >>                             3 is out.
> >>> >>
> >>> >> https://issues.apache.org/jira/browse/HADOOP-13136
> >>> >>
> >>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
> >>> >>
> >>> >>                             ps2. Extra curious situation is to see
> >>> >> that the
> >>> >>                             dataflow-runner ends
> >>> >>                             up having guava shaded twice via its
> >>> >> shaded version on
> >>> >>                             core-construction-java.
> >>> >>
> >>> >>                             ps3. Of course this message means a
> >>> >> de-facto +1 at
> >>> >>                             least to do it for
> >>> >>                             guava and evaluate it for other libs.
> >>> >>
> >>> >>
> >>> >>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz
> >>> >> Cwik
> >>> >>                             <lcwik@google.com.invalid
> >>> >>                             <ma...@google.com.invalid>>
> wrote:
> >>> >>
> >>> >>                                 An issue to call out is how to deal
> >>> >> with our
> >>> >>                                 generated code (.avro and
> >>> >>                                 .proto) as I don't believe those
> >>> >> plugins allow
> >>> >>                                 you to generate code using a
> >>> >>                                 shaded package prefix on imports.
> >>> >>
> >>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
> >>> >> Thomas Groh
> >>> >>                                 <tgroh@google.com.invalid
> >>> >>                                 <ma...@google.com.invalid>>
> >>> >>                                 wrote:
> >>> >>
> >>> >>                                     +1 to the goal. I'm hugely in
> >>> >> favor of not
> >>> >>                                     doing the same shading work
> >>> >>                                     every time for dependencies we
> >>> >> know we'll use.
> >>> >>
> >>> >>                                     This also means that if we end
> up
> >>> >> pulling
> >>> >>                                     in transitive dependencies we
> >>> >>                                     don't want in any particular
> >>> >> module we can
> >>> >>                                     avoid having to adjust our
> >>> >>                                     repackaging strategy for that
> >>> >> module -
> >>> >>                                     which I have run into face-first
> >>> >> in
> >>> >>                                     the past.
> >>> >>
> >>> >>                                     On Tue, Oct 17, 2017 at 9:48 AM,
> >>> >> Kenneth
> >>> >>                                     Knowles <klk@google.com.invalid
> >>> >>                                     <mailto:klk@google.com.invalid
> >>
> >>> >>                                     wrote:
> >>> >>
> >>> >>                                         Hi all,
> >>> >>
> >>> >>                                         Shading is a big part of how
> >>> >> we keep
> >>> >>                                         our dependencies sane in
> Beam.
> >>> >> But
> >>> >>                                         downsides: shading is super
> >>> >> slow,
> >>> >>                                         causes massive jar bloat,
> and
> >>> >> kind of
> >>> >>                                         hard to get right because
> >>> >> artifacts
> >>> >>                                         and namespaces are not
> 1-to-1.
> >>> >>
> >>> >>                                         I know that some communities
> >>> >>                                         distribute their own shaded
> >>> >>                                         distributions of
> >>> >>                                         dependencies. I had a
> thought
> >>> >> about
> >>> >>                                         doing something similar
> that I
> >>> >> wanted
> >>> >>                                         to throw out there for
> people
> >>> >> to poke
> >>> >>                                         holes in.
> >>> >>
> >>> >>                                         To set the scene, here is
> how
> >>> >> I view
> >>> >>                                         shading:
> >>> >>
> >>> >>                                           - A module has public
> >>> >> dependencies
> >>> >>                                         and private dependencies.
> >>> >>                                           - Public deps are used for
> >>> >> data
> >>> >>                                         interchange; users must
> share
> >>> >> these
> >>> >>
> >>> >>                                     deps.
> >>> >>
> >>> >>                                           - Private deps are just
> >>> >>                                         functionality and can be
> >>> >> hidden (in
> >>> >>                                         our case,
> >>> >>                                         relocated + bundled)
> >>> >>                                           - It isn't necessarily
> that
> >>> >> simple,
> >>> >>                                         because public and private
> >>> >> deps
> >>> >>
> >>> >>                                     might
> >>> >>
> >>> >>                                         interact in higher-order
> ways
> >>> >>                                         ("public" is contagious)
> >>> >>
> >>> >>                                         Shading is an implementation
> >>> >> detail of
> >>> >>                                         expressing these
> >>> >> characteristics.
> >>> >>
> >>> >>                                     We
> >>> >>
> >>> >>                                         use shading selectively
> >>> >> because of its
> >>> >>                                         downsides I mentioned above.
> >>> >>
> >>> >>                                         But what about this idea:
> >>> >> Introduce
> >>> >>                                         shaded deps as a single
> >>> >> separate
> >>> >>                                         artifact.
> >>> >>
> >>> >>                                           - sdks/java/private-deps:
> >>> >> bundled
> >>> >>                                         uber jar with relocated
> >>> >> versions of
> >>> >>                                         everything we want to shade
> >>> >>
> >>> >>                                           - sdks/java/core and
> >>> >>                                         sdks/java/harness: no
> >>> >> relocation or
> >>> >>                                         bundling -
> >>> >>                                         depends on
> >>> >>
>  `beam-sdks-java-private-deps`
> >>> >> and
> >>> >>                                         imports like
> >>> >>
> >>> >> `org.apache.beam.sdk.private.com.google.common`
> >>> >>                                         directly (this is what
> >>> >>                                         they
> >>> >>                                         are rewritten to
> >>> >>
> >>> >>                                         Some benefits
> >>> >>
> >>> >>                                           - much faster builds of
> >>> >> other modules
> >>> >>                                           - only one shaded uber jar
> >>> >>                                           - rare/no rebuilds of the
> >>> >> uber jar
> >>> >>                                           - can use maven enforcer
> to
> >>> >> forbid
> >>> >>                                         imports like
> com.google.common
> >>> >>                                           - configuration all in one
> >>> >> place
> >>> >>                                           - no automated rewriting
> of
> >>> >> our real
> >>> >>                                         code, which has led to some
> >>> >> major
> >>> >>                                         confusion
> >>> >>                                           - easy to implement
> >>> >> incrementally
> >>> >>
> >>> >>                                         Downsides:
> >>> >>
> >>> >>                                           - plenty of effort work to
> >>> >> get there
> >>> >>                                           - unclear how many
> different
> >>> >> such
> >>> >>                                         deps modules we need;
> sharing
> >>> >> them
> >>> >>
> >>> >>                                     could
> >>> >>
> >>> >>                                         get weird
> >>> >>                                           - if we hit a roadblock,
> we
> >>> >> will
> >>> >>                                         have committed a lot of time
> >>> >>
> >>> >>                                         Just something I was musing
> as
> >>> >> I spent
> >>> >>                                         another evening waiting for
> >>> >> slow
> >>> >>                                         builds to try to confirm
> >>> >> changes to
> >>> >>                                         brittle poms.
> >>> >>
> >>> >>                                         Kenn
> >>> >>
> >>> >>
> >>> >>
> >>> >>                         --
> >>> >>                         Jean-Baptiste Onofré
> >>> >>                         jbonofre@apache.org
> >>> >> <ma...@apache.org>
> >>> >>                         http://blog.nanthrax.net
> >>> >> <http://blog.nanthrax.net/>
> >>> >>                         Talend - http://www.talend.com
> >>> >> <http://www.talend.com/>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>> >
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbonofre@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>
> >>
> >
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Ismaël Mejía <ie...@gmail.com>.
Huge +1 to get rid of Guava!

This solves annoying dependency issues for some IOs and allow us to
get rid of the shading that makes current jars bigger than they should
be.

We can create our own 'micro guava' package with some classes for
things that are hard to migrate, or that we  prefer to still have like
the check* methods for example. Given the size of the task we should
probably divide it into subtasks, more important is to get rid of it
for 'sdks/java/core'. We can then attack other areas afterwards.

Other important idea would be to get rid of Protobuf in public APIs
like GCPIO and to better shade it from leaking into the runners. An
unexpected side effect of this is a leak of netty via gRPC/protobuf
that is byting us for the Spark runner, but well that's worth a
different discussion.


On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
<rm...@gmail.com> wrote:
> a map of list is fine and not a challenge we'll face long I hope ;)
>
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>
> 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>> Not sure we'll be able to replace them all. Things like guava Table and
>> Multimap don't have great replacements in Java8.
>>
>> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>>
>>> +1, it was on my TODO for a while waiting the Java8 update.
>>>
>>> Regards
>>> JB
>>>
>>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>> > Why not dropping guava for all beam codebase? With java 8 it is quite
>>> > easy to do
>>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>>> > good result.
>>> >
>>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>>> > <ma...@google.com>> a écrit :
>>> >
>>> >     Make sure to include the guava version in the artifact name so that
>>> > we can
>>> >     have multiple vendored versions.
>>> >
>>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <klk@google.com
>>> >     <ma...@google.com>> wrote:
>>> >
>>> >         I didn't have time for this, but it just bit me. We definitely
>>> > have
>>> >         Guava on the API surface of runner support code in ways that
>>> > get
>>> >         incompatibly shaded. I will probably start "1a" by making a
>>> > shaded
>>> >         library org.apache.beam:vendored-guava and starting to use it.
>>> > It sounds
>>> >         like there is generally unanimous support for that much,
>>> > anyhow.
>>> >
>>> >         Kenn
>>> >
>>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>> > <aljoscha@apache.org
>>> >         <ma...@apache.org>> wrote:
>>> >
>>> >             Thanks Ismaël for bringing up this discussion again!
>>> >
>>> >             I would be in favour of 1) and more specifically of 1a)
>>> >
>>> >             Aljoscha
>>> >
>>> >
>>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <lcwik@google.com
>>> >>             <ma...@google.com>> wrote:
>>> >>
>>> >>             You can always run tests on post shaded artifacts instead
>>> >> of the
>>> >>             compiled classes, it just requires us to change our maven
>>> >> surefire
>>> >>             / gradle test configurations but it is true that most IDEs
>>> >> would
>>> >>             behave better with a dependency jar unless you delegate
>>> >> all the
>>> >>             build/test actions to the build system and then it won't
>>> >> matter.
>>> >>
>>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>> >> <klk@google.com
>>> >>             <ma...@google.com>> wrote:
>>> >>
>>> >>                 There's also, with additional overhead,
>>> >>
>>> >>                 1a) A relocated and shipped package for each thing we
>>> >> want to
>>> >>                 relocate. I think this has also been tried outside
>>> >> Beam...
>>> >>
>>> >>                 Pros:
>>> >>                 * all the pros of 1) plus no bloat beyond what is
>>> >> necessary
>>> >>                 Cons:
>>> >>                 * abandons whitelist approach for public deps,
>>> >> reverting to
>>> >>                 blacklist approach for trouble things like guava, so a
>>> >> bit
>>> >>                 less principled
>>> >>
>>> >>                 For both 1) and 1a) I would add:
>>> >>
>>> >>                 Pros:
>>> >>                 * clearly readable dependency since code will `import
>>> >>                 org.apache.beam.private.guava21` and IDEs will
>>> >> understand it
>>> >>                 is a distinct lilbrary
>>> >>                 * can run tests on unpackaged classes, as long as deps
>>> >> are
>>> >>                 shaded or provided as jars
>>> >>                 * no mysterious action at a distance from inherited
>>> >> configuration
>>> >>                 Cons:
>>> >>                 * need to adjust imports
>>> >>
>>> >>                 Kenn
>>> >>
>>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>> >> <lcwik@google.com
>>> >>                 <ma...@google.com>> wrote:
>>> >>
>>> >>                     I would suggest that either we use:
>>> >>                     1) A common deps package containing shaded
>>> >> dependencies
>>> >>                     allows for
>>> >>                     Pros
>>> >>                     * doesn't require the user to build an uber jar
>>> >>                     Risks
>>> >>                     * dependencies package will keep growing even if
>>> >> something
>>> >>                     is or isn't needed by all of Apache Beam leading
>>> >> to a
>>> >>                     large jar anyways negating any space savings
>>> >>
>>> >>                     2) Shade within each module to a common location
>>> >> like
>>> >>                     org.apache.beam.relocated.guava....
>>> >>                     Pros
>>> >>                     * you only get the shaded dependencies of the
>>> >> things that
>>> >>                     are required
>>> >>                     * its one less dependency for users to manage
>>> >>                     Risks
>>> >>                     * requires an uber jar to be built to get the
>>> >> space
>>> >>                     savings (either by a user or a distribution of
>>> >> Apache
>>> >>                     Beam) otherwise we negate any space savings.
>>> >>
>>> >>                     If we either use a common relocation scheme or a
>>> >>                     dependencies jar then each relocation should
>>> >> specifically
>>> >>                     contain the version number of the package because
>>> >> we would
>>> >>                     like to allow for us to be using two different
>>> >> versions of
>>> >>                     the same library.
>>> >>
>>> >>                     For the common deps package approach, should we
>>> >> check in
>>> >>                     code where the imports contain the relocated
>>> >> location
>>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>>> >>
>>> >> <http://org.apache.beam.guava.20.0.com/>.google.common.collect.ImmutableList)?
>>> >>
>>> >>
>>> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
>>> >> Onofré
>>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
>>> >>
>>> >>                         Thanks for bringing that back.
>>> >>
>>> >>                         Indeed guava is shaded in different uber-jar.
>>> >> Maybe we
>>> >>                         can have a common deps module that we include
>>> >> once
>>> >>                         (but the user will have to explicitly define
>>> >> the dep) ?
>>> >>
>>> >>                         Basically, what do you propose for protobuf
>>> >>                         (unfortunately, I don't see an obvious) ?
>>> >>
>>> >>                         Regards
>>> >>                         JB
>>> >>
>>> >>
>>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>> >>
>>> >>                             Hello, I wanted to bring back this subject
>>> >> because
>>> >>                             I think we should
>>> >>                             take action on this and at least first
>>> >> have a
>>> >>                             shaded version of guava.
>>> >>                             I was playing with a toy project and I did
>>> >> the
>>> >>                             procedure we use to
>>> >>                             submit jars to a Hadoop cluster via
>>> >> Flink/Spark
>>> >>                             which involves
>>> >>                             creating an uber jar and I realized that
>>> >> the size
>>> >>                             of the jar was way
>>> >>                             bigger than I expected, and the fact that
>>> >> we shade
>>> >>                             guava in every
>>> >>                             module contributes to this. I found guava
>>> >> shaded on:
>>> >>
>>> >>                             sdks/java/core
>>> >>                             runners/core-construction-java
>>> >>                             runners/core-java
>>> >>                             model/job-management
>>> >>                             runners/spark
>>> >>                             sdks/java/io/hadoop-file-system
>>> >>                             sdks/java/io/kafka
>>> >>
>>> >>                             This means at least 6 times more of the
>>> >> size it
>>> >>                             should which counts in
>>> >>                             around 15MB more (2.4MB*6 deps) of extra
>>> >> weight
>>> >>                             that we can simply
>>> >>                             reduce by using a shaded version of the
>>> >> library.
>>> >>
>>> >>                             I add this point to the previous ones
>>> >> mentioned
>>> >>                             during the discussion
>>> >>                             because this goes against the end user
>>> >> experience
>>> >>                             and affects us all
>>> >>                             (devs/users).
>>> >>
>>> >>                             Another question is if we should shade
>>> >> (and how)
>>> >>                             protocol buffers
>>> >>                             because now with the portability work we
>>> >> are
>>> >>                             exposing it closer to the
>>> >>                             end users. I say this because I also found
>>> >> an
>>> >>                             issue while running a
>>> >>                             job on YARN with the spark runner because
>>> >>                             hadoop-common includes
>>> >>                             protobuf-java 2 and I had to explicitly
>>> >> provide
>>> >>                             protocol-buffers 3 as
>>> >>                             a dependency to be able to use triggers
>>> >> (note the
>>> >>                             Spark runner
>>> >>                             translates them using some method from
>>> >>                             runners/core-java). Since
>>> >>                             hadoop-common is provided in the cluster
>>> >> with the
>>> >>                             older version of
>>> >>                             protobuf, I am afraid that this will bite
>>> >> us in
>>> >>                             the future.
>>> >>
>>> >>                             Ismaël
>>> >>
>>> >>                             ps. There is already a JIRA for that
>>> >> shading for
>>> >>                             protobuf on
>>> >>                             hadoop-common but this is not coming until
>>> >> version
>>> >>                             3 is out.
>>> >>
>>> >> https://issues.apache.org/jira/browse/HADOOP-13136
>>> >>
>>> >> <https://issues.apache.org/jira/browse/HADOOP-13136>
>>> >>
>>> >>                             ps2. Extra curious situation is to see
>>> >> that the
>>> >>                             dataflow-runner ends
>>> >>                             up having guava shaded twice via its
>>> >> shaded version on
>>> >>                             core-construction-java.
>>> >>
>>> >>                             ps3. Of course this message means a
>>> >> de-facto +1 at
>>> >>                             least to do it for
>>> >>                             guava and evaluate it for other libs.
>>> >>
>>> >>
>>> >>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz
>>> >> Cwik
>>> >>                             <lcwik@google.com.invalid
>>> >>                             <ma...@google.com.invalid>> wrote:
>>> >>
>>> >>                                 An issue to call out is how to deal
>>> >> with our
>>> >>                                 generated code (.avro and
>>> >>                                 .proto) as I don't believe those
>>> >> plugins allow
>>> >>                                 you to generate code using a
>>> >>                                 shaded package prefix on imports.
>>> >>
>>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
>>> >> Thomas Groh
>>> >>                                 <tgroh@google.com.invalid
>>> >>                                 <ma...@google.com.invalid>>
>>> >>                                 wrote:
>>> >>
>>> >>                                     +1 to the goal. I'm hugely in
>>> >> favor of not
>>> >>                                     doing the same shading work
>>> >>                                     every time for dependencies we
>>> >> know we'll use.
>>> >>
>>> >>                                     This also means that if we end up
>>> >> pulling
>>> >>                                     in transitive dependencies we
>>> >>                                     don't want in any particular
>>> >> module we can
>>> >>                                     avoid having to adjust our
>>> >>                                     repackaging strategy for that
>>> >> module -
>>> >>                                     which I have run into face-first
>>> >> in
>>> >>                                     the past.
>>> >>
>>> >>                                     On Tue, Oct 17, 2017 at 9:48 AM,
>>> >> Kenneth
>>> >>                                     Knowles <klk@google.com.invalid
>>> >>                                     <ma...@google.com.invalid>>
>>> >>                                     wrote:
>>> >>
>>> >>                                         Hi all,
>>> >>
>>> >>                                         Shading is a big part of how
>>> >> we keep
>>> >>                                         our dependencies sane in Beam.
>>> >> But
>>> >>                                         downsides: shading is super
>>> >> slow,
>>> >>                                         causes massive jar bloat, and
>>> >> kind of
>>> >>                                         hard to get right because
>>> >> artifacts
>>> >>                                         and namespaces are not 1-to-1.
>>> >>
>>> >>                                         I know that some communities
>>> >>                                         distribute their own shaded
>>> >>                                         distributions of
>>> >>                                         dependencies. I had a thought
>>> >> about
>>> >>                                         doing something similar that I
>>> >> wanted
>>> >>                                         to throw out there for people
>>> >> to poke
>>> >>                                         holes in.
>>> >>
>>> >>                                         To set the scene, here is how
>>> >> I view
>>> >>                                         shading:
>>> >>
>>> >>                                           - A module has public
>>> >> dependencies
>>> >>                                         and private dependencies.
>>> >>                                           - Public deps are used for
>>> >> data
>>> >>                                         interchange; users must share
>>> >> these
>>> >>
>>> >>                                     deps.
>>> >>
>>> >>                                           - Private deps are just
>>> >>                                         functionality and can be
>>> >> hidden (in
>>> >>                                         our case,
>>> >>                                         relocated + bundled)
>>> >>                                           - It isn't necessarily that
>>> >> simple,
>>> >>                                         because public and private
>>> >> deps
>>> >>
>>> >>                                     might
>>> >>
>>> >>                                         interact in higher-order ways
>>> >>                                         ("public" is contagious)
>>> >>
>>> >>                                         Shading is an implementation
>>> >> detail of
>>> >>                                         expressing these
>>> >> characteristics.
>>> >>
>>> >>                                     We
>>> >>
>>> >>                                         use shading selectively
>>> >> because of its
>>> >>                                         downsides I mentioned above.
>>> >>
>>> >>                                         But what about this idea:
>>> >> Introduce
>>> >>                                         shaded deps as a single
>>> >> separate
>>> >>                                         artifact.
>>> >>
>>> >>                                           - sdks/java/private-deps:
>>> >> bundled
>>> >>                                         uber jar with relocated
>>> >> versions of
>>> >>                                         everything we want to shade
>>> >>
>>> >>                                           - sdks/java/core and
>>> >>                                         sdks/java/harness: no
>>> >> relocation or
>>> >>                                         bundling -
>>> >>                                         depends on
>>> >>                                         `beam-sdks-java-private-deps`
>>> >> and
>>> >>                                         imports like
>>> >>
>>> >> `org.apache.beam.sdk.private.com.google.common`
>>> >>                                         directly (this is what
>>> >>                                         they
>>> >>                                         are rewritten to
>>> >>
>>> >>                                         Some benefits
>>> >>
>>> >>                                           - much faster builds of
>>> >> other modules
>>> >>                                           - only one shaded uber jar
>>> >>                                           - rare/no rebuilds of the
>>> >> uber jar
>>> >>                                           - can use maven enforcer to
>>> >> forbid
>>> >>                                         imports like com.google.common
>>> >>                                           - configuration all in one
>>> >> place
>>> >>                                           - no automated rewriting of
>>> >> our real
>>> >>                                         code, which has led to some
>>> >> major
>>> >>                                         confusion
>>> >>                                           - easy to implement
>>> >> incrementally
>>> >>
>>> >>                                         Downsides:
>>> >>
>>> >>                                           - plenty of effort work to
>>> >> get there
>>> >>                                           - unclear how many different
>>> >> such
>>> >>                                         deps modules we need; sharing
>>> >> them
>>> >>
>>> >>                                     could
>>> >>
>>> >>                                         get weird
>>> >>                                           - if we hit a roadblock, we
>>> >> will
>>> >>                                         have committed a lot of time
>>> >>
>>> >>                                         Just something I was musing as
>>> >> I spent
>>> >>                                         another evening waiting for
>>> >> slow
>>> >>                                         builds to try to confirm
>>> >> changes to
>>> >>                                         brittle poms.
>>> >>
>>> >>                                         Kenn
>>> >>
>>> >>
>>> >>
>>> >>                         --
>>> >>                         Jean-Baptiste Onofré
>>> >>                         jbonofre@apache.org
>>> >> <ma...@apache.org>
>>> >>                         http://blog.nanthrax.net
>>> >> <http://blog.nanthrax.net/>
>>> >>                         Talend - http://www.talend.com
>>> >> <http://www.talend.com/>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Romain Manni-Bucau <rm...@gmail.com>.
a map of list is fine and not a challenge we'll face long I hope ;)


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:

> Not sure we'll be able to replace them all. Things like guava Table and
> Multimap don't have great replacements in Java8.
>
> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> +1, it was on my TODO for a while waiting the Java8 update.
>>
>> Regards
>> JB
>>
>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>> > Why not dropping guava for all beam codebase? With java 8 it is quite
>> easy to do
>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>> good result.
>> >
>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
>> > <ma...@google.com>> a écrit :
>> >
>> >     Make sure to include the guava version in the artifact name so that
>> we can
>> >     have multiple vendored versions.
>> >
>> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <klk@google.com
>> >     <ma...@google.com>> wrote:
>> >
>> >         I didn't have time for this, but it just bit me. We definitely
>> have
>> >         Guava on the API surface of runner support code in ways that get
>> >         incompatibly shaded. I will probably start "1a" by making a
>> shaded
>> >         library org.apache.beam:vendored-guava and starting to use it.
>> It sounds
>> >         like there is generally unanimous support for that much, anyhow.
>> >
>> >         Kenn
>> >
>> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <
>> aljoscha@apache.org
>> >         <ma...@apache.org>> wrote:
>> >
>> >             Thanks Ismaël for bringing up this discussion again!
>> >
>> >             I would be in favour of 1) and more specifically of 1a)
>> >
>> >             Aljoscha
>> >
>> >
>> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <lcwik@google.com
>> >>             <ma...@google.com>> wrote:
>> >>
>> >>             You can always run tests on post shaded artifacts instead
>> of the
>> >>             compiled classes, it just requires us to change our maven
>> surefire
>> >>             / gradle test configurations but it is true that most IDEs
>> would
>> >>             behave better with a dependency jar unless you delegate
>> all the
>> >>             build/test actions to the build system and then it won't
>> matter.
>> >>
>> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <
>> klk@google.com
>> >>             <ma...@google.com>> wrote:
>> >>
>> >>                 There's also, with additional overhead,
>> >>
>> >>                 1a) A relocated and shipped package for each thing we
>> want to
>> >>                 relocate. I think this has also been tried outside
>> Beam...
>> >>
>> >>                 Pros:
>> >>                 * all the pros of 1) plus no bloat beyond what is
>> necessary
>> >>                 Cons:
>> >>                 * abandons whitelist approach for public deps,
>> reverting to
>> >>                 blacklist approach for trouble things like guava, so a
>> bit
>> >>                 less principled
>> >>
>> >>                 For both 1) and 1a) I would add:
>> >>
>> >>                 Pros:
>> >>                 * clearly readable dependency since code will `import
>> >>                 org.apache.beam.private.guava21` and IDEs will
>> understand it
>> >>                 is a distinct lilbrary
>> >>                 * can run tests on unpackaged classes, as long as deps
>> are
>> >>                 shaded or provided as jars
>> >>                 * no mysterious action at a distance from inherited
>> configuration
>> >>                 Cons:
>> >>                 * need to adjust imports
>> >>
>> >>                 Kenn
>> >>
>> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <
>> lcwik@google.com
>> >>                 <ma...@google.com>> wrote:
>> >>
>> >>                     I would suggest that either we use:
>> >>                     1) A common deps package containing shaded
>> dependencies
>> >>                     allows for
>> >>                     Pros
>> >>                     * doesn't require the user to build an uber jar
>> >>                     Risks
>> >>                     * dependencies package will keep growing even if
>> something
>> >>                     is or isn't needed by all of Apache Beam leading
>> to a
>> >>                     large jar anyways negating any space savings
>> >>
>> >>                     2) Shade within each module to a common location
>> like
>> >>                     org.apache.beam.relocated.guava....
>> >>                     Pros
>> >>                     * you only get the shaded dependencies of the
>> things that
>> >>                     are required
>> >>                     * its one less dependency for users to manage
>> >>                     Risks
>> >>                     * requires an uber jar to be built to get the space
>> >>                     savings (either by a user or a distribution of
>> Apache
>> >>                     Beam) otherwise we negate any space savings.
>> >>
>> >>                     If we either use a common relocation scheme or a
>> >>                     dependencies jar then each relocation should
>> specifically
>> >>                     contain the version number of the package because
>> we would
>> >>                     like to allow for us to be using two different
>> versions of
>> >>                     the same library.
>> >>
>> >>                     For the common deps package approach, should we
>> check in
>> >>                     code where the imports contain the relocated
>> location
>> >>                     (e.g. import org.apache.beam.guava.20.0.com
>> >>                     <http://org.apache.beam.guava.20.0.com/>.google.
>> common.collect.ImmutableList)?
>> >>
>> >>
>> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
>> Onofré
>> >>                     <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
>> >>
>> >>                         Thanks for bringing that back.
>> >>
>> >>                         Indeed guava is shaded in different uber-jar.
>> Maybe we
>> >>                         can have a common deps module that we include
>> once
>> >>                         (but the user will have to explicitly define
>> the dep) ?
>> >>
>> >>                         Basically, what do you propose for protobuf
>> >>                         (unfortunately, I don't see an obvious) ?
>> >>
>> >>                         Regards
>> >>                         JB
>> >>
>> >>
>> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>> >>
>> >>                             Hello, I wanted to bring back this subject
>> because
>> >>                             I think we should
>> >>                             take action on this and at least first
>> have a
>> >>                             shaded version of guava.
>> >>                             I was playing with a toy project and I did
>> the
>> >>                             procedure we use to
>> >>                             submit jars to a Hadoop cluster via
>> Flink/Spark
>> >>                             which involves
>> >>                             creating an uber jar and I realized that
>> the size
>> >>                             of the jar was way
>> >>                             bigger than I expected, and the fact that
>> we shade
>> >>                             guava in every
>> >>                             module contributes to this. I found guava
>> shaded on:
>> >>
>> >>                             sdks/java/core
>> >>                             runners/core-construction-java
>> >>                             runners/core-java
>> >>                             model/job-management
>> >>                             runners/spark
>> >>                             sdks/java/io/hadoop-file-system
>> >>                             sdks/java/io/kafka
>> >>
>> >>                             This means at least 6 times more of the
>> size it
>> >>                             should which counts in
>> >>                             around 15MB more (2.4MB*6 deps) of extra
>> weight
>> >>                             that we can simply
>> >>                             reduce by using a shaded version of the
>> library.
>> >>
>> >>                             I add this point to the previous ones
>> mentioned
>> >>                             during the discussion
>> >>                             because this goes against the end user
>> experience
>> >>                             and affects us all
>> >>                             (devs/users).
>> >>
>> >>                             Another question is if we should shade
>> (and how)
>> >>                             protocol buffers
>> >>                             because now with the portability work we
>> are
>> >>                             exposing it closer to the
>> >>                             end users. I say this because I also found
>> an
>> >>                             issue while running a
>> >>                             job on YARN with the spark runner because
>> >>                             hadoop-common includes
>> >>                             protobuf-java 2 and I had to explicitly
>> provide
>> >>                             protocol-buffers 3 as
>> >>                             a dependency to be able to use triggers
>> (note the
>> >>                             Spark runner
>> >>                             translates them using some method from
>> >>                             runners/core-java). Since
>> >>                             hadoop-common is provided in the cluster
>> with the
>> >>                             older version of
>> >>                             protobuf, I am afraid that this will bite
>> us in
>> >>                             the future.
>> >>
>> >>                             Ismaël
>> >>
>> >>                             ps. There is already a JIRA for that
>> shading for
>> >>                             protobuf on
>> >>                             hadoop-common but this is not coming until
>> version
>> >>                             3 is out.
>> >>                             https://issues.apache.org/jir
>> a/browse/HADOOP-13136
>> >>                             <https://issues.apache.org/ji
>> ra/browse/HADOOP-13136>
>> >>
>> >>                             ps2. Extra curious situation is to see
>> that the
>> >>                             dataflow-runner ends
>> >>                             up having guava shaded twice via its
>> shaded version on
>> >>                             core-construction-java.
>> >>
>> >>                             ps3. Of course this message means a
>> de-facto +1 at
>> >>                             least to do it for
>> >>                             guava and evaluate it for other libs.
>> >>
>> >>
>> >>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz
>> Cwik
>> >>                             <lcwik@google.com.invalid
>> >>                             <ma...@google.com.invalid>> wrote:
>> >>
>> >>                                 An issue to call out is how to deal
>> with our
>> >>                                 generated code (.avro and
>> >>                                 .proto) as I don't believe those
>> plugins allow
>> >>                                 you to generate code using a
>> >>                                 shaded package prefix on imports.
>> >>
>> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
>> Thomas Groh
>> >>                                 <tgroh@google.com.invalid
>> >>                                 <ma...@google.com.invalid>>
>> >>                                 wrote:
>> >>
>> >>                                     +1 to the goal. I'm hugely in
>> favor of not
>> >>                                     doing the same shading work
>> >>                                     every time for dependencies we
>> know we'll use.
>> >>
>> >>                                     This also means that if we end up
>> pulling
>> >>                                     in transitive dependencies we
>> >>                                     don't want in any particular
>> module we can
>> >>                                     avoid having to adjust our
>> >>                                     repackaging strategy for that
>> module -
>> >>                                     which I have run into face-first in
>> >>                                     the past.
>> >>
>> >>                                     On Tue, Oct 17, 2017 at 9:48 AM,
>> Kenneth
>> >>                                     Knowles <klk@google.com.invalid
>> >>                                     <ma...@google.com.invalid>>
>> >>                                     wrote:
>> >>
>> >>                                         Hi all,
>> >>
>> >>                                         Shading is a big part of how
>> we keep
>> >>                                         our dependencies sane in Beam.
>> But
>> >>                                         downsides: shading is super
>> slow,
>> >>                                         causes massive jar bloat, and
>> kind of
>> >>                                         hard to get right because
>> artifacts
>> >>                                         and namespaces are not 1-to-1.
>> >>
>> >>                                         I know that some communities
>> >>                                         distribute their own shaded
>> >>                                         distributions of
>> >>                                         dependencies. I had a thought
>> about
>> >>                                         doing something similar that I
>> wanted
>> >>                                         to throw out there for people
>> to poke
>> >>                                         holes in.
>> >>
>> >>                                         To set the scene, here is how
>> I view
>> >>                                         shading:
>> >>
>> >>                                           - A module has public
>> dependencies
>> >>                                         and private dependencies.
>> >>                                           - Public deps are used for
>> data
>> >>                                         interchange; users must share
>> these
>> >>
>> >>                                     deps.
>> >>
>> >>                                           - Private deps are just
>> >>                                         functionality and can be
>> hidden (in
>> >>                                         our case,
>> >>                                         relocated + bundled)
>> >>                                           - It isn't necessarily that
>> simple,
>> >>                                         because public and private deps
>> >>
>> >>                                     might
>> >>
>> >>                                         interact in higher-order ways
>> >>                                         ("public" is contagious)
>> >>
>> >>                                         Shading is an implementation
>> detail of
>> >>                                         expressing these
>> characteristics.
>> >>
>> >>                                     We
>> >>
>> >>                                         use shading selectively
>> because of its
>> >>                                         downsides I mentioned above.
>> >>
>> >>                                         But what about this idea:
>> Introduce
>> >>                                         shaded deps as a single
>> separate
>> >>                                         artifact.
>> >>
>> >>                                           - sdks/java/private-deps:
>> bundled
>> >>                                         uber jar with relocated
>> versions of
>> >>                                         everything we want to shade
>> >>
>> >>                                           - sdks/java/core and
>> >>                                         sdks/java/harness: no
>> relocation or
>> >>                                         bundling -
>> >>                                         depends on
>> >>                                         `beam-sdks-java-private-deps`
>> and
>> >>                                         imports like
>> >>                                         `org.apache.beam.sdk.private.
>> com.google.common`
>> >>                                         directly (this is what
>> >>                                         they
>> >>                                         are rewritten to
>> >>
>> >>                                         Some benefits
>> >>
>> >>                                           - much faster builds of
>> other modules
>> >>                                           - only one shaded uber jar
>> >>                                           - rare/no rebuilds of the
>> uber jar
>> >>                                           - can use maven enforcer to
>> forbid
>> >>                                         imports like com.google.common
>> >>                                           - configuration all in one
>> place
>> >>                                           - no automated rewriting of
>> our real
>> >>                                         code, which has led to some
>> major
>> >>                                         confusion
>> >>                                           - easy to implement
>> incrementally
>> >>
>> >>                                         Downsides:
>> >>
>> >>                                           - plenty of effort work to
>> get there
>> >>                                           - unclear how many different
>> such
>> >>                                         deps modules we need; sharing
>> them
>> >>
>> >>                                     could
>> >>
>> >>                                         get weird
>> >>                                           - if we hit a roadblock, we
>> will
>> >>                                         have committed a lot of time
>> >>
>> >>                                         Just something I was musing as
>> I spent
>> >>                                         another evening waiting for
>> slow
>> >>                                         builds to try to confirm
>> changes to
>> >>                                         brittle poms.
>> >>
>> >>                                         Kenn
>> >>
>> >>
>> >>
>> >>                         --
>> >>                         Jean-Baptiste Onofré
>> >>                         jbonofre@apache.org <mailto:
>> jbonofre@apache.org>
>> >>                         http://blog.nanthrax.net <
>> http://blog.nanthrax.net/>
>> >>                         Talend - http://www.talend.com <
>> http://www.talend.com/>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Reuven Lax <re...@google.com>.
Not sure we'll be able to replace them all. Things like guava Table and
Multimap don't have great replacements in Java8.

On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> +1, it was on my TODO for a while waiting the Java8 update.
>
> Regards
> JB
>
> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
> > Why not dropping guava for all beam codebase? With java 8 it is quite
> easy to do
> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
> good result.
> >
> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
> > <ma...@google.com>> a écrit :
> >
> >     Make sure to include the guava version in the artifact name so that
> we can
> >     have multiple vendored versions.
> >
> >     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <klk@google.com
> >     <ma...@google.com>> wrote:
> >
> >         I didn't have time for this, but it just bit me. We definitely
> have
> >         Guava on the API surface of runner support code in ways that get
> >         incompatibly shaded. I will probably start "1a" by making a
> shaded
> >         library org.apache.beam:vendored-guava and starting to use it.
> It sounds
> >         like there is generally unanimous support for that much, anyhow.
> >
> >         Kenn
> >
> >         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <
> aljoscha@apache.org
> >         <ma...@apache.org>> wrote:
> >
> >             Thanks Ismaël for bringing up this discussion again!
> >
> >             I would be in favour of 1) and more specifically of 1a)
> >
> >             Aljoscha
> >
> >
> >>             On 12. Dec 2017, at 18:56, Lukasz Cwik <lcwik@google.com
> >>             <ma...@google.com>> wrote:
> >>
> >>             You can always run tests on post shaded artifacts instead
> of the
> >>             compiled classes, it just requires us to change our maven
> surefire
> >>             / gradle test configurations but it is true that most IDEs
> would
> >>             behave better with a dependency jar unless you delegate all
> the
> >>             build/test actions to the build system and then it won't
> matter.
> >>
> >>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <
> klk@google.com
> >>             <ma...@google.com>> wrote:
> >>
> >>                 There's also, with additional overhead,
> >>
> >>                 1a) A relocated and shipped package for each thing we
> want to
> >>                 relocate. I think this has also been tried outside
> Beam...
> >>
> >>                 Pros:
> >>                 * all the pros of 1) plus no bloat beyond what is
> necessary
> >>                 Cons:
> >>                 * abandons whitelist approach for public deps,
> reverting to
> >>                 blacklist approach for trouble things like guava, so a
> bit
> >>                 less principled
> >>
> >>                 For both 1) and 1a) I would add:
> >>
> >>                 Pros:
> >>                 * clearly readable dependency since code will `import
> >>                 org.apache.beam.private.guava21` and IDEs will
> understand it
> >>                 is a distinct lilbrary
> >>                 * can run tests on unpackaged classes, as long as deps
> are
> >>                 shaded or provided as jars
> >>                 * no mysterious action at a distance from inherited
> configuration
> >>                 Cons:
> >>                 * need to adjust imports
> >>
> >>                 Kenn
> >>
> >>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <
> lcwik@google.com
> >>                 <ma...@google.com>> wrote:
> >>
> >>                     I would suggest that either we use:
> >>                     1) A common deps package containing shaded
> dependencies
> >>                     allows for
> >>                     Pros
> >>                     * doesn't require the user to build an uber jar
> >>                     Risks
> >>                     * dependencies package will keep growing even if
> something
> >>                     is or isn't needed by all of Apache Beam leading to
> a
> >>                     large jar anyways negating any space savings
> >>
> >>                     2) Shade within each module to a common location
> like
> >>                     org.apache.beam.relocated.guava....
> >>                     Pros
> >>                     * you only get the shaded dependencies of the
> things that
> >>                     are required
> >>                     * its one less dependency for users to manage
> >>                     Risks
> >>                     * requires an uber jar to be built to get the space
> >>                     savings (either by a user or a distribution of
> Apache
> >>                     Beam) otherwise we negate any space savings.
> >>
> >>                     If we either use a common relocation scheme or a
> >>                     dependencies jar then each relocation should
> specifically
> >>                     contain the version number of the package because
> we would
> >>                     like to allow for us to be using two different
> versions of
> >>                     the same library.
> >>
> >>                     For the common deps package approach, should we
> check in
> >>                     code where the imports contain the relocated
> location
> >>                     (e.g. import org.apache.beam.guava.20.0.com
> >>                     <http://org.apache.beam.guava.20.0.com/
> >.google.common.collect.ImmutableList)?
> >>
> >>
> >>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste
> Onofré
> >>                     <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
> >>
> >>                         Thanks for bringing that back.
> >>
> >>                         Indeed guava is shaded in different uber-jar.
> Maybe we
> >>                         can have a common deps module that we include
> once
> >>                         (but the user will have to explicitly define
> the dep) ?
> >>
> >>                         Basically, what do you propose for protobuf
> >>                         (unfortunately, I don't see an obvious) ?
> >>
> >>                         Regards
> >>                         JB
> >>
> >>
> >>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
> >>
> >>                             Hello, I wanted to bring back this subject
> because
> >>                             I think we should
> >>                             take action on this and at least first have
> a
> >>                             shaded version of guava.
> >>                             I was playing with a toy project and I did
> the
> >>                             procedure we use to
> >>                             submit jars to a Hadoop cluster via
> Flink/Spark
> >>                             which involves
> >>                             creating an uber jar and I realized that
> the size
> >>                             of the jar was way
> >>                             bigger than I expected, and the fact that
> we shade
> >>                             guava in every
> >>                             module contributes to this. I found guava
> shaded on:
> >>
> >>                             sdks/java/core
> >>                             runners/core-construction-java
> >>                             runners/core-java
> >>                             model/job-management
> >>                             runners/spark
> >>                             sdks/java/io/hadoop-file-system
> >>                             sdks/java/io/kafka
> >>
> >>                             This means at least 6 times more of the
> size it
> >>                             should which counts in
> >>                             around 15MB more (2.4MB*6 deps) of extra
> weight
> >>                             that we can simply
> >>                             reduce by using a shaded version of the
> library.
> >>
> >>                             I add this point to the previous ones
> mentioned
> >>                             during the discussion
> >>                             because this goes against the end user
> experience
> >>                             and affects us all
> >>                             (devs/users).
> >>
> >>                             Another question is if we should shade (and
> how)
> >>                             protocol buffers
> >>                             because now with the portability work we are
> >>                             exposing it closer to the
> >>                             end users. I say this because I also found
> an
> >>                             issue while running a
> >>                             job on YARN with the spark runner because
> >>                             hadoop-common includes
> >>                             protobuf-java 2 and I had to explicitly
> provide
> >>                             protocol-buffers 3 as
> >>                             a dependency to be able to use triggers
> (note the
> >>                             Spark runner
> >>                             translates them using some method from
> >>                             runners/core-java). Since
> >>                             hadoop-common is provided in the cluster
> with the
> >>                             older version of
> >>                             protobuf, I am afraid that this will bite
> us in
> >>                             the future.
> >>
> >>                             Ismaël
> >>
> >>                             ps. There is already a JIRA for that
> shading for
> >>                             protobuf on
> >>                             hadoop-common but this is not coming until
> version
> >>                             3 is out.
> >>                             https://issues.apache.org/
> jira/browse/HADOOP-13136
> >>                             <https://issues.apache.org/
> jira/browse/HADOOP-13136>
> >>
> >>                             ps2. Extra curious situation is to see that
> the
> >>                             dataflow-runner ends
> >>                             up having guava shaded twice via its shaded
> version on
> >>                             core-construction-java.
> >>
> >>                             ps3. Of course this message means a
> de-facto +1 at
> >>                             least to do it for
> >>                             guava and evaluate it for other libs.
> >>
> >>
> >>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik
> >>                             <lcwik@google.com.invalid
> >>                             <ma...@google.com.invalid>> wrote:
> >>
> >>                                 An issue to call out is how to deal
> with our
> >>                                 generated code (.avro and
> >>                                 .proto) as I don't believe those
> plugins allow
> >>                                 you to generate code using a
> >>                                 shaded package prefix on imports.
> >>
> >>                                 On Tue, Oct 17, 2017 at 10:28 AM,
> Thomas Groh
> >>                                 <tgroh@google.com.invalid
> >>                                 <ma...@google.com.invalid>>
> >>                                 wrote:
> >>
> >>                                     +1 to the goal. I'm hugely in favor
> of not
> >>                                     doing the same shading work
> >>                                     every time for dependencies we know
> we'll use.
> >>
> >>                                     This also means that if we end up
> pulling
> >>                                     in transitive dependencies we
> >>                                     don't want in any particular module
> we can
> >>                                     avoid having to adjust our
> >>                                     repackaging strategy for that
> module -
> >>                                     which I have run into face-first in
> >>                                     the past.
> >>
> >>                                     On Tue, Oct 17, 2017 at 9:48 AM,
> Kenneth
> >>                                     Knowles <klk@google.com.invalid
> >>                                     <ma...@google.com.invalid>>
> >>                                     wrote:
> >>
> >>                                         Hi all,
> >>
> >>                                         Shading is a big part of how we
> keep
> >>                                         our dependencies sane in Beam.
> But
> >>                                         downsides: shading is super
> slow,
> >>                                         causes massive jar bloat, and
> kind of
> >>                                         hard to get right because
> artifacts
> >>                                         and namespaces are not 1-to-1.
> >>
> >>                                         I know that some communities
> >>                                         distribute their own shaded
> >>                                         distributions of
> >>                                         dependencies. I had a thought
> about
> >>                                         doing something similar that I
> wanted
> >>                                         to throw out there for people
> to poke
> >>                                         holes in.
> >>
> >>                                         To set the scene, here is how I
> view
> >>                                         shading:
> >>
> >>                                           - A module has public
> dependencies
> >>                                         and private dependencies.
> >>                                           - Public deps are used for
> data
> >>                                         interchange; users must share
> these
> >>
> >>                                     deps.
> >>
> >>                                           - Private deps are just
> >>                                         functionality and can be hidden
> (in
> >>                                         our case,
> >>                                         relocated + bundled)
> >>                                           - It isn't necessarily that
> simple,
> >>                                         because public and private deps
> >>
> >>                                     might
> >>
> >>                                         interact in higher-order ways
> >>                                         ("public" is contagious)
> >>
> >>                                         Shading is an implementation
> detail of
> >>                                         expressing these
> characteristics.
> >>
> >>                                     We
> >>
> >>                                         use shading selectively because
> of its
> >>                                         downsides I mentioned above.
> >>
> >>                                         But what about this idea:
> Introduce
> >>                                         shaded deps as a single separate
> >>                                         artifact.
> >>
> >>                                           - sdks/java/private-deps:
> bundled
> >>                                         uber jar with relocated
> versions of
> >>                                         everything we want to shade
> >>
> >>                                           - sdks/java/core and
> >>                                         sdks/java/harness: no
> relocation or
> >>                                         bundling -
> >>                                         depends on
> >>                                         `beam-sdks-java-private-deps`
> and
> >>                                         imports like
> >>                                         `org.apache.beam.sdk.private.
> com.google.common`
> >>                                         directly (this is what
> >>                                         they
> >>                                         are rewritten to
> >>
> >>                                         Some benefits
> >>
> >>                                           - much faster builds of other
> modules
> >>                                           - only one shaded uber jar
> >>                                           - rare/no rebuilds of the
> uber jar
> >>                                           - can use maven enforcer to
> forbid
> >>                                         imports like com.google.common
> >>                                           - configuration all in one
> place
> >>                                           - no automated rewriting of
> our real
> >>                                         code, which has led to some
> major
> >>                                         confusion
> >>                                           - easy to implement
> incrementally
> >>
> >>                                         Downsides:
> >>
> >>                                           - plenty of effort work to
> get there
> >>                                           - unclear how many different
> such
> >>                                         deps modules we need; sharing
> them
> >>
> >>                                     could
> >>
> >>                                         get weird
> >>                                           - if we hit a roadblock, we
> will
> >>                                         have committed a lot of time
> >>
> >>                                         Just something I was musing as
> I spent
> >>                                         another evening waiting for slow
> >>                                         builds to try to confirm
> changes to
> >>                                         brittle poms.
> >>
> >>                                         Kenn
> >>
> >>
> >>
> >>                         --
> >>                         Jean-Baptiste Onofré
> >>                         jbonofre@apache.org <mailto:jbonofre@apache.org
> >
> >>                         http://blog.nanthrax.net <
> http://blog.nanthrax.net/>
> >>                         Talend - http://www.talend.com <
> http://www.talend.com/>
> >>
> >>
> >>
> >>
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1, it was on my TODO for a while waiting the Java8 update.

Regards
JB

On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
> Why not dropping guava for all beam codebase? With java 8 it is quite easy to do
> it and avoid a bunch of conflicts. Did it in 2 projects with quite a good result.
> 
> Le 1 févr. 2018 06:50, "Lukasz Cwik" <lcwik@google.com
> <ma...@google.com>> a écrit :
> 
>     Make sure to include the guava version in the artifact name so that we can
>     have multiple vendored versions.
> 
>     On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <klk@google.com
>     <ma...@google.com>> wrote:
> 
>         I didn't have time for this, but it just bit me. We definitely have
>         Guava on the API surface of runner support code in ways that get
>         incompatibly shaded. I will probably start "1a" by making a shaded
>         library org.apache.beam:vendored-guava and starting to use it. It sounds
>         like there is generally unanimous support for that much, anyhow.
> 
>         Kenn
> 
>         On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <aljoscha@apache.org
>         <ma...@apache.org>> wrote:
> 
>             Thanks Ismaël for bringing up this discussion again!
> 
>             I would be in favour of 1) and more specifically of 1a)
> 
>             Aljoscha
> 
> 
>>             On 12. Dec 2017, at 18:56, Lukasz Cwik <lcwik@google.com
>>             <ma...@google.com>> wrote:
>>
>>             You can always run tests on post shaded artifacts instead of the
>>             compiled classes, it just requires us to change our maven surefire
>>             / gradle test configurations but it is true that most IDEs would
>>             behave better with a dependency jar unless you delegate all the
>>             build/test actions to the build system and then it won't matter.
>>
>>             On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <klk@google.com
>>             <ma...@google.com>> wrote:
>>
>>                 There's also, with additional overhead,
>>
>>                 1a) A relocated and shipped package for each thing we want to
>>                 relocate. I think this has also been tried outside Beam...
>>
>>                 Pros:
>>                 * all the pros of 1) plus no bloat beyond what is necessary
>>                 Cons:
>>                 * abandons whitelist approach for public deps, reverting to
>>                 blacklist approach for trouble things like guava, so a bit
>>                 less principled
>>
>>                 For both 1) and 1a) I would add:
>>
>>                 Pros:
>>                 * clearly readable dependency since code will `import
>>                 org.apache.beam.private.guava21` and IDEs will understand it
>>                 is a distinct lilbrary
>>                 * can run tests on unpackaged classes, as long as deps are
>>                 shaded or provided as jars
>>                 * no mysterious action at a distance from inherited configuration
>>                 Cons:
>>                 * need to adjust imports
>>
>>                 Kenn
>>
>>                 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lcwik@google.com
>>                 <ma...@google.com>> wrote:
>>
>>                     I would suggest that either we use:
>>                     1) A common deps package containing shaded dependencies
>>                     allows for 
>>                     Pros
>>                     * doesn't require the user to build an uber jar
>>                     Risks
>>                     * dependencies package will keep growing even if something
>>                     is or isn't needed by all of Apache Beam leading to a
>>                     large jar anyways negating any space savings
>>
>>                     2) Shade within each module to a common location like
>>                     org.apache.beam.relocated.guava....
>>                     Pros
>>                     * you only get the shaded dependencies of the things that
>>                     are required
>>                     * its one less dependency for users to manage
>>                     Risks
>>                     * requires an uber jar to be built to get the space
>>                     savings (either by a user or a distribution of Apache
>>                     Beam) otherwise we negate any space savings.
>>
>>                     If we either use a common relocation scheme or a
>>                     dependencies jar then each relocation should specifically
>>                     contain the version number of the package because we would
>>                     like to allow for us to be using two different versions of
>>                     the same library.
>>
>>                     For the common deps package approach, should we check in
>>                     code where the imports contain the relocated location
>>                     (e.g. import org.apache.beam.guava.20.0.com
>>                     <http://org.apache.beam.guava.20.0.com/>.google.common.collect.ImmutableList)?
>>
>>
>>                     On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré
>>                     <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
>>
>>                         Thanks for bringing that back.
>>
>>                         Indeed guava is shaded in different uber-jar. Maybe we
>>                         can have a common deps module that we include once
>>                         (but the user will have to explicitly define the dep) ?
>>
>>                         Basically, what do you propose for protobuf
>>                         (unfortunately, I don't see an obvious) ?
>>
>>                         Regards
>>                         JB
>>
>>
>>                         On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>
>>                             Hello, I wanted to bring back this subject because
>>                             I think we should
>>                             take action on this and at least first have a
>>                             shaded version of guava.
>>                             I was playing with a toy project and I did the
>>                             procedure we use to
>>                             submit jars to a Hadoop cluster via Flink/Spark
>>                             which involves
>>                             creating an uber jar and I realized that the size
>>                             of the jar was way
>>                             bigger than I expected, and the fact that we shade
>>                             guava in every
>>                             module contributes to this. I found guava shaded on:
>>
>>                             sdks/java/core
>>                             runners/core-construction-java
>>                             runners/core-java
>>                             model/job-management
>>                             runners/spark
>>                             sdks/java/io/hadoop-file-system
>>                             sdks/java/io/kafka
>>
>>                             This means at least 6 times more of the size it
>>                             should which counts in
>>                             around 15MB more (2.4MB*6 deps) of extra weight
>>                             that we can simply
>>                             reduce by using a shaded version of the library.
>>
>>                             I add this point to the previous ones mentioned
>>                             during the discussion
>>                             because this goes against the end user experience
>>                             and affects us all
>>                             (devs/users).
>>
>>                             Another question is if we should shade (and how)
>>                             protocol buffers
>>                             because now with the portability work we are
>>                             exposing it closer to the
>>                             end users. I say this because I also found an
>>                             issue while running a
>>                             job on YARN with the spark runner because
>>                             hadoop-common includes
>>                             protobuf-java 2 and I had to explicitly provide
>>                             protocol-buffers 3 as
>>                             a dependency to be able to use triggers (note the
>>                             Spark runner
>>                             translates them using some method from
>>                             runners/core-java). Since
>>                             hadoop-common is provided in the cluster with the
>>                             older version of
>>                             protobuf, I am afraid that this will bite us in
>>                             the future.
>>
>>                             Ismaël
>>
>>                             ps. There is already a JIRA for that shading for
>>                             protobuf on
>>                             hadoop-common but this is not coming until version
>>                             3 is out.
>>                             https://issues.apache.org/jira/browse/HADOOP-13136
>>                             <https://issues.apache.org/jira/browse/HADOOP-13136>
>>
>>                             ps2. Extra curious situation is to see that the
>>                             dataflow-runner ends
>>                             up having guava shaded twice via its shaded version on
>>                             core-construction-java.
>>
>>                             ps3. Of course this message means a de-facto +1 at
>>                             least to do it for
>>                             guava and evaluate it for other libs.
>>
>>
>>                             On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik
>>                             <lcwik@google.com.invalid
>>                             <ma...@google.com.invalid>> wrote:
>>
>>                                 An issue to call out is how to deal with our
>>                                 generated code (.avro and
>>                                 .proto) as I don't believe those plugins allow
>>                                 you to generate code using a
>>                                 shaded package prefix on imports.
>>
>>                                 On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh
>>                                 <tgroh@google.com.invalid
>>                                 <ma...@google.com.invalid>>
>>                                 wrote:
>>
>>                                     +1 to the goal. I'm hugely in favor of not
>>                                     doing the same shading work
>>                                     every time for dependencies we know we'll use.
>>
>>                                     This also means that if we end up pulling
>>                                     in transitive dependencies we
>>                                     don't want in any particular module we can
>>                                     avoid having to adjust our
>>                                     repackaging strategy for that module -
>>                                     which I have run into face-first in
>>                                     the past.
>>
>>                                     On Tue, Oct 17, 2017 at 9:48 AM, Kenneth
>>                                     Knowles <klk@google.com.invalid
>>                                     <ma...@google.com.invalid>>
>>                                     wrote:
>>
>>                                         Hi all,
>>
>>                                         Shading is a big part of how we keep
>>                                         our dependencies sane in Beam. But
>>                                         downsides: shading is super slow,
>>                                         causes massive jar bloat, and kind of
>>                                         hard to get right because artifacts
>>                                         and namespaces are not 1-to-1.
>>
>>                                         I know that some communities
>>                                         distribute their own shaded
>>                                         distributions of
>>                                         dependencies. I had a thought about
>>                                         doing something similar that I wanted
>>                                         to throw out there for people to poke
>>                                         holes in.
>>
>>                                         To set the scene, here is how I view
>>                                         shading:
>>
>>                                           - A module has public dependencies
>>                                         and private dependencies.
>>                                           - Public deps are used for data
>>                                         interchange; users must share these
>>
>>                                     deps.
>>
>>                                           - Private deps are just
>>                                         functionality and can be hidden (in
>>                                         our case,
>>                                         relocated + bundled)
>>                                           - It isn't necessarily that simple,
>>                                         because public and private deps
>>
>>                                     might
>>
>>                                         interact in higher-order ways
>>                                         ("public" is contagious)
>>
>>                                         Shading is an implementation detail of
>>                                         expressing these characteristics.
>>
>>                                     We
>>
>>                                         use shading selectively because of its
>>                                         downsides I mentioned above.
>>
>>                                         But what about this idea: Introduce
>>                                         shaded deps as a single separate
>>                                         artifact.
>>
>>                                           - sdks/java/private-deps: bundled
>>                                         uber jar with relocated versions of
>>                                         everything we want to shade
>>
>>                                           - sdks/java/core and
>>                                         sdks/java/harness: no relocation or
>>                                         bundling -
>>                                         depends on
>>                                         `beam-sdks-java-private-deps` and
>>                                         imports like
>>                                         `org.apache.beam.sdk.private.com.google.common`
>>                                         directly (this is what
>>                                         they
>>                                         are rewritten to
>>
>>                                         Some benefits
>>
>>                                           - much faster builds of other modules
>>                                           - only one shaded uber jar
>>                                           - rare/no rebuilds of the uber jar
>>                                           - can use maven enforcer to forbid
>>                                         imports like com.google.common
>>                                           - configuration all in one place
>>                                           - no automated rewriting of our real
>>                                         code, which has led to some major
>>                                         confusion
>>                                           - easy to implement incrementally
>>
>>                                         Downsides:
>>
>>                                           - plenty of effort work to get there
>>                                           - unclear how many different such
>>                                         deps modules we need; sharing them
>>
>>                                     could
>>
>>                                         get weird
>>                                           - if we hit a roadblock, we will
>>                                         have committed a lot of time
>>
>>                                         Just something I was musing as I spent
>>                                         another evening waiting for slow
>>                                         builds to try to confirm changes to
>>                                         brittle poms.
>>
>>                                         Kenn
>>
>>
>>
>>                         -- 
>>                         Jean-Baptiste Onofré
>>                         jbonofre@apache.org <ma...@apache.org>
>>                         http://blog.nanthrax.net <http://blog.nanthrax.net/>
>>                         Talend - http://www.talend.com <http://www.talend.com/>
>>
>>
>>
>>
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Lukasz Cwik <lc...@google.com>.
That is an even better idea. A lot of guava constructs have been superseded
by stuff that was added to Java 8.

On Wed, Jan 31, 2018 at 9:56 PM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Why not dropping guava for all beam codebase? With java 8 it is quite easy
> to do it and avoid a bunch of conflicts. Did it in 2 projects with quite a
> good result.
>
> Le 1 févr. 2018 06:50, "Lukasz Cwik" <lc...@google.com> a écrit :
>
>> Make sure to include the guava version in the artifact name so that we
>> can have multiple vendored versions.
>>
>> On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <kl...@google.com> wrote:
>>
>>> I didn't have time for this, but it just bit me. We definitely have
>>> Guava on the API surface of runner support code in ways that get
>>> incompatibly shaded. I will probably start "1a" by making a shaded library
>>> org.apache.beam:vendored-guava and starting to use it. It sounds like there
>>> is generally unanimous support for that much, anyhow.
>>>
>>> Kenn
>>>
>>> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <al...@apache.org>
>>> wrote:
>>>
>>>> Thanks Ismaël for bringing up this discussion again!
>>>>
>>>> I would be in favour of 1) and more specifically of 1a)
>>>>
>>>> Aljoscha
>>>>
>>>>
>>>> On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>> You can always run tests on post shaded artifacts instead of the
>>>> compiled classes, it just requires us to change our maven surefire / gradle
>>>> test configurations but it is true that most IDEs would behave better with
>>>> a dependency jar unless you delegate all the build/test actions to the
>>>> build system and then it won't matter.
>>>>
>>>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <kl...@google.com>
>>>> wrote:
>>>>
>>>>> There's also, with additional overhead,
>>>>>
>>>>> 1a) A relocated and shipped package for each thing we want to
>>>>> relocate. I think this has also been tried outside Beam...
>>>>>
>>>>> Pros:
>>>>> * all the pros of 1) plus no bloat beyond what is necessary
>>>>> Cons:
>>>>> * abandons whitelist approach for public deps, reverting to blacklist
>>>>> approach for trouble things like guava, so a bit less principled
>>>>>
>>>>> For both 1) and 1a) I would add:
>>>>>
>>>>> Pros:
>>>>> * clearly readable dependency since code will `import
>>>>> org.apache.beam.private.guava21` and IDEs will understand it is a
>>>>> distinct lilbrary
>>>>> * can run tests on unpackaged classes, as long as deps are shaded or
>>>>> provided as jars
>>>>> * no mysterious action at a distance from inherited configuration
>>>>> Cons:
>>>>> * need to adjust imports
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I would suggest that either we use:
>>>>>> 1) A common deps package containing shaded dependencies allows for
>>>>>> Pros
>>>>>> * doesn't require the user to build an uber jar
>>>>>> Risks
>>>>>> * dependencies package will keep growing even if something is or
>>>>>> isn't needed by all of Apache Beam leading to a large jar anyways negating
>>>>>> any space savings
>>>>>>
>>>>>> 2) Shade within each module to a common location like
>>>>>> org.apache.beam.relocated.guava....
>>>>>> Pros
>>>>>> * you only get the shaded dependencies of the things that are required
>>>>>> * its one less dependency for users to manage
>>>>>> Risks
>>>>>> * requires an uber jar to be built to get the space savings (either
>>>>>> by a user or a distribution of Apache Beam) otherwise we negate any space
>>>>>> savings.
>>>>>>
>>>>>> If we either use a common relocation scheme or a dependencies jar
>>>>>> then each relocation should specifically contain the version number of the
>>>>>> package because we would like to allow for us to be using two different
>>>>>> versions of the same library.
>>>>>>
>>>>>> For the common deps package approach, should we check in code where
>>>>>> the imports contain the relocated location (e.g. import
>>>>>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <
>>>>>> jb@nanthrax.net> wrote:
>>>>>>
>>>>>>> Thanks for bringing that back.
>>>>>>>
>>>>>>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>>>>>>> common deps module that we include once (but the user will have to
>>>>>>> explicitly define the dep) ?
>>>>>>>
>>>>>>> Basically, what do you propose for protobuf (unfortunately, I don't
>>>>>>> see an obvious) ?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>>>>>
>>>>>>>> Hello, I wanted to bring back this subject because I think we should
>>>>>>>> take action on this and at least first have a shaded version of
>>>>>>>> guava.
>>>>>>>> I was playing with a toy project and I did the procedure we use to
>>>>>>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>>>>>>> creating an uber jar and I realized that the size of the jar was way
>>>>>>>> bigger than I expected, and the fact that we shade guava in every
>>>>>>>> module contributes to this. I found guava shaded on:
>>>>>>>>
>>>>>>>> sdks/java/core
>>>>>>>> runners/core-construction-java
>>>>>>>> runners/core-java
>>>>>>>> model/job-management
>>>>>>>> runners/spark
>>>>>>>> sdks/java/io/hadoop-file-system
>>>>>>>> sdks/java/io/kafka
>>>>>>>>
>>>>>>>> This means at least 6 times more of the size it should which counts
>>>>>>>> in
>>>>>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>>>>>>> reduce by using a shaded version of the library.
>>>>>>>>
>>>>>>>> I add this point to the previous ones mentioned during the
>>>>>>>> discussion
>>>>>>>> because this goes against the end user experience and affects us all
>>>>>>>> (devs/users).
>>>>>>>>
>>>>>>>> Another question is if we should shade (and how) protocol buffers
>>>>>>>> because now with the portability work we are exposing it closer to
>>>>>>>> the
>>>>>>>> end users. I say this because I also found an issue while running a
>>>>>>>> job on YARN with the spark runner because hadoop-common includes
>>>>>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3
>>>>>>>> as
>>>>>>>> a dependency to be able to use triggers (note the Spark runner
>>>>>>>> translates them using some method from runners/core-java). Since
>>>>>>>> hadoop-common is provided in the cluster with the older version of
>>>>>>>> protobuf, I am afraid that this will bite us in the future.
>>>>>>>>
>>>>>>>> Ismaël
>>>>>>>>
>>>>>>>> ps. There is already a JIRA for that shading for protobuf on
>>>>>>>> hadoop-common but this is not coming until version 3 is out.
>>>>>>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>>>>
>>>>>>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>>>>>>> up having guava shaded twice via its shaded version on
>>>>>>>> core-construction-java.
>>>>>>>>
>>>>>>>> ps3. Of course this message means a de-facto +1 at least to do it
>>>>>>>> for
>>>>>>>> guava and evaluate it for other libs.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <
>>>>>>>> lcwik@google.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> An issue to call out is how to deal with our generated code (.avro
>>>>>>>>> and
>>>>>>>>> .proto) as I don't believe those plugins allow you to generate
>>>>>>>>> code using a
>>>>>>>>> shaded package prefix on imports.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <
>>>>>>>>> tgroh@google.com.invalid>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading
>>>>>>>>>> work
>>>>>>>>>> every time for dependencies we know we'll use.
>>>>>>>>>>
>>>>>>>>>> This also means that if we end up pulling in transitive
>>>>>>>>>> dependencies we
>>>>>>>>>> don't want in any particular module we can avoid having to adjust
>>>>>>>>>> our
>>>>>>>>>> repackaging strategy for that module - which I have run into
>>>>>>>>>> face-first in
>>>>>>>>>> the past.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <
>>>>>>>>>> klk@google.com.invalid>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Shading is a big part of how we keep our dependencies sane in
>>>>>>>>>>> Beam. But
>>>>>>>>>>> downsides: shading is super slow, causes massive jar bloat, and
>>>>>>>>>>> kind of
>>>>>>>>>>> hard to get right because artifacts and namespaces are not
>>>>>>>>>>> 1-to-1.
>>>>>>>>>>>
>>>>>>>>>>> I know that some communities distribute their own shaded
>>>>>>>>>>> distributions of
>>>>>>>>>>> dependencies. I had a thought about doing something similar that
>>>>>>>>>>> I wanted
>>>>>>>>>>> to throw out there for people to poke holes in.
>>>>>>>>>>>
>>>>>>>>>>> To set the scene, here is how I view shading:
>>>>>>>>>>>
>>>>>>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>>>>>>   - Public deps are used for data interchange; users must share
>>>>>>>>>>> these
>>>>>>>>>>>
>>>>>>>>>> deps.
>>>>>>>>>>
>>>>>>>>>>>   - Private deps are just functionality and can be hidden (in
>>>>>>>>>>> our case,
>>>>>>>>>>> relocated + bundled)
>>>>>>>>>>>   - It isn't necessarily that simple, because public and private
>>>>>>>>>>> deps
>>>>>>>>>>>
>>>>>>>>>> might
>>>>>>>>>>
>>>>>>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>>>>>>
>>>>>>>>>>> Shading is an implementation detail of expressing these
>>>>>>>>>>> characteristics.
>>>>>>>>>>>
>>>>>>>>>> We
>>>>>>>>>>
>>>>>>>>>>> use shading selectively because of its downsides I mentioned
>>>>>>>>>>> above.
>>>>>>>>>>>
>>>>>>>>>>> But what about this idea: Introduce shaded deps as a single
>>>>>>>>>>> separate
>>>>>>>>>>> artifact.
>>>>>>>>>>>
>>>>>>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated
>>>>>>>>>>> versions of
>>>>>>>>>>> everything we want to shade
>>>>>>>>>>>
>>>>>>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or
>>>>>>>>>>> bundling -
>>>>>>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this
>>>>>>>>>>> is what
>>>>>>>>>>> they
>>>>>>>>>>> are rewritten to
>>>>>>>>>>>
>>>>>>>>>>> Some benefits
>>>>>>>>>>>
>>>>>>>>>>>   - much faster builds of other modules
>>>>>>>>>>>   - only one shaded uber jar
>>>>>>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>>>>>>   - can use maven enforcer to forbid imports like
>>>>>>>>>>> com.google.common
>>>>>>>>>>>   - configuration all in one place
>>>>>>>>>>>   - no automated rewriting of our real code, which has led to
>>>>>>>>>>> some major
>>>>>>>>>>> confusion
>>>>>>>>>>>   - easy to implement incrementally
>>>>>>>>>>>
>>>>>>>>>>> Downsides:
>>>>>>>>>>>
>>>>>>>>>>>   - plenty of effort work to get there
>>>>>>>>>>>   - unclear how many different such deps modules we need;
>>>>>>>>>>> sharing them
>>>>>>>>>>>
>>>>>>>>>> could
>>>>>>>>>>
>>>>>>>>>>> get weird
>>>>>>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>>>>>>
>>>>>>>>>>> Just something I was musing as I spent another evening waiting
>>>>>>>>>>> for slow
>>>>>>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbonofre@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Why not dropping guava for all beam codebase? With java 8 it is quite easy
to do it and avoid a bunch of conflicts. Did it in 2 projects with quite a
good result.

Le 1 févr. 2018 06:50, "Lukasz Cwik" <lc...@google.com> a écrit :

> Make sure to include the guava version in the artifact name so that we can
> have multiple vendored versions.
>
> On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <kl...@google.com> wrote:
>
>> I didn't have time for this, but it just bit me. We definitely have Guava
>> on the API surface of runner support code in ways that get incompatibly
>> shaded. I will probably start "1a" by making a shaded library
>> org.apache.beam:vendored-guava and starting to use it. It sounds like there
>> is generally unanimous support for that much, anyhow.
>>
>> Kenn
>>
>> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <al...@apache.org>
>> wrote:
>>
>>> Thanks Ismaël for bringing up this discussion again!
>>>
>>> I would be in favour of 1) and more specifically of 1a)
>>>
>>> Aljoscha
>>>
>>>
>>> On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> You can always run tests on post shaded artifacts instead of the
>>> compiled classes, it just requires us to change our maven surefire / gradle
>>> test configurations but it is true that most IDEs would behave better with
>>> a dependency jar unless you delegate all the build/test actions to the
>>> build system and then it won't matter.
>>>
>>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> There's also, with additional overhead,
>>>>
>>>> 1a) A relocated and shipped package for each thing we want to relocate.
>>>> I think this has also been tried outside Beam...
>>>>
>>>> Pros:
>>>> * all the pros of 1) plus no bloat beyond what is necessary
>>>> Cons:
>>>> * abandons whitelist approach for public deps, reverting to blacklist
>>>> approach for trouble things like guava, so a bit less principled
>>>>
>>>> For both 1) and 1a) I would add:
>>>>
>>>> Pros:
>>>> * clearly readable dependency since code will `import
>>>> org.apache.beam.private.guava21` and IDEs will understand it is a
>>>> distinct lilbrary
>>>> * can run tests on unpackaged classes, as long as deps are shaded or
>>>> provided as jars
>>>> * no mysterious action at a distance from inherited configuration
>>>> Cons:
>>>> * need to adjust imports
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> I would suggest that either we use:
>>>>> 1) A common deps package containing shaded dependencies allows for
>>>>> Pros
>>>>> * doesn't require the user to build an uber jar
>>>>> Risks
>>>>> * dependencies package will keep growing even if something is or isn't
>>>>> needed by all of Apache Beam leading to a large jar anyways negating any
>>>>> space savings
>>>>>
>>>>> 2) Shade within each module to a common location like
>>>>> org.apache.beam.relocated.guava....
>>>>> Pros
>>>>> * you only get the shaded dependencies of the things that are required
>>>>> * its one less dependency for users to manage
>>>>> Risks
>>>>> * requires an uber jar to be built to get the space savings (either by
>>>>> a user or a distribution of Apache Beam) otherwise we negate any space
>>>>> savings.
>>>>>
>>>>> If we either use a common relocation scheme or a dependencies jar then
>>>>> each relocation should specifically contain the version number of the
>>>>> package because we would like to allow for us to be using two different
>>>>> versions of the same library.
>>>>>
>>>>> For the common deps package approach, should we check in code where
>>>>> the imports contain the relocated location (e.g. import
>>>>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>>>>
>>>>>
>>>>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb@nanthrax.net
>>>>> > wrote:
>>>>>
>>>>>> Thanks for bringing that back.
>>>>>>
>>>>>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>>>>>> common deps module that we include once (but the user will have to
>>>>>> explicitly define the dep) ?
>>>>>>
>>>>>> Basically, what do you propose for protobuf (unfortunately, I don't
>>>>>> see an obvious) ?
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>>
>>>>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>>>>
>>>>>>> Hello, I wanted to bring back this subject because I think we should
>>>>>>> take action on this and at least first have a shaded version of
>>>>>>> guava.
>>>>>>> I was playing with a toy project and I did the procedure we use to
>>>>>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>>>>>> creating an uber jar and I realized that the size of the jar was way
>>>>>>> bigger than I expected, and the fact that we shade guava in every
>>>>>>> module contributes to this. I found guava shaded on:
>>>>>>>
>>>>>>> sdks/java/core
>>>>>>> runners/core-construction-java
>>>>>>> runners/core-java
>>>>>>> model/job-management
>>>>>>> runners/spark
>>>>>>> sdks/java/io/hadoop-file-system
>>>>>>> sdks/java/io/kafka
>>>>>>>
>>>>>>> This means at least 6 times more of the size it should which counts
>>>>>>> in
>>>>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>>>>>> reduce by using a shaded version of the library.
>>>>>>>
>>>>>>> I add this point to the previous ones mentioned during the discussion
>>>>>>> because this goes against the end user experience and affects us all
>>>>>>> (devs/users).
>>>>>>>
>>>>>>> Another question is if we should shade (and how) protocol buffers
>>>>>>> because now with the portability work we are exposing it closer to
>>>>>>> the
>>>>>>> end users. I say this because I also found an issue while running a
>>>>>>> job on YARN with the spark runner because hadoop-common includes
>>>>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>>>>>> a dependency to be able to use triggers (note the Spark runner
>>>>>>> translates them using some method from runners/core-java). Since
>>>>>>> hadoop-common is provided in the cluster with the older version of
>>>>>>> protobuf, I am afraid that this will bite us in the future.
>>>>>>>
>>>>>>> Ismaël
>>>>>>>
>>>>>>> ps. There is already a JIRA for that shading for protobuf on
>>>>>>> hadoop-common but this is not coming until version 3 is out.
>>>>>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>>>
>>>>>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>>>>>> up having guava shaded twice via its shaded version on
>>>>>>> core-construction-java.
>>>>>>>
>>>>>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>>>>>> guava and evaluate it for other libs.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <
>>>>>>> lcwik@google.com.invalid> wrote:
>>>>>>>
>>>>>>>> An issue to call out is how to deal with our generated code (.avro
>>>>>>>> and
>>>>>>>> .proto) as I don't believe those plugins allow you to generate code
>>>>>>>> using a
>>>>>>>> shaded package prefix on imports.
>>>>>>>>
>>>>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <
>>>>>>>> tgroh@google.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading
>>>>>>>>> work
>>>>>>>>> every time for dependencies we know we'll use.
>>>>>>>>>
>>>>>>>>> This also means that if we end up pulling in transitive
>>>>>>>>> dependencies we
>>>>>>>>> don't want in any particular module we can avoid having to adjust
>>>>>>>>> our
>>>>>>>>> repackaging strategy for that module - which I have run into
>>>>>>>>> face-first in
>>>>>>>>> the past.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <
>>>>>>>>> klk@google.com.invalid>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Shading is a big part of how we keep our dependencies sane in
>>>>>>>>>> Beam. But
>>>>>>>>>> downsides: shading is super slow, causes massive jar bloat, and
>>>>>>>>>> kind of
>>>>>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>>>>>
>>>>>>>>>> I know that some communities distribute their own shaded
>>>>>>>>>> distributions of
>>>>>>>>>> dependencies. I had a thought about doing something similar that
>>>>>>>>>> I wanted
>>>>>>>>>> to throw out there for people to poke holes in.
>>>>>>>>>>
>>>>>>>>>> To set the scene, here is how I view shading:
>>>>>>>>>>
>>>>>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>>>>>   - Public deps are used for data interchange; users must share
>>>>>>>>>> these
>>>>>>>>>>
>>>>>>>>> deps.
>>>>>>>>>
>>>>>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>>>>>> case,
>>>>>>>>>> relocated + bundled)
>>>>>>>>>>   - It isn't necessarily that simple, because public and private
>>>>>>>>>> deps
>>>>>>>>>>
>>>>>>>>> might
>>>>>>>>>
>>>>>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>>>>>
>>>>>>>>>> Shading is an implementation detail of expressing these
>>>>>>>>>> characteristics.
>>>>>>>>>>
>>>>>>>>> We
>>>>>>>>>
>>>>>>>>>> use shading selectively because of its downsides I mentioned
>>>>>>>>>> above.
>>>>>>>>>>
>>>>>>>>>> But what about this idea: Introduce shaded deps as a single
>>>>>>>>>> separate
>>>>>>>>>> artifact.
>>>>>>>>>>
>>>>>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated
>>>>>>>>>> versions of
>>>>>>>>>> everything we want to shade
>>>>>>>>>>
>>>>>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or
>>>>>>>>>> bundling -
>>>>>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this
>>>>>>>>>> is what
>>>>>>>>>> they
>>>>>>>>>> are rewritten to
>>>>>>>>>>
>>>>>>>>>> Some benefits
>>>>>>>>>>
>>>>>>>>>>   - much faster builds of other modules
>>>>>>>>>>   - only one shaded uber jar
>>>>>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>>>>>   - can use maven enforcer to forbid imports like
>>>>>>>>>> com.google.common
>>>>>>>>>>   - configuration all in one place
>>>>>>>>>>   - no automated rewriting of our real code, which has led to
>>>>>>>>>> some major
>>>>>>>>>> confusion
>>>>>>>>>>   - easy to implement incrementally
>>>>>>>>>>
>>>>>>>>>> Downsides:
>>>>>>>>>>
>>>>>>>>>>   - plenty of effort work to get there
>>>>>>>>>>   - unclear how many different such deps modules we need; sharing
>>>>>>>>>> them
>>>>>>>>>>
>>>>>>>>> could
>>>>>>>>>
>>>>>>>>>> get weird
>>>>>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>>>>>
>>>>>>>>>> Just something I was musing as I spent another evening waiting
>>>>>>>>>> for slow
>>>>>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>> --
>>>>>> Jean-Baptiste Onofré
>>>>>> jbonofre@apache.org
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Lukasz Cwik <lc...@google.com>.
Make sure to include the guava version in the artifact name so that we can
have multiple vendored versions.

On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <kl...@google.com> wrote:

> I didn't have time for this, but it just bit me. We definitely have Guava
> on the API surface of runner support code in ways that get incompatibly
> shaded. I will probably start "1a" by making a shaded library
> org.apache.beam:vendored-guava and starting to use it. It sounds like there
> is generally unanimous support for that much, anyhow.
>
> Kenn
>
> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> Thanks Ismaël for bringing up this discussion again!
>>
>> I would be in favour of 1) and more specifically of 1a)
>>
>> Aljoscha
>>
>>
>> On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com> wrote:
>>
>> You can always run tests on post shaded artifacts instead of the compiled
>> classes, it just requires us to change our maven surefire / gradle test
>> configurations but it is true that most IDEs would behave better with a
>> dependency jar unless you delegate all the build/test actions to the build
>> system and then it won't matter.
>>
>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <kl...@google.com> wrote:
>>
>>> There's also, with additional overhead,
>>>
>>> 1a) A relocated and shipped package for each thing we want to relocate.
>>> I think this has also been tried outside Beam...
>>>
>>> Pros:
>>> * all the pros of 1) plus no bloat beyond what is necessary
>>> Cons:
>>> * abandons whitelist approach for public deps, reverting to blacklist
>>> approach for trouble things like guava, so a bit less principled
>>>
>>> For both 1) and 1a) I would add:
>>>
>>> Pros:
>>> * clearly readable dependency since code will `import
>>> org.apache.beam.private.guava21` and IDEs will understand it is a
>>> distinct lilbrary
>>> * can run tests on unpackaged classes, as long as deps are shaded or
>>> provided as jars
>>> * no mysterious action at a distance from inherited configuration
>>> Cons:
>>> * need to adjust imports
>>>
>>> Kenn
>>>
>>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> I would suggest that either we use:
>>>> 1) A common deps package containing shaded dependencies allows for
>>>> Pros
>>>> * doesn't require the user to build an uber jar
>>>> Risks
>>>> * dependencies package will keep growing even if something is or isn't
>>>> needed by all of Apache Beam leading to a large jar anyways negating any
>>>> space savings
>>>>
>>>> 2) Shade within each module to a common location like
>>>> org.apache.beam.relocated.guava....
>>>> Pros
>>>> * you only get the shaded dependencies of the things that are required
>>>> * its one less dependency for users to manage
>>>> Risks
>>>> * requires an uber jar to be built to get the space savings (either by
>>>> a user or a distribution of Apache Beam) otherwise we negate any space
>>>> savings.
>>>>
>>>> If we either use a common relocation scheme or a dependencies jar then
>>>> each relocation should specifically contain the version number of the
>>>> package because we would like to allow for us to be using two different
>>>> versions of the same library.
>>>>
>>>> For the common deps package approach, should we check in code where the
>>>> imports contain the relocated location (e.g. import
>>>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>>>
>>>>
>>>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Thanks for bringing that back.
>>>>>
>>>>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>>>>> common deps module that we include once (but the user will have to
>>>>> explicitly define the dep) ?
>>>>>
>>>>> Basically, what do you propose for protobuf (unfortunately, I don't
>>>>> see an obvious) ?
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>>
>>>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>>>
>>>>>> Hello, I wanted to bring back this subject because I think we should
>>>>>> take action on this and at least first have a shaded version of guava.
>>>>>> I was playing with a toy project and I did the procedure we use to
>>>>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>>>>> creating an uber jar and I realized that the size of the jar was way
>>>>>> bigger than I expected, and the fact that we shade guava in every
>>>>>> module contributes to this. I found guava shaded on:
>>>>>>
>>>>>> sdks/java/core
>>>>>> runners/core-construction-java
>>>>>> runners/core-java
>>>>>> model/job-management
>>>>>> runners/spark
>>>>>> sdks/java/io/hadoop-file-system
>>>>>> sdks/java/io/kafka
>>>>>>
>>>>>> This means at least 6 times more of the size it should which counts in
>>>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>>>>> reduce by using a shaded version of the library.
>>>>>>
>>>>>> I add this point to the previous ones mentioned during the discussion
>>>>>> because this goes against the end user experience and affects us all
>>>>>> (devs/users).
>>>>>>
>>>>>> Another question is if we should shade (and how) protocol buffers
>>>>>> because now with the portability work we are exposing it closer to the
>>>>>> end users. I say this because I also found an issue while running a
>>>>>> job on YARN with the spark runner because hadoop-common includes
>>>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>>>>> a dependency to be able to use triggers (note the Spark runner
>>>>>> translates them using some method from runners/core-java). Since
>>>>>> hadoop-common is provided in the cluster with the older version of
>>>>>> protobuf, I am afraid that this will bite us in the future.
>>>>>>
>>>>>> Ismaël
>>>>>>
>>>>>> ps. There is already a JIRA for that shading for protobuf on
>>>>>> hadoop-common but this is not coming until version 3 is out.
>>>>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>>
>>>>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>>>>> up having guava shaded twice via its shaded version on
>>>>>> core-construction-java.
>>>>>>
>>>>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>>>>> guava and evaluate it for other libs.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <
>>>>>> lcwik@google.com.invalid> wrote:
>>>>>>
>>>>>>> An issue to call out is how to deal with our generated code (.avro
>>>>>>> and
>>>>>>> .proto) as I don't believe those plugins allow you to generate code
>>>>>>> using a
>>>>>>> shaded package prefix on imports.
>>>>>>>
>>>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <
>>>>>>> tgroh@google.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading
>>>>>>>> work
>>>>>>>> every time for dependencies we know we'll use.
>>>>>>>>
>>>>>>>> This also means that if we end up pulling in transitive
>>>>>>>> dependencies we
>>>>>>>> don't want in any particular module we can avoid having to adjust
>>>>>>>> our
>>>>>>>> repackaging strategy for that module - which I have run into
>>>>>>>> face-first in
>>>>>>>> the past.
>>>>>>>>
>>>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <
>>>>>>>> klk@google.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Shading is a big part of how we keep our dependencies sane in
>>>>>>>>> Beam. But
>>>>>>>>> downsides: shading is super slow, causes massive jar bloat, and
>>>>>>>>> kind of
>>>>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>>>>
>>>>>>>>> I know that some communities distribute their own shaded
>>>>>>>>> distributions of
>>>>>>>>> dependencies. I had a thought about doing something similar that I
>>>>>>>>> wanted
>>>>>>>>> to throw out there for people to poke holes in.
>>>>>>>>>
>>>>>>>>> To set the scene, here is how I view shading:
>>>>>>>>>
>>>>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>>>>   - Public deps are used for data interchange; users must share
>>>>>>>>> these
>>>>>>>>>
>>>>>>>> deps.
>>>>>>>>
>>>>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>>>>> case,
>>>>>>>>> relocated + bundled)
>>>>>>>>>   - It isn't necessarily that simple, because public and private
>>>>>>>>> deps
>>>>>>>>>
>>>>>>>> might
>>>>>>>>
>>>>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>>>>
>>>>>>>>> Shading is an implementation detail of expressing these
>>>>>>>>> characteristics.
>>>>>>>>>
>>>>>>>> We
>>>>>>>>
>>>>>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>>>>>
>>>>>>>>> But what about this idea: Introduce shaded deps as a single
>>>>>>>>> separate
>>>>>>>>> artifact.
>>>>>>>>>
>>>>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated
>>>>>>>>> versions of
>>>>>>>>> everything we want to shade
>>>>>>>>>
>>>>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or
>>>>>>>>> bundling -
>>>>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is
>>>>>>>>> what
>>>>>>>>> they
>>>>>>>>> are rewritten to
>>>>>>>>>
>>>>>>>>> Some benefits
>>>>>>>>>
>>>>>>>>>   - much faster builds of other modules
>>>>>>>>>   - only one shaded uber jar
>>>>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>>>>>   - configuration all in one place
>>>>>>>>>   - no automated rewriting of our real code, which has led to some
>>>>>>>>> major
>>>>>>>>> confusion
>>>>>>>>>   - easy to implement incrementally
>>>>>>>>>
>>>>>>>>> Downsides:
>>>>>>>>>
>>>>>>>>>   - plenty of effort work to get there
>>>>>>>>>   - unclear how many different such deps modules we need; sharing
>>>>>>>>> them
>>>>>>>>>
>>>>>>>> could
>>>>>>>>
>>>>>>>>> get weird
>>>>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>>>>
>>>>>>>>> Just something I was musing as I spent another evening waiting for
>>>>>>>>> slow
>>>>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbonofre@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Kenneth Knowles <kl...@google.com>.
I didn't have time for this, but it just bit me. We definitely have Guava
on the API surface of runner support code in ways that get incompatibly
shaded. I will probably start "1a" by making a shaded library
org.apache.beam:vendored-guava and starting to use it. It sounds like there
is generally unanimous support for that much, anyhow.

Kenn

On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> Thanks Ismaël for bringing up this discussion again!
>
> I would be in favour of 1) and more specifically of 1a)
>
> Aljoscha
>
>
> On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com> wrote:
>
> You can always run tests on post shaded artifacts instead of the compiled
> classes, it just requires us to change our maven surefire / gradle test
> configurations but it is true that most IDEs would behave better with a
> dependency jar unless you delegate all the build/test actions to the build
> system and then it won't matter.
>
> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <kl...@google.com> wrote:
>
>> There's also, with additional overhead,
>>
>> 1a) A relocated and shipped package for each thing we want to relocate. I
>> think this has also been tried outside Beam...
>>
>> Pros:
>> * all the pros of 1) plus no bloat beyond what is necessary
>> Cons:
>> * abandons whitelist approach for public deps, reverting to blacklist
>> approach for trouble things like guava, so a bit less principled
>>
>> For both 1) and 1a) I would add:
>>
>> Pros:
>> * clearly readable dependency since code will `import
>> org.apache.beam.private.guava21` and IDEs will understand it is a
>> distinct lilbrary
>> * can run tests on unpackaged classes, as long as deps are shaded or
>> provided as jars
>> * no mysterious action at a distance from inherited configuration
>> Cons:
>> * need to adjust imports
>>
>> Kenn
>>
>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote:
>>
>>> I would suggest that either we use:
>>> 1) A common deps package containing shaded dependencies allows for
>>> Pros
>>> * doesn't require the user to build an uber jar
>>> Risks
>>> * dependencies package will keep growing even if something is or isn't
>>> needed by all of Apache Beam leading to a large jar anyways negating any
>>> space savings
>>>
>>> 2) Shade within each module to a common location like
>>> org.apache.beam.relocated.guava....
>>> Pros
>>> * you only get the shaded dependencies of the things that are required
>>> * its one less dependency for users to manage
>>> Risks
>>> * requires an uber jar to be built to get the space savings (either by a
>>> user or a distribution of Apache Beam) otherwise we negate any space
>>> savings.
>>>
>>> If we either use a common relocation scheme or a dependencies jar then
>>> each relocation should specifically contain the version number of the
>>> package because we would like to allow for us to be using two different
>>> versions of the same library.
>>>
>>> For the common deps package approach, should we check in code where the
>>> imports contain the relocated location (e.g. import
>>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>>
>>>
>>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>
>>>> Thanks for bringing that back.
>>>>
>>>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>>>> common deps module that we include once (but the user will have to
>>>> explicitly define the dep) ?
>>>>
>>>> Basically, what do you propose for protobuf (unfortunately, I don't see
>>>> an obvious) ?
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>>
>>>>> Hello, I wanted to bring back this subject because I think we should
>>>>> take action on this and at least first have a shaded version of guava.
>>>>> I was playing with a toy project and I did the procedure we use to
>>>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>>>> creating an uber jar and I realized that the size of the jar was way
>>>>> bigger than I expected, and the fact that we shade guava in every
>>>>> module contributes to this. I found guava shaded on:
>>>>>
>>>>> sdks/java/core
>>>>> runners/core-construction-java
>>>>> runners/core-java
>>>>> model/job-management
>>>>> runners/spark
>>>>> sdks/java/io/hadoop-file-system
>>>>> sdks/java/io/kafka
>>>>>
>>>>> This means at least 6 times more of the size it should which counts in
>>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>>>> reduce by using a shaded version of the library.
>>>>>
>>>>> I add this point to the previous ones mentioned during the discussion
>>>>> because this goes against the end user experience and affects us all
>>>>> (devs/users).
>>>>>
>>>>> Another question is if we should shade (and how) protocol buffers
>>>>> because now with the portability work we are exposing it closer to the
>>>>> end users. I say this because I also found an issue while running a
>>>>> job on YARN with the spark runner because hadoop-common includes
>>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>>>> a dependency to be able to use triggers (note the Spark runner
>>>>> translates them using some method from runners/core-java). Since
>>>>> hadoop-common is provided in the cluster with the older version of
>>>>> protobuf, I am afraid that this will bite us in the future.
>>>>>
>>>>> Ismaël
>>>>>
>>>>> ps. There is already a JIRA for that shading for protobuf on
>>>>> hadoop-common but this is not coming until version 3 is out.
>>>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>>
>>>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>>>> up having guava shaded twice via its shaded version on
>>>>> core-construction-java.
>>>>>
>>>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>>>> guava and evaluate it for other libs.
>>>>>
>>>>>
>>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> An issue to call out is how to deal with our generated code (.avro and
>>>>>> .proto) as I don't believe those plugins allow you to generate code
>>>>>> using a
>>>>>> shaded package prefix on imports.
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <
>>>>>> tgroh@google.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>>>>>> every time for dependencies we know we'll use.
>>>>>>>
>>>>>>> This also means that if we end up pulling in transitive dependencies
>>>>>>> we
>>>>>>> don't want in any particular module we can avoid having to adjust our
>>>>>>> repackaging strategy for that module - which I have run into
>>>>>>> face-first in
>>>>>>> the past.
>>>>>>>
>>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <
>>>>>>> klk@google.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Shading is a big part of how we keep our dependencies sane in Beam.
>>>>>>>> But
>>>>>>>> downsides: shading is super slow, causes massive jar bloat, and
>>>>>>>> kind of
>>>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>>>
>>>>>>>> I know that some communities distribute their own shaded
>>>>>>>> distributions of
>>>>>>>> dependencies. I had a thought about doing something similar that I
>>>>>>>> wanted
>>>>>>>> to throw out there for people to poke holes in.
>>>>>>>>
>>>>>>>> To set the scene, here is how I view shading:
>>>>>>>>
>>>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>>>   - Public deps are used for data interchange; users must share
>>>>>>>> these
>>>>>>>>
>>>>>>> deps.
>>>>>>>
>>>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>>>> case,
>>>>>>>> relocated + bundled)
>>>>>>>>   - It isn't necessarily that simple, because public and private
>>>>>>>> deps
>>>>>>>>
>>>>>>> might
>>>>>>>
>>>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>>>
>>>>>>>> Shading is an implementation detail of expressing these
>>>>>>>> characteristics.
>>>>>>>>
>>>>>>> We
>>>>>>>
>>>>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>>>>
>>>>>>>> But what about this idea: Introduce shaded deps as a single separate
>>>>>>>> artifact.
>>>>>>>>
>>>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated
>>>>>>>> versions of
>>>>>>>> everything we want to shade
>>>>>>>>
>>>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling
>>>>>>>> -
>>>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is
>>>>>>>> what
>>>>>>>> they
>>>>>>>> are rewritten to
>>>>>>>>
>>>>>>>> Some benefits
>>>>>>>>
>>>>>>>>   - much faster builds of other modules
>>>>>>>>   - only one shaded uber jar
>>>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>>>>   - configuration all in one place
>>>>>>>>   - no automated rewriting of our real code, which has led to some
>>>>>>>> major
>>>>>>>> confusion
>>>>>>>>   - easy to implement incrementally
>>>>>>>>
>>>>>>>> Downsides:
>>>>>>>>
>>>>>>>>   - plenty of effort work to get there
>>>>>>>>   - unclear how many different such deps modules we need; sharing
>>>>>>>> them
>>>>>>>>
>>>>>>> could
>>>>>>>
>>>>>>>> get weird
>>>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>>>
>>>>>>>> Just something I was musing as I spent another evening waiting for
>>>>>>>> slow
>>>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbonofre@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>
>>>
>>
>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Aljoscha Krettek <al...@apache.org>.
Thanks Ismaël for bringing up this discussion again!

I would be in favour of 1) and more specifically of 1a)

Aljoscha

> On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com> wrote:
> 
> You can always run tests on post shaded artifacts instead of the compiled classes, it just requires us to change our maven surefire / gradle test configurations but it is true that most IDEs would behave better with a dependency jar unless you delegate all the build/test actions to the build system and then it won't matter.
> 
> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <klk@google.com <ma...@google.com>> wrote:
> There's also, with additional overhead,
> 
> 1a) A relocated and shipped package for each thing we want to relocate. I think this has also been tried outside Beam...
> 
> Pros:
> * all the pros of 1) plus no bloat beyond what is necessary
> Cons:
> * abandons whitelist approach for public deps, reverting to blacklist approach for trouble things like guava, so a bit less principled
> 
> For both 1) and 1a) I would add:
> 
> Pros:
> * clearly readable dependency since code will `import org.apache.beam.private.guava21` and IDEs will understand it is a distinct lilbrary
> * can run tests on unpackaged classes, as long as deps are shaded or provided as jars
> * no mysterious action at a distance from inherited configuration
> Cons:
> * need to adjust imports
> 
> Kenn
> 
> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lcwik@google.com <ma...@google.com>> wrote:
> I would suggest that either we use:
> 1) A common deps package containing shaded dependencies allows for 
> Pros
> * doesn't require the user to build an uber jar
> Risks
> * dependencies package will keep growing even if something is or isn't needed by all of Apache Beam leading to a large jar anyways negating any space savings
> 
> 2) Shade within each module to a common location like org.apache.beam.relocated.guava....
> Pros
> * you only get the shaded dependencies of the things that are required
> * its one less dependency for users to manage
> Risks
> * requires an uber jar to be built to get the space savings (either by a user or a distribution of Apache Beam) otherwise we negate any space savings.
> 
> If we either use a common relocation scheme or a dependencies jar then each relocation should specifically contain the version number of the package because we would like to allow for us to be using two different versions of the same library.
> 
> For the common deps package approach, should we check in code where the imports contain the relocated location (e.g. import org.apache.beam.guava.20.0.com <http://org.apache.beam.guava.20.0.com/>.google.common.collect.ImmutableList)?
> 
> 
> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
> Thanks for bringing that back.
> 
> Indeed guava is shaded in different uber-jar. Maybe we can have a common deps module that we include once (but the user will have to explicitly define the dep) ?
> 
> Basically, what do you propose for protobuf (unfortunately, I don't see an obvious) ?
> 
> Regards
> JB
> 
> 
> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
> Hello, I wanted to bring back this subject because I think we should
> take action on this and at least first have a shaded version of guava.
> I was playing with a toy project and I did the procedure we use to
> submit jars to a Hadoop cluster via Flink/Spark which involves
> creating an uber jar and I realized that the size of the jar was way
> bigger than I expected, and the fact that we shade guava in every
> module contributes to this. I found guava shaded on:
> 
> sdks/java/core
> runners/core-construction-java
> runners/core-java
> model/job-management
> runners/spark
> sdks/java/io/hadoop-file-system
> sdks/java/io/kafka
> 
> This means at least 6 times more of the size it should which counts in
> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
> reduce by using a shaded version of the library.
> 
> I add this point to the previous ones mentioned during the discussion
> because this goes against the end user experience and affects us all
> (devs/users).
> 
> Another question is if we should shade (and how) protocol buffers
> because now with the portability work we are exposing it closer to the
> end users. I say this because I also found an issue while running a
> job on YARN with the spark runner because hadoop-common includes
> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
> a dependency to be able to use triggers (note the Spark runner
> translates them using some method from runners/core-java). Since
> hadoop-common is provided in the cluster with the older version of
> protobuf, I am afraid that this will bite us in the future.
> 
> Ismaël
> 
> ps. There is already a JIRA for that shading for protobuf on
> hadoop-common but this is not coming until version 3 is out.
> https://issues.apache.org/jira/browse/HADOOP-13136 <https://issues.apache.org/jira/browse/HADOOP-13136>
> 
> ps2. Extra curious situation is to see that the dataflow-runner ends
> up having guava shaded twice via its shaded version on
> core-construction-java.
> 
> ps3. Of course this message means a de-facto +1 at least to do it for
> guava and evaluate it for other libs.
> 
> 
> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> An issue to call out is how to deal with our generated code (.avro and
> .proto) as I don't believe those plugins allow you to generate code using a
> shaded package prefix on imports.
> 
> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid>
> wrote:
> 
> +1 to the goal. I'm hugely in favor of not doing the same shading work
> every time for dependencies we know we'll use.
> 
> This also means that if we end up pulling in transitive dependencies we
> don't want in any particular module we can avoid having to adjust our
> repackaging strategy for that module - which I have run into face-first in
> the past.
> 
> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
> 
> Hi all,
> 
> Shading is a big part of how we keep our dependencies sane in Beam. But
> downsides: shading is super slow, causes massive jar bloat, and kind of
> hard to get right because artifacts and namespaces are not 1-to-1.
> 
> I know that some communities distribute their own shaded distributions of
> dependencies. I had a thought about doing something similar that I wanted
> to throw out there for people to poke holes in.
> 
> To set the scene, here is how I view shading:
> 
>   - A module has public dependencies and private dependencies.
>   - Public deps are used for data interchange; users must share these
> deps.
>   - Private deps are just functionality and can be hidden (in our case,
> relocated + bundled)
>   - It isn't necessarily that simple, because public and private deps
> might
> interact in higher-order ways ("public" is contagious)
> 
> Shading is an implementation detail of expressing these characteristics.
> We
> use shading selectively because of its downsides I mentioned above.
> 
> But what about this idea: Introduce shaded deps as a single separate
> artifact.
> 
>   - sdks/java/private-deps: bundled uber jar with relocated versions of
> everything we want to shade
> 
>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
> depends on `beam-sdks-java-private-deps` and imports like
> `org.apache.beam.sdk.private.com.google.common` directly (this is what
> they
> are rewritten to
> 
> Some benefits
> 
>   - much faster builds of other modules
>   - only one shaded uber jar
>   - rare/no rebuilds of the uber jar
>   - can use maven enforcer to forbid imports like com.google.common
>   - configuration all in one place
>   - no automated rewriting of our real code, which has led to some major
> confusion
>   - easy to implement incrementally
> 
> Downsides:
> 
>   - plenty of effort work to get there
>   - unclear how many different such deps modules we need; sharing them
> could
> get weird
>   - if we hit a roadblock, we will have committed a lot of time
> 
> Just something I was musing as I spent another evening waiting for slow
> builds to try to confirm changes to brittle poms.
> 
> Kenn
> 
> 
> 
> -- 
> Jean-Baptiste Onofré
> jbonofre@apache.org <ma...@apache.org>
> http://blog.nanthrax.net <http://blog.nanthrax.net/>
> Talend - http://www.talend.com <http://www.talend.com/>
> 
> 
> 


Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Lukasz Cwik <lc...@google.com>.
You can always run tests on post shaded artifacts instead of the compiled
classes, it just requires us to change our maven surefire / gradle test
configurations but it is true that most IDEs would behave better with a
dependency jar unless you delegate all the build/test actions to the build
system and then it won't matter.

On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <kl...@google.com> wrote:

> There's also, with additional overhead,
>
> 1a) A relocated and shipped package for each thing we want to relocate. I
> think this has also been tried outside Beam...
>
> Pros:
> * all the pros of 1) plus no bloat beyond what is necessary
> Cons:
> * abandons whitelist approach for public deps, reverting to blacklist
> approach for trouble things like guava, so a bit less principled
>
> For both 1) and 1a) I would add:
>
> Pros:
> * clearly readable dependency since code will `import
> org.apache.beam.private.guava21` and IDEs will understand it is a
> distinct lilbrary
> * can run tests on unpackaged classes, as long as deps are shaded or
> provided as jars
> * no mysterious action at a distance from inherited configuration
> Cons:
> * need to adjust imports
>
> Kenn
>
> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote:
>
>> I would suggest that either we use:
>> 1) A common deps package containing shaded dependencies allows for
>> Pros
>> * doesn't require the user to build an uber jar
>> Risks
>> * dependencies package will keep growing even if something is or isn't
>> needed by all of Apache Beam leading to a large jar anyways negating any
>> space savings
>>
>> 2) Shade within each module to a common location like
>> org.apache.beam.relocated.guava....
>> Pros
>> * you only get the shaded dependencies of the things that are required
>> * its one less dependency for users to manage
>> Risks
>> * requires an uber jar to be built to get the space savings (either by a
>> user or a distribution of Apache Beam) otherwise we negate any space
>> savings.
>>
>> If we either use a common relocation scheme or a dependencies jar then
>> each relocation should specifically contain the version number of the
>> package because we would like to allow for us to be using two different
>> versions of the same library.
>>
>> For the common deps package approach, should we check in code where the
>> imports contain the relocated location (e.g. import
>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>
>>
>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>>> Thanks for bringing that back.
>>>
>>> Indeed guava is shaded in different uber-jar. Maybe we can have a common
>>> deps module that we include once (but the user will have to explicitly
>>> define the dep) ?
>>>
>>> Basically, what do you propose for protobuf (unfortunately, I don't see
>>> an obvious) ?
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>
>>>> Hello, I wanted to bring back this subject because I think we should
>>>> take action on this and at least first have a shaded version of guava.
>>>> I was playing with a toy project and I did the procedure we use to
>>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>>> creating an uber jar and I realized that the size of the jar was way
>>>> bigger than I expected, and the fact that we shade guava in every
>>>> module contributes to this. I found guava shaded on:
>>>>
>>>> sdks/java/core
>>>> runners/core-construction-java
>>>> runners/core-java
>>>> model/job-management
>>>> runners/spark
>>>> sdks/java/io/hadoop-file-system
>>>> sdks/java/io/kafka
>>>>
>>>> This means at least 6 times more of the size it should which counts in
>>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>>> reduce by using a shaded version of the library.
>>>>
>>>> I add this point to the previous ones mentioned during the discussion
>>>> because this goes against the end user experience and affects us all
>>>> (devs/users).
>>>>
>>>> Another question is if we should shade (and how) protocol buffers
>>>> because now with the portability work we are exposing it closer to the
>>>> end users. I say this because I also found an issue while running a
>>>> job on YARN with the spark runner because hadoop-common includes
>>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>>> a dependency to be able to use triggers (note the Spark runner
>>>> translates them using some method from runners/core-java). Since
>>>> hadoop-common is provided in the cluster with the older version of
>>>> protobuf, I am afraid that this will bite us in the future.
>>>>
>>>> Ismaël
>>>>
>>>> ps. There is already a JIRA for that shading for protobuf on
>>>> hadoop-common but this is not coming until version 3 is out.
>>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>>
>>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>>> up having guava shaded twice via its shaded version on
>>>> core-construction-java.
>>>>
>>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>>> guava and evaluate it for other libs.
>>>>
>>>>
>>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid>
>>>> wrote:
>>>>
>>>>> An issue to call out is how to deal with our generated code (.avro and
>>>>> .proto) as I don't believe those plugins allow you to generate code
>>>>> using a
>>>>> shaded package prefix on imports.
>>>>>
>>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tgroh@google.com.invalid
>>>>> >
>>>>> wrote:
>>>>>
>>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>>>>> every time for dependencies we know we'll use.
>>>>>>
>>>>>> This also means that if we end up pulling in transitive dependencies
>>>>>> we
>>>>>> don't want in any particular module we can avoid having to adjust our
>>>>>> repackaging strategy for that module - which I have run into
>>>>>> face-first in
>>>>>> the past.
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles
>>>>>> <kl...@google.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>>
>>>>>>> Shading is a big part of how we keep our dependencies sane in Beam.
>>>>>>> But
>>>>>>> downsides: shading is super slow, causes massive jar bloat, and kind
>>>>>>> of
>>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>>
>>>>>>> I know that some communities distribute their own shaded
>>>>>>> distributions of
>>>>>>> dependencies. I had a thought about doing something similar that I
>>>>>>> wanted
>>>>>>> to throw out there for people to poke holes in.
>>>>>>>
>>>>>>> To set the scene, here is how I view shading:
>>>>>>>
>>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>>   - Public deps are used for data interchange; users must share these
>>>>>>>
>>>>>> deps.
>>>>>>
>>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>>> case,
>>>>>>> relocated + bundled)
>>>>>>>   - It isn't necessarily that simple, because public and private deps
>>>>>>>
>>>>>> might
>>>>>>
>>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>>
>>>>>>> Shading is an implementation detail of expressing these
>>>>>>> characteristics.
>>>>>>>
>>>>>> We
>>>>>>
>>>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>>>
>>>>>>> But what about this idea: Introduce shaded deps as a single separate
>>>>>>> artifact.
>>>>>>>
>>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated versions
>>>>>>> of
>>>>>>> everything we want to shade
>>>>>>>
>>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
>>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is
>>>>>>> what
>>>>>>> they
>>>>>>> are rewritten to
>>>>>>>
>>>>>>> Some benefits
>>>>>>>
>>>>>>>   - much faster builds of other modules
>>>>>>>   - only one shaded uber jar
>>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>>>   - configuration all in one place
>>>>>>>   - no automated rewriting of our real code, which has led to some
>>>>>>> major
>>>>>>> confusion
>>>>>>>   - easy to implement incrementally
>>>>>>>
>>>>>>> Downsides:
>>>>>>>
>>>>>>>   - plenty of effort work to get there
>>>>>>>   - unclear how many different such deps modules we need; sharing
>>>>>>> them
>>>>>>>
>>>>>> could
>>>>>>
>>>>>>> get weird
>>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>>
>>>>>>> Just something I was musing as I spent another evening waiting for
>>>>>>> slow
>>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>>
>>>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Kenneth Knowles <kl...@google.com>.
There's also, with additional overhead,

1a) A relocated and shipped package for each thing we want to relocate. I
think this has also been tried outside Beam...

Pros:
* all the pros of 1) plus no bloat beyond what is necessary
Cons:
* abandons whitelist approach for public deps, reverting to blacklist
approach for trouble things like guava, so a bit less principled

For both 1) and 1a) I would add:

Pros:
* clearly readable dependency since code will `import
org.apache.beam.private.guava21` and IDEs will understand it is a distinct
lilbrary
* can run tests on unpackaged classes, as long as deps are shaded or
provided as jars
* no mysterious action at a distance from inherited configuration
Cons:
* need to adjust imports

Kenn

On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <lc...@google.com> wrote:

> I would suggest that either we use:
> 1) A common deps package containing shaded dependencies allows for
> Pros
> * doesn't require the user to build an uber jar
> Risks
> * dependencies package will keep growing even if something is or isn't
> needed by all of Apache Beam leading to a large jar anyways negating any
> space savings
>
> 2) Shade within each module to a common location like
> org.apache.beam.relocated.guava....
> Pros
> * you only get the shaded dependencies of the things that are required
> * its one less dependency for users to manage
> Risks
> * requires an uber jar to be built to get the space savings (either by a
> user or a distribution of Apache Beam) otherwise we negate any space
> savings.
>
> If we either use a common relocation scheme or a dependencies jar then
> each relocation should specifically contain the version number of the
> package because we would like to allow for us to be using two different
> versions of the same library.
>
> For the common deps package approach, should we check in code where the
> imports contain the relocated location (e.g. import
> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>
>
> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Thanks for bringing that back.
>>
>> Indeed guava is shaded in different uber-jar. Maybe we can have a common
>> deps module that we include once (but the user will have to explicitly
>> define the dep) ?
>>
>> Basically, what do you propose for protobuf (unfortunately, I don't see
>> an obvious) ?
>>
>> Regards
>> JB
>>
>>
>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>
>>> Hello, I wanted to bring back this subject because I think we should
>>> take action on this and at least first have a shaded version of guava.
>>> I was playing with a toy project and I did the procedure we use to
>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>> creating an uber jar and I realized that the size of the jar was way
>>> bigger than I expected, and the fact that we shade guava in every
>>> module contributes to this. I found guava shaded on:
>>>
>>> sdks/java/core
>>> runners/core-construction-java
>>> runners/core-java
>>> model/job-management
>>> runners/spark
>>> sdks/java/io/hadoop-file-system
>>> sdks/java/io/kafka
>>>
>>> This means at least 6 times more of the size it should which counts in
>>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>>> reduce by using a shaded version of the library.
>>>
>>> I add this point to the previous ones mentioned during the discussion
>>> because this goes against the end user experience and affects us all
>>> (devs/users).
>>>
>>> Another question is if we should shade (and how) protocol buffers
>>> because now with the portability work we are exposing it closer to the
>>> end users. I say this because I also found an issue while running a
>>> job on YARN with the spark runner because hadoop-common includes
>>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>>> a dependency to be able to use triggers (note the Spark runner
>>> translates them using some method from runners/core-java). Since
>>> hadoop-common is provided in the cluster with the older version of
>>> protobuf, I am afraid that this will bite us in the future.
>>>
>>> Ismaël
>>>
>>> ps. There is already a JIRA for that shading for protobuf on
>>> hadoop-common but this is not coming until version 3 is out.
>>> https://issues.apache.org/jira/browse/HADOOP-13136
>>>
>>> ps2. Extra curious situation is to see that the dataflow-runner ends
>>> up having guava shaded twice via its shaded version on
>>> core-construction-java.
>>>
>>> ps3. Of course this message means a de-facto +1 at least to do it for
>>> guava and evaluate it for other libs.
>>>
>>>
>>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid>
>>> wrote:
>>>
>>>> An issue to call out is how to deal with our generated code (.avro and
>>>> .proto) as I don't believe those plugins allow you to generate code
>>>> using a
>>>> shaded package prefix on imports.
>>>>
>>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tgroh@google.com.invalid
>>>> >
>>>> wrote:
>>>>
>>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>>>> every time for dependencies we know we'll use.
>>>>>
>>>>> This also means that if we end up pulling in transitive dependencies we
>>>>> don't want in any particular module we can avoid having to adjust our
>>>>> repackaging strategy for that module - which I have run into
>>>>> face-first in
>>>>> the past.
>>>>>
>>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles
>>>>> <kl...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>>
>>>>>> Shading is a big part of how we keep our dependencies sane in Beam.
>>>>>> But
>>>>>> downsides: shading is super slow, causes massive jar bloat, and kind
>>>>>> of
>>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>>
>>>>>> I know that some communities distribute their own shaded
>>>>>> distributions of
>>>>>> dependencies. I had a thought about doing something similar that I
>>>>>> wanted
>>>>>> to throw out there for people to poke holes in.
>>>>>>
>>>>>> To set the scene, here is how I view shading:
>>>>>>
>>>>>>   - A module has public dependencies and private dependencies.
>>>>>>   - Public deps are used for data interchange; users must share these
>>>>>>
>>>>> deps.
>>>>>
>>>>>>   - Private deps are just functionality and can be hidden (in our
>>>>>> case,
>>>>>> relocated + bundled)
>>>>>>   - It isn't necessarily that simple, because public and private deps
>>>>>>
>>>>> might
>>>>>
>>>>>> interact in higher-order ways ("public" is contagious)
>>>>>>
>>>>>> Shading is an implementation detail of expressing these
>>>>>> characteristics.
>>>>>>
>>>>> We
>>>>>
>>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>>
>>>>>> But what about this idea: Introduce shaded deps as a single separate
>>>>>> artifact.
>>>>>>
>>>>>>   - sdks/java/private-deps: bundled uber jar with relocated versions
>>>>>> of
>>>>>> everything we want to shade
>>>>>>
>>>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
>>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is
>>>>>> what
>>>>>> they
>>>>>> are rewritten to
>>>>>>
>>>>>> Some benefits
>>>>>>
>>>>>>   - much faster builds of other modules
>>>>>>   - only one shaded uber jar
>>>>>>   - rare/no rebuilds of the uber jar
>>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>>   - configuration all in one place
>>>>>>   - no automated rewriting of our real code, which has led to some
>>>>>> major
>>>>>> confusion
>>>>>>   - easy to implement incrementally
>>>>>>
>>>>>> Downsides:
>>>>>>
>>>>>>   - plenty of effort work to get there
>>>>>>   - unclear how many different such deps modules we need; sharing them
>>>>>>
>>>>> could
>>>>>
>>>>>> get weird
>>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>>
>>>>>> Just something I was musing as I spent another evening waiting for
>>>>>> slow
>>>>>> builds to try to confirm changes to brittle poms.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>
>>>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Lukasz Cwik <lc...@google.com>.
I would suggest that either we use:
1) A common deps package containing shaded dependencies allows for
Pros
* doesn't require the user to build an uber jar
Risks
* dependencies package will keep growing even if something is or isn't
needed by all of Apache Beam leading to a large jar anyways negating any
space savings

2) Shade within each module to a common location like
org.apache.beam.relocated.guava....
Pros
* you only get the shaded dependencies of the things that are required
* its one less dependency for users to manage
Risks
* requires an uber jar to be built to get the space savings (either by a
user or a distribution of Apache Beam) otherwise we negate any space
savings.

If we either use a common relocation scheme or a dependencies jar then each
relocation should specifically contain the version number of the package
because we would like to allow for us to be using two different versions of
the same library.

For the common deps package approach, should we check in code where the
imports contain the relocated location (e.g. import
org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?


On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Thanks for bringing that back.
>
> Indeed guava is shaded in different uber-jar. Maybe we can have a common
> deps module that we include once (but the user will have to explicitly
> define the dep) ?
>
> Basically, what do you propose for protobuf (unfortunately, I don't see an
> obvious) ?
>
> Regards
> JB
>
>
> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>
>> Hello, I wanted to bring back this subject because I think we should
>> take action on this and at least first have a shaded version of guava.
>> I was playing with a toy project and I did the procedure we use to
>> submit jars to a Hadoop cluster via Flink/Spark which involves
>> creating an uber jar and I realized that the size of the jar was way
>> bigger than I expected, and the fact that we shade guava in every
>> module contributes to this. I found guava shaded on:
>>
>> sdks/java/core
>> runners/core-construction-java
>> runners/core-java
>> model/job-management
>> runners/spark
>> sdks/java/io/hadoop-file-system
>> sdks/java/io/kafka
>>
>> This means at least 6 times more of the size it should which counts in
>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>> reduce by using a shaded version of the library.
>>
>> I add this point to the previous ones mentioned during the discussion
>> because this goes against the end user experience and affects us all
>> (devs/users).
>>
>> Another question is if we should shade (and how) protocol buffers
>> because now with the portability work we are exposing it closer to the
>> end users. I say this because I also found an issue while running a
>> job on YARN with the spark runner because hadoop-common includes
>> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
>> a dependency to be able to use triggers (note the Spark runner
>> translates them using some method from runners/core-java). Since
>> hadoop-common is provided in the cluster with the older version of
>> protobuf, I am afraid that this will bite us in the future.
>>
>> Ismaël
>>
>> ps. There is already a JIRA for that shading for protobuf on
>> hadoop-common but this is not coming until version 3 is out.
>> https://issues.apache.org/jira/browse/HADOOP-13136
>>
>> ps2. Extra curious situation is to see that the dataflow-runner ends
>> up having guava shaded twice via its shaded version on
>> core-construction-java.
>>
>> ps3. Of course this message means a de-facto +1 at least to do it for
>> guava and evaluate it for other libs.
>>
>>
>> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid>
>> wrote:
>>
>>> An issue to call out is how to deal with our generated code (.avro and
>>> .proto) as I don't believe those plugins allow you to generate code
>>> using a
>>> shaded package prefix on imports.
>>>
>>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid>
>>> wrote:
>>>
>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>>> every time for dependencies we know we'll use.
>>>>
>>>> This also means that if we end up pulling in transitive dependencies we
>>>> don't want in any particular module we can avoid having to adjust our
>>>> repackaging strategy for that module - which I have run into face-first
>>>> in
>>>> the past.
>>>>
>>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <klk@google.com.invalid
>>>> >
>>>> wrote:
>>>>
>>>> Hi all,
>>>>>
>>>>> Shading is a big part of how we keep our dependencies sane in Beam. But
>>>>> downsides: shading is super slow, causes massive jar bloat, and kind of
>>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>>
>>>>> I know that some communities distribute their own shaded distributions
>>>>> of
>>>>> dependencies. I had a thought about doing something similar that I
>>>>> wanted
>>>>> to throw out there for people to poke holes in.
>>>>>
>>>>> To set the scene, here is how I view shading:
>>>>>
>>>>>   - A module has public dependencies and private dependencies.
>>>>>   - Public deps are used for data interchange; users must share these
>>>>>
>>>> deps.
>>>>
>>>>>   - Private deps are just functionality and can be hidden (in our case,
>>>>> relocated + bundled)
>>>>>   - It isn't necessarily that simple, because public and private deps
>>>>>
>>>> might
>>>>
>>>>> interact in higher-order ways ("public" is contagious)
>>>>>
>>>>> Shading is an implementation detail of expressing these
>>>>> characteristics.
>>>>>
>>>> We
>>>>
>>>>> use shading selectively because of its downsides I mentioned above.
>>>>>
>>>>> But what about this idea: Introduce shaded deps as a single separate
>>>>> artifact.
>>>>>
>>>>>   - sdks/java/private-deps: bundled uber jar with relocated versions of
>>>>> everything we want to shade
>>>>>
>>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
>>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is what
>>>>> they
>>>>> are rewritten to
>>>>>
>>>>> Some benefits
>>>>>
>>>>>   - much faster builds of other modules
>>>>>   - only one shaded uber jar
>>>>>   - rare/no rebuilds of the uber jar
>>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>>   - configuration all in one place
>>>>>   - no automated rewriting of our real code, which has led to some
>>>>> major
>>>>> confusion
>>>>>   - easy to implement incrementally
>>>>>
>>>>> Downsides:
>>>>>
>>>>>   - plenty of effort work to get there
>>>>>   - unclear how many different such deps modules we need; sharing them
>>>>>
>>>> could
>>>>
>>>>> get weird
>>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>>
>>>>> Just something I was musing as I spent another evening waiting for slow
>>>>> builds to try to confirm changes to brittle poms.
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Thanks for bringing that back.

Indeed guava is shaded in different uber-jar. Maybe we can have a common deps 
module that we include once (but the user will have to explicitly define the dep) ?

Basically, what do you propose for protobuf (unfortunately, I don't see an 
obvious) ?

Regards
JB

On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
> Hello, I wanted to bring back this subject because I think we should
> take action on this and at least first have a shaded version of guava.
> I was playing with a toy project and I did the procedure we use to
> submit jars to a Hadoop cluster via Flink/Spark which involves
> creating an uber jar and I realized that the size of the jar was way
> bigger than I expected, and the fact that we shade guava in every
> module contributes to this. I found guava shaded on:
> 
> sdks/java/core
> runners/core-construction-java
> runners/core-java
> model/job-management
> runners/spark
> sdks/java/io/hadoop-file-system
> sdks/java/io/kafka
> 
> This means at least 6 times more of the size it should which counts in
> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
> reduce by using a shaded version of the library.
> 
> I add this point to the previous ones mentioned during the discussion
> because this goes against the end user experience and affects us all
> (devs/users).
> 
> Another question is if we should shade (and how) protocol buffers
> because now with the portability work we are exposing it closer to the
> end users. I say this because I also found an issue while running a
> job on YARN with the spark runner because hadoop-common includes
> protobuf-java 2 and I had to explicitly provide protocol-buffers 3 as
> a dependency to be able to use triggers (note the Spark runner
> translates them using some method from runners/core-java). Since
> hadoop-common is provided in the cluster with the older version of
> protobuf, I am afraid that this will bite us in the future.
> 
> Ismaël
> 
> ps. There is already a JIRA for that shading for protobuf on
> hadoop-common but this is not coming until version 3 is out.
> https://issues.apache.org/jira/browse/HADOOP-13136
> 
> ps2. Extra curious situation is to see that the dataflow-runner ends
> up having guava shaded twice via its shaded version on
> core-construction-java.
> 
> ps3. Of course this message means a de-facto +1 at least to do it for
> guava and evaluate it for other libs.
> 
> 
> On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
>> An issue to call out is how to deal with our generated code (.avro and
>> .proto) as I don't believe those plugins allow you to generate code using a
>> shaded package prefix on imports.
>>
>> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh <tg...@google.com.invalid>
>> wrote:
>>
>>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>>> every time for dependencies we know we'll use.
>>>
>>> This also means that if we end up pulling in transitive dependencies we
>>> don't want in any particular module we can avoid having to adjust our
>>> repackaging strategy for that module - which I have run into face-first in
>>> the past.
>>>
>>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles <kl...@google.com.invalid>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Shading is a big part of how we keep our dependencies sane in Beam. But
>>>> downsides: shading is super slow, causes massive jar bloat, and kind of
>>>> hard to get right because artifacts and namespaces are not 1-to-1.
>>>>
>>>> I know that some communities distribute their own shaded distributions of
>>>> dependencies. I had a thought about doing something similar that I wanted
>>>> to throw out there for people to poke holes in.
>>>>
>>>> To set the scene, here is how I view shading:
>>>>
>>>>   - A module has public dependencies and private dependencies.
>>>>   - Public deps are used for data interchange; users must share these
>>> deps.
>>>>   - Private deps are just functionality and can be hidden (in our case,
>>>> relocated + bundled)
>>>>   - It isn't necessarily that simple, because public and private deps
>>> might
>>>> interact in higher-order ways ("public" is contagious)
>>>>
>>>> Shading is an implementation detail of expressing these characteristics.
>>> We
>>>> use shading selectively because of its downsides I mentioned above.
>>>>
>>>> But what about this idea: Introduce shaded deps as a single separate
>>>> artifact.
>>>>
>>>>   - sdks/java/private-deps: bundled uber jar with relocated versions of
>>>> everything we want to shade
>>>>
>>>>   - sdks/java/core and sdks/java/harness: no relocation or bundling -
>>>> depends on `beam-sdks-java-private-deps` and imports like
>>>> `org.apache.beam.sdk.private.com.google.common` directly (this is what
>>>> they
>>>> are rewritten to
>>>>
>>>> Some benefits
>>>>
>>>>   - much faster builds of other modules
>>>>   - only one shaded uber jar
>>>>   - rare/no rebuilds of the uber jar
>>>>   - can use maven enforcer to forbid imports like com.google.common
>>>>   - configuration all in one place
>>>>   - no automated rewriting of our real code, which has led to some major
>>>> confusion
>>>>   - easy to implement incrementally
>>>>
>>>> Downsides:
>>>>
>>>>   - plenty of effort work to get there
>>>>   - unclear how many different such deps modules we need; sharing them
>>> could
>>>> get weird
>>>>   - if we hit a roadblock, we will have committed a lot of time
>>>>
>>>> Just something I was musing as I spent another evening waiting for slow
>>>> builds to try to confirm changes to brittle poms.
>>>>
>>>> Kenn
>>>>
>>>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com