Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2020/09/04 01:18:09 UTC

[DISCUSS] Move Avro dependency out of core Beam

Hi everyone,
The fact that core Beam has a dependency on Avro has led to a lot of
headaches when users (or runners) are using a different version. zeidoo [1]
was generous enough to put up a WIP PR [2] that moves everything that
depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe)
out of core Beam and into a separate extensions module. This way we could
have multiple extensions for different versions of Avro in the future.

As I understand it, the downsides to making this change are:
1) It's a breaking change: users with AvroCoder in their pipeline will need
to change their build dependencies and import statements (a brief sketch of
the change follows the list).
2) AvroCoder is the only (non-experimental) coder in core Beam that can
encode complex user types. So new users will need to dabble with the
Experimental SchemaCoder or add a second dependency to build a pipeline
with their own types.
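
To make (1) concrete, here is a minimal sketch of how AvroCoder is typically
used for a user type today; the post-move artifact and package shown in the
comments are hypothetical placeholders, since the final coordinates are not
settled:

// Today AvroCoder ships in beam-sdks-java-core.
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

// A user type encoded reflectively by AvroCoder.
@DefaultCoder(AvroCoder.class)
public class MyRecord {
  String name;
  long count;
}

// Or set the coder explicitly on a PCollection<MyRecord>:
// pc.setCoder(AvroCoder.of(MyRecord.class));

// After the move, only the build dependency and import would change, e.g.
// (hypothetical): beam-sdks-java-extensions-avro and
// import org.apache.beam.sdk.extensions.avro.coders.AvroCoder;

The pipeline code itself would not need to change beyond that.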

I think these costs are outweighed by the benefit of removing the
dependency in core Beam, but I wanted to reach out to the community to see
if there are any objections.

Brian

[1] github.com/zeidoo
[2] https://github.com/apache/beam/pull/12748

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Reuven Lax <re...@google.com>.
On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com> wrote:

>
>
> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>> this break if we just removed it?
>>
>
> I think Dataflow would just need to declare a dependency on the new
> extension.
>
>
>>
>> On Thu, Sep 3, 2020 at 10:51 PM Reuven Lax <re...@google.com> wrote:
>> >
>> > As for 2, maybe it's time to remove @Experimental from SchemaCoder?
>> >
>>
>
> Probably worth a separate thread about dropping `@Experimental` on
> SchemaCoder. I'd be ok with that, the only breaking change I have in mind
> is that I think we should deprecate and remove the DATETIME primitive type,
> replacing it with a logical type.
>
>
>> > 1 is tricky though. Changes like this have caused a lot of trouble for
>> users in the past, and I think some users still have unpleasant memories of
>> being told "you just have to change some package names and imports."
>> >
>>
>
> We could mitigate this by first adding the new extension module and
> deprecating the core Beam counterpart for a release (or multiple releases).
>

The last time this happened (when moving from the old Dataflow SDK to
Beam), users were still struggling over a year later to migrate. Many of
those users still have bad memories of this. While the Beam upgrade was a
little bit more than just "update packages," it turned out that updating
dependencies and packages was a pain for many users.


>
>
>> > On Thu, Sep 3, 2020 at 6:18 PM Brian Hulette <bh...@google.com>
>> wrote:
>> >>
>> >> Hi everyone,
>> >> The fact that core Beam has a dependency on Avro has led to a lot of
>> headaches when users (or runners) are using a different version. zeidoo [1]
>> was generous enough to put up a WIP PR [2] that moves everything that
>> depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe)
>> out of core Beam and into a separate extensions module. This way we could
>> have multiple extensions for different versions of Avro in the future.
>> >>
>> >> As I understand it, the downsides to making this change are:
>> >> 1) It's a breaking change, users with AvroCoder in their pipeline will
>> need to change their build dependencies and import statements.
>> >> 2) AvroCoder is the only (non-experimental) coder in core Beam that
>> can encode complex user types. So new users will need to dabble with the
>> Experimental SchemaCoder or add a second dependency to build a pipeline
>> with their own types.
>> >>
>> >> I think these costs are outweighed by the benefit of removing the
>> dependency in core Beam, but I wanted to reach out to the community to see
>> if there are any objections.
>> >>
>> >> Brian
>> >>
>> >> [1] github.com/zeidoo
>> >> [2] https://github.com/apache/beam/pull/12748
>>
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Robert Bradshaw <ro...@google.com>.
What if we introduced a core-lite package without Avro? (We could take
inventory and see if there are other dependencies we could/should make
optional as well.) The existing core module would remain the same, but it
would provide a way for users to use other Avro versions with Beam.

On Fri, Sep 11, 2020 at 10:28 AM Ismaël Mejía <ie...@gmail.com> wrote:

> Getting Avro out of core is a good idea in principle but it has
> consequences for users.
>
> The cleanest solution implies moving the packages as done in the PR
> but this will essentially break every existing user, so we should
> measure the impact on users and agree if it is worth to break SDK core
> backwards compatibility (or wait until Beam 3 to do this). Users tend
> to be really frustrated with such kind of breakage in particular if
> they don’t see a concrete benefit [1]. I thought for a moment that a
> solution could be to have the same packages in the extension but that
> won’t work because we will end up with broken packages/modules and
> that will get us issues with Java 11.
>
> There are two other ‘issues’ about the change:
>
> We MUST guarantee that such upgrade does not break users of the Spark
> runner. Spark leaks Avro 1.8.x by default on the recent versions 2.4.x
> / 3.x.x so Beam’s code that uses a different version of Avro should at
> least be source compatible between Avro versions or otherwise it will
> break users in that runner. In concrete terms this means that we
> should stay by default in Avro 1.8.2 and have the other modules to
> only provide the upgraded versions of the Avro dependencies, and it
> will be up to the users to provide the compatible versions they want
> to use.
>
> And for the concrete case of Confluent Schema Registry in KafkaIO this
> will imply that we need to find a way to make the Avro dependency
> aligned between core and KafkaIO otherwise users can have issues
> because of not available classes/methods in particular if users’s code
> depend on an unaligned version of Avro.
>
> I have to say I have mixed feelings. I am essentially pro removal from
> core because it should not have ever been there in the first place,
> but I am afraid of the impact vs the potential gains.
>
> [1]
> https://medium.com/@steve.yegge/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc
>
> On Fri, Sep 11, 2020 at 7:05 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > Top-post: I'm generally in favor of moving Avro out of core specifically
> because it is something where different users (and dep chains) want
> different versions. The pain caused by having it in core has come up a lot
> to me. I don't think backwards-compatibility absolutism helps our users in
> this case. I do think gradual migration to ease pain is important.
> >
> > On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com>
> wrote:
> >>>
> >>>
> >>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>>
> >>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
> >>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
> >>>> this break if we just removed it?
> >>>
> >>>
> >>> I think Dataflow would just need to declare a dependency on the new
> extension.
> >>
> >>
> >> I'm not sure this would solve the underlying problem (it just pushes it
> onto users and makes it more obscure). Maybe my reasoning is incorrect, but
> from what I see
> >>
> >> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
> parquet, ...) depend on Avro.
> >> * Using Avro 1.9 with the above modules doesn't work.
> >
> >
> > I suggest taking these on case-by-case.
> >
> >  - Dataflow: implementation detail, probably not a major problem (we can
> just upgrade the pre-portability worker while for portability it is a
> non-issue)
> >  - Spark: probably need to use whatever version of Avro works for each
> version of Spark (portability mitigates)
> >  - SQL: happy to upgrade lib version, just needs to be able to read the
> data, Avro version not user-facing
> >  - IOs: I'm guessing that we have a diamond dep getting resolved by
> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
> Avro serde is a separate thing distributed by Confluent with Avro version
> obfuscated by use of parent poms and properties, but their examples use
> Avro 1.9.1.
> >
> >> Doesn't this mean that, even if we remove avro from Beam core, a user
> that uses Beam + Avro 1.9 will have issues with any of the above (fairly
> fundamental) modules?
> >>
> >>>  We could mitigate this by first adding the new extension module and
> deprecating the core Beam counterpart for a release (or multiple releases).
> >>
> >>
> >> +1 to Reuven's concerns here.
> >
> >
> > Agree we should add the module and release it for at least one release,
> probably a few because users tend to hop a few releases. We have some
> precedent for breaking changes with the Python/Flink version dropping after
> asking users on user@ and polling on Twitter, etc.
> >
> > Kenn
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Getting Avro out of core is a good idea in principle but it has
consequences for users.

The cleanest solution implies moving the packages as done in the PR,
but this will essentially break every existing user, so we should
measure the impact on users and agree whether it is worth breaking SDK
core backwards compatibility (or waiting until Beam 3 to do this). Users
tend to be really frustrated by this kind of breakage, in particular if
they don't see a concrete benefit [1]. I thought for a moment that a
solution could be to keep the same packages in the extension, but that
won't work because we would end up with the same packages split across
modules, and that will cause issues with Java 11 (the module system does
not allow split packages).

There are two other ‘issues’ with the change:

We MUST guarantee that such an upgrade does not break users of the Spark
runner. Spark leaks Avro 1.8.x by default in its recent versions 2.4.x
/ 3.x.x, so Beam's code that uses a different version of Avro should at
least be source compatible across Avro versions, otherwise it will
break users on that runner. In concrete terms this means that we
should stay on Avro 1.8.2 by default, have only the other modules
provide the upgraded versions of the Avro dependencies, and leave it
up to the users to provide the compatible versions they want to use.
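
(To illustrate the source-compatibility point with a small sketch. It
assumes, from memory and to be verified against both versions, that the
Jackson-based Schema.Field#defaultValue() accessor was removed in Avro 1.9
while the Object-returning defaultVal() exists in both 1.8 and 1.9.)

import org.apache.avro.Schema;

class FieldDefaults {
  // Compiles and runs against both Avro 1.8 and 1.9.
  static Object readDefault(Schema.Field field) {
    return field.defaultVal();
  }

  // Not source compatible: in Avro 1.8, field.defaultValue() returns a
  // Jackson 1.x JsonNode, and the method is gone in Avro 1.9 (assumed).
}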

And for the concrete case of Confluent Schema Registry in KafkaIO, this
implies that we need to find a way to keep the Avro dependency
aligned between core and KafkaIO, otherwise users can run into issues
with unavailable classes/methods, in particular if their code
depends on an unaligned version of Avro.

I have to say I have mixed feelings. I am essentially pro removal from
core because it should never have been there in the first place,
but I am afraid of the impact vs the potential gains.

[1] https://medium.com/@steve.yegge/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc

On Fri, Sep 11, 2020 at 7:05 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> Top-post: I'm generally in favor of moving Avro out of core specifically because it is something where different users (and dep chains) want different versions. The pain caused by having it in core has come up a lot to me. I don't think backwards-compatibility absolutism helps our users in this case. I do think gradual migration to ease pain is important.
>
> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com> wrote:
>>>
>>>
>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>
>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>>> this break if we just removed it?
>>>
>>>
>>> I think Dataflow would just need to declare a dependency on the new extension.
>>
>>
>> I'm not sure this would solve the underlying problem (it just pushes it onto users and makes it more obscure). Maybe my reasoning is incorrect, but from what I see
>>
>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka, parquet, ...) depend on Avro.
>> * Using Avro 1.9 with the above modules doesn't work.
>
>
> I suggest taking these on case-by-case.
>
>  - Dataflow: implementation detail, probably not a major problem (we can just upgrade the pre-portability worker while for portability it is a non-issue)
>  - Spark: probably need to use whatever version of Avro works for each version of Spark (portability mitigates)
>  - SQL: happy to upgrade lib version, just needs to be able to read the data, Avro version not user-facing
>  - IOs: I'm guessing that we have a diamond dep getting resolved by clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's Avro serde is a separate thing distributed by Confluent with Avro version obfuscated by use of parent poms and properties, but their examples use Avro 1.9.1.
>
>> Doesn't this mean that, even if we remove avro from Beam core, a user that uses Beam + Avro 1.9 will have issues with any of the above (fairly fundamental) modules?
>>
>>>  We could mitigate this by first adding the new extension module and deprecating the core Beam counterpart for a release (or multiple releases).
>>
>>
>> +1 to Reuven's concerns here.
>
>
> Agree we should add the module and release it for at least one release, probably a few because users tend to hop a few releases. We have some precedent for breaking changes with the Python/Flink version dropping after asking users on user@ and polling on Twitter, etc.
>
> Kenn

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Tyson Hamilton <ty...@google.com>.
On Mon, Sep 21, 2020 at 7:56 AM Cristian Constantinescu <ze...@gmail.com>
wrote:

> All the proposed solutions seem reasonable. I'm not sure if one has an
> edge over the other. I guess it depends on how cautiously the community
> would like to move.
>
> Maybe it's just my impression, but it seems to me that there are a few
> changes that are held back for the sake of backwards compatibility. If this
> is truly the case, maybe we can start thinking about the next major version
> of Beam where:
> - the dependency report email (Beam Dependency Check Report (2020-09-21))
> can be tackled. There are quite a few deps in there that need to be
> updated. I'm sure there will be more breaking changes.
>

A lot of these dependency updates are being tackled by the folks at Polidea,
working on non-GCP-related dependencies, and someone from Google will pick
up BEAM-8963 (GCP-related deps) this quarter.


> - review some architectural decisions that are difficult to correct
> without breaking things (eg: Avro in Core)
> - compatibility with Java 11+
>

Dataflow will be compatible with Java 11 by the end of 2020. There are some
other runners that need major version upgrades before being able to support
Java 11 though (e.g. Spark), so community support here would be fantastic.


> - features that we can't implement with our current code base
>
> I'm not sure what Beam's roadmap is, but maybe we could set up a branch in
> the main repo (and the checks) and try to tackle all this work so we get a
> better idea of the scope (and unforeseen issues that will come forward)
> that's really needed for a potential RC build in the short term future.
>
> On Wed, Sep 16, 2020 at 6:40 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> An adapter seems a reasonable approach, and shouldn't be too hard.
>>
>> If the breakage is "we no longer provide Avro 1.8 by default; please
>> depend on it explicitly if this breaks you" that seems reasonable to me, as
>> it's easy to detect and remedy.
>>
>> On Tue, Sep 15, 2020 at 2:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> Avro differences in our implementation are pretty minimal if you look at
>>> the PR,
>>> to the point that an Adapter should be really tiny if even needed.
>>>
>>> The big backwards incompatible changes in Avro > 1.8 were to remove
>>> external
>>> libraries from the public APIs e.g. guava, jackson and joda-time. Of
>>> course this
>>> does not seem to be much but almost every project using Avro was using
>>> some of
>>> these dependencies, luckily for Beam it was only joda-time and that is
>>> already
>>> fixed.
>>>
>>> Keeping backwards compatibility by making Avro part of an extension that
>>> is
>>> optional for core and using only Avro 1.8 compatible features on Beam's
>>> source
>>> code is the simplest path, and allow us to avoid all the issues, notice
>>> that the
>>> dependency that triggered the need for Avro 1.9 (and this thread) is a
>>> runtime
>>> dependency used by Confluent Schema Registry and this is an issue because
>>> sdks-java-core is leaking Avro. Apart from this I am not aware of any
>>> feature in
>>> any other project that obliges anyone to use Avro 1.9 or 1.10 specific
>>> code.
>>>
>>> Of course a really valid reason to want to use a more recent version of
>>> Avro is
>>> that Avro 1.8 leaks also unmaintained dependencies with serious security
>>> issues
>>> (Jackson 1.x).
>>>
>>> So in the end my main concern is breaking existing users code, this has
>>> less
>>> impact for us (Talend) but probably more for the rest of the community,
>>> but if
>>> we agree to break backwards compatibility for the sake of cleanliness
>>> well we
>>> should probably proceed, and of course give users also some warning.
>>>
>>> On Mon, Sep 14, 2020 at 7:13 PM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > In the Kafka module we reflectively figure out which version of Kafka
>>> exists[1] on the classpath and then reflectively invoke some APIs to work
>>> around differences in Kafka allowing our users to bring whichever version
>>> they want.
>>> >
>>> > We could do something similar here and reflectively figure out which
>>> Avro is on the classpath and invoke the appropriate methods. If we are
>>> worried about performance of using reflection, we can write and compile two
>>> different versions of an Avro adapter class and choose which one to use
>>> (using reflection only once during class loading).
>>> >
>>> > e.g.
>>> > AvroAdapter {
>>> >   static final AvroAdapter INSTANCE;
>>> >   static {
>>> >     if (avro19?) {
>>> >       INSTANCE = new Avro19Adapater();
>>> >     } else {
>>> >       INSTANCE = new Avro18Adapter();
>>> >   }
>>> >
>>> >   ... methods needed for AvroAdapter implementations ...
>>> > }
>>> >
>>> > Using reflection allows the user to choose which version they use and
>>> pushes down the incompatibility issue from Apache Beam to our deps (e.g.
>>> Spark).
>>> >
>>> > 1:
>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
>>> >
>>> > On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >>
>>> >> I am not deep on the details myself but have reviewed various Avro
>>> upgrade changes such as https://github.com/apache/beam/pull/9779 and
>>> also some internal that I cannot link to. I believe the changes are small
>>> and quite possibly we can create sdks/java/extensions/avro that works with
>>> both Avro 1.8 and 1.9 and make Dataflow worker compatible with whatever the
>>> user chooses. (I would expect Spark is trying to get to that point too?)
>>> >>
>>> >> So then if we have that can we achieve the goals? Spark runner users
>>> that do not use Avro in their own code get Spark's version, Spark runner
>>> users that do use Avro have to choose anyhow, and we make Dataflow worker
>>> work with 1.8 and 1.9.
>>> >>
>>> >> We can probably achieve the same goals by just making the core
>>> compatible with both 1.8 and 1.9. Users who don't want the dep can exclude
>>> it, too. It doesn't have a bunch of transitive deps so there isn't a lot of
>>> value in trying to exclude it though. So core vs extensions is more of a
>>> clean engineering thing, but having compatibility with 1.9 is a user-driven
>>> thing.
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ie...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> > The concern here is that Avro 1.9 is not backwards compatible with
>>> Avro 1.8, so would the future world would not be a simple "bring your own
>>> avro" but might require separate dataflow-with-avro-1.8 and
>>> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
>>> mistaken here? Maybe we could solve this with vending?)
>>> >>>
>>> >>> Thinking a bit about it looks similar to what I mentioned with Spark
>>> >>> runner save that we cannot control those targets so for that reason I
>>> >>> talked about source code compatibility.
>>> >>> Avro is really hard to shade correctly because of the way the code
>>> >>> generation works, otherwise it could have been the best solution.
>>> >>>
>>> >>> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>> >>> >
>>> >>> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >>> >>
>>> >>> >> Top-post: I'm generally in favor of moving Avro out of core
>>> specifically because it is something where different users (and dep chains)
>>> want different versions. The pain caused by having it in core has come up a
>>> lot to me. I don't think backwards-compatibility absolutism helps our users
>>> in this case. I do think gradual migration to ease pain is important.
>>> >>> >
>>> >>> >
>>> >>> > Agree. Backwards compatibility is not the absolute goal; whatever
>>> is best for existing and new users is what we should go for. That being
>>> said, this whole issue is caused by one of our dependencies not being
>>> backwards compatible itself...
>>> >>> >
>>> >>> >>
>>> >>> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>> >>>
>>> >>> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <
>>> bhulette@google.com> wrote:
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>> >>>>>
>>> >>> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro
>>> to write
>>> >>> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks).
>>> Would
>>> >>> >>>>> this break if we just removed it?
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> I think Dataflow would just need to declare a dependency on the
>>> new extension.
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> I'm not sure this would solve the underlying problem (it just
>>> pushes it onto users and makes it more obscure). Maybe my reasoning is
>>> incorrect, but from what I see
>>> >>> >>>
>>> >>> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql,
>>> kafka, parquet, ...) depend on Avro.
>>> >>> >>> * Using Avro 1.9 with the above modules doesn't work.
>>> >>> >>
>>> >>> >>
>>> >>> >> I suggest taking these on case-by-case.
>>> >>> >>
>>> >>> >>  - Dataflow: implementation detail, probably not a major problem
>>> (we can just upgrade the pre-portability worker while for portability it is
>>> a non-issue)
>>> >>> >>  - Spark: probably need to use whatever version of Avro works for
>>> each version of Spark (portability mitigates)
>>> >>> >>  - SQL: happy to upgrade lib version, just needs to be able to
>>> read the data, Avro version not user-facing
>>> >>> >>  - IOs: I'm guessing that we have a diamond dep getting resolved
>>> by clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
>>> Avro serde is a separate thing distributed by Confluent with Avro version
>>> obfuscated by use of parent poms and properties, but their examples use
>>> Avro 1.9.1.
>>> >>> >
>>> >>> >
>>> >>> > The concern here is that Avro 1.9 is not backwards compatible with
>>> Avro 1.8, so would the future world would not be a simple "bring your own
>>> avro" but might require separate dataflow-with-avro-1.8 and
>>> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
>>> mistaken here? Maybe we could solve this with vending?)
>>> >>> >
>>> >>> >>> Doesn't this mean that, even if we remove avro from Beam core, a
>>> user that uses Beam + Avro 1.9 will have issues with any of the above
>>> (fairly fundamental) modules?
>>> >>> >>>
>>> >>> >>>>  We could mitigate this by first adding the new extension
>>> module and deprecating the core Beam counterpart for a release (or multiple
>>> releases).
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> +1 to Reuven's concerns here.
>>> >>> >>
>>> >>> >>
>>> >>> >> Agree we should add the module and release it for at least one
>>> release, probably a few because users tend to hop a few releases. We have
>>> some precedent for breaking changes with the Python/Flink version dropping
>>> after asking users on user@ and polling on Twitter, etc.
>>> >>> >>
>>> >>> >> Kenn
>>>
>>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Brian Hulette <bh...@google.com>.
It sounds like there's a consensus around making Beam core work with either
Avro 1.8 or 1.9. It looks like +Tomo Suzuki <su...@google.com> actually
accomplished this already back in January for BEAM-9144 [1] and a user has
run pipelines on Dataflow with Avro 1.9.

Would we need to do anything else to unblock using Confluent Schema
Registry 5.5 (BEAM-9330 [2])?

[1] https://issues.apache.org/jira/browse/BEAM-9144
[2] https://issues.apache.org/jira/browse/BEAM-9330

On Tue, Oct 13, 2020 at 8:34 AM Piotr Szuberski <pi...@polidea.com>
wrote:

> We are working on IOs dependencies, most of them are pending in PRs right
> now.
>
> Any further thoughts/decisions on moving Avro out of Beam core?
>
> On 2020/09/21 14:56:16, Cristian Constantinescu <ze...@gmail.com> wrote:
> > All the proposed solutions seem reasonable. I'm not sure if one has an
> edge
> > over the other. I guess it depends on how cautiously the community would
> > like to move.
> >
> > Maybe it's just my impression, but it seems to me that there are a few
> > changes that are held back for the sake of backwards compatibility. If
> this
> > is truly the case, maybe we can start thinking about the next major
> version
> > of Beam where:
> > - the dependency report email (Beam Dependency Check Report (2020-09-21))
> > can be tackled. There are quite a few deps in there that need to be
> > updated. I'm sure there will be more breaking changes.
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Piotr Szuberski <pi...@polidea.com>.
We are working on IOs dependencies, most of them are pending in PRs right now.

Any further thoughts/decisions on moving Avro out of Beam core?

On 2020/09/21 14:56:16, Cristian Constantinescu <ze...@gmail.com> wrote: 
> All the proposed solutions seem reasonable. I'm not sure if one has an edge
> over the other. I guess it depends on how cautiously the community would
> like to move.
> 
> Maybe it's just my impression, but it seems to me that there are a few
> changes that are held back for the sake of backwards compatibility. If this
> is truly the case, maybe we can start thinking about the next major version
> of Beam where:
> - the dependency report email (Beam Dependency Check Report (2020-09-21))
> can be tackled. There are quite a few deps in there that need to be
> updated. I'm sure there will be more breaking changes.

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Cristian Constantinescu <ze...@gmail.com>.
All the proposed solutions seem reasonable. I'm not sure if one has an edge
over the other. I guess it depends on how cautiously the community would
like to move.

Maybe it's just my impression, but it seems to me that there are a few
changes that are held back for the sake of backwards compatibility. If this
is truly the case, maybe we can start thinking about the next major version
of Beam where:
- the dependency report email (Beam Dependency Check Report (2020-09-21))
can be tackled. There are quite a few deps in there that need to be
updated. I'm sure there will be more breaking changes.
- review some architectural decisions that are difficult to correct without
breaking things (e.g. Avro in core)
- compatibility with Java 11+
- features that we can't implement with our current code base

I'm not sure what Beam's roadmap is, but maybe we could set up a branch in
the main repo (and the checks) and try to tackle all this work, so we get a
better idea of the scope (and of the unforeseen issues that will surface)
that is really needed for a potential RC build in the near future.

On Wed, Sep 16, 2020 at 6:40 AM Robert Bradshaw <ro...@google.com> wrote:

> An adapter seems a reasonable approach, and shouldn't be too hard.
>
> If the breakage is "we no longer provide Avro 1.8 by default; please
> depend on it explicitly if this breaks you" that seems reasonable to me, as
> it's easy to detect and remedy.
>
> On Tue, Sep 15, 2020 at 2:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> Avro differences in our implementation are pretty minimal if you look at
>> the PR,
>> to the point that an Adapter should be really tiny if even needed.
>>
>> The big backwards incompatible changes in Avro > 1.8 were to remove
>> external
>> libraries from the public APIs e.g. guava, jackson and joda-time. Of
>> course this
>> does not seem to be much but almost every project using Avro was using
>> some of
>> these dependencies, luckily for Beam it was only joda-time and that is
>> already
>> fixed.
>>
>> Keeping backwards compatibility by making Avro part of an extension that
>> is
>> optional for core and using only Avro 1.8 compatible features on Beam's
>> source
>> code is the simplest path, and allow us to avoid all the issues, notice
>> that the
>> dependency that triggered the need for Avro 1.9 (and this thread) is a
>> runtime
>> dependency used by Confluent Schema Registry and this is an issue because
>> sdks-java-core is leaking Avro. Apart from this I am not aware of any
>> feature in
>> any other project that obliges anyone to use Avro 1.9 or 1.10 specific
>> code.
>>
>> Of course a really valid reason to want to use a more recent version of
>> Avro is
>> that Avro 1.8 leaks also unmaintained dependencies with serious security
>> issues
>> (Jackson 1.x).
>>
>> So in the end my main concern is breaking existing users code, this has
>> less
>> impact for us (Talend) but probably more for the rest of the community,
>> but if
>> we agree to break backwards compatibility for the sake of cleanliness
>> well we
>> should probably proceed, and of course give users also some warning.
>>
>> On Mon, Sep 14, 2020 at 7:13 PM Luke Cwik <lc...@google.com> wrote:
>> >
>> > In the Kafka module we reflectively figure out which version of Kafka
>> exists[1] on the classpath and then reflectively invoke some APIs to work
>> around differences in Kafka allowing our users to bring whichever version
>> they want.
>> >
>> > We could do something similar here and reflectively figure out which
>> Avro is on the classpath and invoke the appropriate methods. If we are
>> worried about performance of using reflection, we can write and compile two
>> different versions of an Avro adapter class and choose which one to use
>> (using reflection only once during class loading).
>> >
>> > e.g.
>> > AvroAdapter {
>> >   static final AvroAdapter INSTANCE;
>> >   static {
>> >     if (avro19?) {
>> >       INSTANCE = new Avro19Adapater();
>> >     } else {
>> >       INSTANCE = new Avro18Adapter();
>> >   }
>> >
>> >   ... methods needed for AvroAdapter implementations ...
>> > }
>> >
>> > Using reflection allows the user to choose which version they use and
>> pushes down the incompatibility issue from Apache Beam to our deps (e.g.
>> Spark).
>> >
>> > 1:
>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
>> >
>> > On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >>
>> >> I am not deep on the details myself but have reviewed various Avro
>> upgrade changes such as https://github.com/apache/beam/pull/9779 and
>> also some internal that I cannot link to. I believe the changes are small
>> and quite possibly we can create sdks/java/extensions/avro that works with
>> both Avro 1.8 and 1.9 and make Dataflow worker compatible with whatever the
>> user chooses. (I would expect Spark is trying to get to that point too?)
>> >>
>> >> So then if we have that can we achieve the goals? Spark runner users
>> that do not use Avro in their own code get Spark's version, Spark runner
>> users that do use Avro have to choose anyhow, and we make Dataflow worker
>> work with 1.8 and 1.9.
>> >>
>> >> We can probably achieve the same goals by just making the core
>> compatible with both 1.8 and 1.9. Users who don't want the dep can exclude
>> it, too. It doesn't have a bunch of transitive deps so there isn't a lot of
>> value in trying to exclude it though. So core vs extensions is more of a
>> clean engineering thing, but having compatibility with 1.9 is a user-driven
>> thing.
>> >>
>> >> Kenn
>> >>
>> >> On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ie...@gmail.com>
>> wrote:
>> >>>
>> >>> > The concern here is that Avro 1.9 is not backwards compatible with
>> Avro 1.8, so would the future world would not be a simple "bring your own
>> avro" but might require separate dataflow-with-avro-1.8 and
>> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
>> mistaken here? Maybe we could solve this with vending?)
>> >>>
>> >>> Thinking a bit about it looks similar to what I mentioned with Spark
>> >>> runner save that we cannot control those targets so for that reason I
>> >>> talked about source code compatibility.
>> >>> Avro is really hard to shade correctly because of the way the code
>> >>> generation works, otherwise it could have been the best solution.
>> >>>
>> >>> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>> >
>> >>> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >>> >>
>> >>> >> Top-post: I'm generally in favor of moving Avro out of core
>> specifically because it is something where different users (and dep chains)
>> want different versions. The pain caused by having it in core has come up a
>> lot to me. I don't think backwards-compatibility absolutism helps our users
>> in this case. I do think gradual migration to ease pain is important.
>> >>> >
>> >>> >
>> >>> > Agree. Backwards compatibility is not the absolute goal; whatever
>> is best for existing and new users is what we should go for. That being
>> said, this whole issue is caused by one of our dependencies not being
>> backwards compatible itself...
>> >>> >
>> >>> >>
>> >>> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >>> >>>
>> >>> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <
>> bhulette@google.com> wrote:
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >>> >>>>>
>> >>> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to
>> write
>> >>> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks).
>> Would
>> >>> >>>>> this break if we just removed it?
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> I think Dataflow would just need to declare a dependency on the
>> new extension.
>> >>> >>>
>> >>> >>>
>> >>> >>> I'm not sure this would solve the underlying problem (it just
>> pushes it onto users and makes it more obscure). Maybe my reasoning is
>> incorrect, but from what I see
>> >>> >>>
>> >>> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql,
>> kafka, parquet, ...) depend on Avro.
>> >>> >>> * Using Avro 1.9 with the above modules doesn't work.
>> >>> >>
>> >>> >>
>> >>> >> I suggest taking these on case-by-case.
>> >>> >>
>> >>> >>  - Dataflow: implementation detail, probably not a major problem
>> (we can just upgrade the pre-portability worker while for portability it is
>> a non-issue)
>> >>> >>  - Spark: probably need to use whatever version of Avro works for
>> each version of Spark (portability mitigates)
>> >>> >>  - SQL: happy to upgrade lib version, just needs to be able to
>> read the data, Avro version not user-facing
>> >>> >>  - IOs: I'm guessing that we have a diamond dep getting resolved
>> by clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
>> Avro serde is a separate thing distributed by Confluent with Avro version
>> obfuscated by use of parent poms and properties, but their examples use
>> Avro 1.9.1.
>> >>> >
>> >>> >
>> >>> > The concern here is that Avro 1.9 is not backwards compatible with
>> Avro 1.8, so would the future world would not be a simple "bring your own
>> avro" but might require separate dataflow-with-avro-1.8 and
>> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
>> mistaken here? Maybe we could solve this with vending?)
>> >>> >
>> >>> >>> Doesn't this mean that, even if we remove avro from Beam core, a
>> user that uses Beam + Avro 1.9 will have issues with any of the above
>> (fairly fundamental) modules?
>> >>> >>>
>> >>> >>>>  We could mitigate this by first adding the new extension module
>> and deprecating the core Beam counterpart for a release (or multiple
>> releases).
>> >>> >>>
>> >>> >>>
>> >>> >>> +1 to Reuven's concerns here.
>> >>> >>
>> >>> >>
>> >>> >> Agree we should add the module and release it for at least one
>> release, probably a few because users tend to hop a few releases. We have
>> some precedent for breaking changes with the Python/Flink version dropping
>> after asking users on user@ and polling on Twitter, etc.
>> >>> >>
>> >>> >> Kenn
>>
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Robert Bradshaw <ro...@google.com>.
An adapter seems a reasonable approach, and shouldn't be too hard.

If the breakage is "we no longer provide Avro 1.8 by default; please depend
on it explicitly if this breaks you" that seems reasonable to me, as it's
easy to detect and remedy.

On Tue, Sep 15, 2020 at 2:42 PM Ismaël Mejía <ie...@gmail.com> wrote:

> Avro differences in our implementation are pretty minimal if you look at
> the PR,
> to the point that an Adapter should be really tiny if even needed.
>
> The big backwards incompatible changes in Avro > 1.8 were to remove
> external
> libraries from the public APIs e.g. guava, jackson and joda-time. Of
> course this
> does not seem to be much but almost every project using Avro was using
> some of
> these dependencies, luckily for Beam it was only joda-time and that is
> already
> fixed.
>
> Keeping backwards compatibility by making Avro part of an extension that is
> optional for core and using only Avro 1.8 compatible features on Beam's
> source
> code is the simplest path, and allow us to avoid all the issues, notice
> that the
> dependency that triggered the need for Avro 1.9 (and this thread) is a
> runtime
> dependency used by Confluent Schema Registry and this is an issue because
> sdks-java-core is leaking Avro. Apart from this I am not aware of any
> feature in
> any other project that obliges anyone to use Avro 1.9 or 1.10 specific
> code.
>
> Of course a really valid reason to want to use a more recent version of
> Avro is
> that Avro 1.8 leaks also unmaintained dependencies with serious security
> issues
> (Jackson 1.x).
>
> So in the end my main concern is breaking existing users code, this has
> less
> impact for us (Talend) but probably more for the rest of the community,
> but if
> we agree to break backwards compatibility for the sake of cleanliness well
> we
> should probably proceed, and of course give users also some warning.
>
> On Mon, Sep 14, 2020 at 7:13 PM Luke Cwik <lc...@google.com> wrote:
> >
> > In the Kafka module we reflectively figure out which version of Kafka
> exists[1] on the classpath and then reflectively invoke some APIs to work
> around differences in Kafka allowing our users to bring whichever version
> they want.
> >
> > We could do something similar here and reflectively figure out which
> Avro is on the classpath and invoke the appropriate methods. If we are
> worried about performance of using reflection, we can write and compile two
> different versions of an Avro adapter class and choose which one to use
> (using reflection only once during class loading).
> >
> > e.g.
> > AvroAdapter {
> >   static final AvroAdapter INSTANCE;
> >   static {
> >     if (avro19?) {
> >       INSTANCE = new Avro19Adapater();
> >     } else {
> >       INSTANCE = new Avro18Adapter();
> >   }
> >
> >   ... methods needed for AvroAdapter implementations ...
> > }
> >
> > Using reflection allows the user to choose which version they use and
> pushes down the incompatibility issue from Apache Beam to our deps (e.g.
> Spark).
> >
> > 1:
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
> >
> > On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>
> >> I am not deep on the details myself but have reviewed various Avro
> upgrade changes such as https://github.com/apache/beam/pull/9779 and also
> some internal that I cannot link to. I believe the changes are small and
> quite possibly we can create sdks/java/extensions/avro that works with both
> Avro 1.8 and 1.9 and make Dataflow worker compatible with whatever the user
> chooses. (I would expect Spark is trying to get to that point too?)
> >>
> >> So then if we have that can we achieve the goals? Spark runner users
> that do not use Avro in their own code get Spark's version, Spark runner
> users that do use Avro have to choose anyhow, and we make Dataflow worker
> work with 1.8 and 1.9.
> >>
> >> We can probably achieve the same goals by just making the core
> compatible with both 1.8 and 1.9. Users who don't want the dep can exclude
> it, too. It doesn't have a bunch of transitive deps so there isn't a lot of
> value in trying to exclude it though. So core vs extensions is more of a
> clean engineering thing, but having compatibility with 1.9 is a user-driven
> thing.
> >>
> >> Kenn
> >>
> >> On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ie...@gmail.com>
> wrote:
> >>>
> >>> > The concern here is that Avro 1.9 is not backwards compatible with
> Avro 1.8, so would the future world would not be a simple "bring your own
> avro" but might require separate dataflow-with-avro-1.8 and
> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
> mistaken here? Maybe we could solve this with vending?)
> >>>
> >>> Thinking a bit about it looks similar to what I mentioned with Spark
> >>> runner save that we cannot control those targets so for that reason I
> >>> talked about source code compatibility.
> >>> Avro is really hard to shade correctly because of the way the code
> >>> generation works, otherwise it could have been the best solution.
> >>>
> >>> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>> >
> >>> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>> >>
> >>> >> Top-post: I'm generally in favor of moving Avro out of core
> specifically because it is something where different users (and dep chains)
> want different versions. The pain caused by having it in core has come up a
> lot to me. I don't think backwards-compatibility absolutism helps our users
> in this case. I do think gradual migration to ease pain is important.
> >>> >
> >>> >
> >>> > Agree. Backwards compatibility is not the absolute goal; whatever is
> best for existing and new users is what we should go for. That being said,
> this whole issue is caused by one of our dependencies not being backwards
> compatible itself...
> >>> >
> >>> >>
> >>> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >>> >>>
> >>> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com>
> wrote:
> >>> >>>>
> >>> >>>>
> >>> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >>> >>>>>
> >>> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to
> write
> >>> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks).
> Would
> >>> >>>>> this break if we just removed it?
> >>> >>>>
> >>> >>>>
> >>> >>>> I think Dataflow would just need to declare a dependency on the
> new extension.
> >>> >>>
> >>> >>>
> >>> >>> I'm not sure this would solve the underlying problem (it just
> pushes it onto users and makes it more obscure). Maybe my reasoning is
> incorrect, but from what I see
> >>> >>>
> >>> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql,
> kafka, parquet, ...) depend on Avro.
> >>> >>> * Using Avro 1.9 with the above modules doesn't work.
> >>> >>
> >>> >>
> >>> >> I suggest taking these on case-by-case.
> >>> >>
> >>> >>  - Dataflow: implementation detail, probably not a major problem
> (we can just upgrade the pre-portability worker while for portability it is
> a non-issue)
> >>> >>  - Spark: probably need to use whatever version of Avro works for
> each version of Spark (portability mitigates)
> >>> >>  - SQL: happy to upgrade lib version, just needs to be able to read
> the data, Avro version not user-facing
> >>> >>  - IOs: I'm guessing that we have a diamond dep getting resolved by
> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
> Avro serde is a separate thing distributed by Confluent with Avro version
> obfuscated by use of parent poms and properties, but their examples use
> Avro 1.9.1.
> >>> >
> >>> >
> >>> > The concern here is that Avro 1.9 is not backwards compatible with
> Avro 1.8, so would the future world would not be a simple "bring your own
> avro" but might require separate dataflow-with-avro-1.8 and
> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
> mistaken here? Maybe we could solve this with vending?)
> >>> >
> >>> >>> Doesn't this mean that, even if we remove avro from Beam core, a
> user that uses Beam + Avro 1.9 will have issues with any of the above
> (fairly fundamental) modules?
> >>> >>>
> >>> >>>>  We could mitigate this by first adding the new extension module
> and deprecating the core Beam counterpart for a release (or multiple
> releases).
> >>> >>>
> >>> >>>
> >>> >>> +1 to Reuven's concerns here.
> >>> >>
> >>> >>
> >>> >> Agree we should add the module and release it for at least one
> release, probably a few because users tend to hop a few releases. We have
> some precedent for breaking changes with the Python/Flink version dropping
> after asking users on user@ and polling on Twitter, etc.
> >>> >>
> >>> >> Kenn
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Avro differences in our implementation are pretty minimal if you look at the PR,
to the point that an Adapter should be really tiny if even needed.

The big backwards-incompatible changes in Avro > 1.8 were to remove external
libraries from the public APIs, e.g. Guava, Jackson and Joda-Time. Of course this
does not seem like much, but almost every project using Avro was using some of
these dependencies; luckily for Beam it was only Joda-Time, and that is already
fixed.

Keeping backwards compatibility by making Avro part of an extension that is
optional for core, and using only Avro 1.8-compatible features in Beam's source
code, is the simplest path and allows us to avoid all these issues. Notice that the
dependency that triggered the need for Avro 1.9 (and this thread) is a runtime
dependency used by Confluent Schema Registry, and this is an issue because
sdks-java-core is leaking Avro. Apart from this I am not aware of any feature in
any other project that obliges anyone to use Avro 1.9- or 1.10-specific code.
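
(Just to illustrate the leak with existing core API, not a change proposal:
org.apache.avro types appear directly in beam-sdks-java-core's public
surface, so whatever Avro version core compiles against is forced on users.)

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;

class AvroLeakExample {
  // Avro's Schema type is part of the signature of this core coder factory,
  // so the Avro version core was built against leaks into user code.
  static AvroCoder<GenericRecord> coderFor(String schemaJson) {
    Schema avroSchema = new Schema.Parser().parse(schemaJson);
    return AvroCoder.of(avroSchema);
  }
}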

Of course, a really valid reason to want to use a more recent version of Avro is
that Avro 1.8 also leaks unmaintained dependencies with serious security issues
(Jackson 1.x).

So in the end my main concern is breaking existing users' code. This has less
impact for us (Talend) but probably more for the rest of the community, but if
we agree to break backwards compatibility for the sake of cleanliness then we
should probably proceed, and of course also give users some warning.

On Mon, Sep 14, 2020 at 7:13 PM Luke Cwik <lc...@google.com> wrote:
>
> In the Kafka module we reflectively figure out which version of Kafka exists[1] on the classpath and then reflectively invoke some APIs to work around differences in Kafka allowing our users to bring whichever version they want.
>
> We could do something similar here and reflectively figure out which Avro is on the classpath and invoke the appropriate methods. If we are worried about performance of using reflection, we can write and compile two different versions of an Avro adapter class and choose which one to use (using reflection only once during class loading).
>
> e.g.
> AvroAdapter {
>   static final AvroAdapter INSTANCE;
>   static {
>     if (avro19?) {
>       INSTANCE = new Avro19Adapater();
>     } else {
>       INSTANCE = new Avro18Adapter();
>   }
>
>   ... methods needed for AvroAdapter implementations ...
> }
>
> Using reflection allows the user to choose which version they use and pushes down the incompatibility issue from Apache Beam to our deps (e.g. Spark).
>
> 1: https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
>
> On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>> I am not deep on the details myself but have reviewed various Avro upgrade changes such as https://github.com/apache/beam/pull/9779 and also some internal that I cannot link to. I believe the changes are small and quite possibly we can create sdks/java/extensions/avro that works with both Avro 1.8 and 1.9 and make Dataflow worker compatible with whatever the user chooses. (I would expect Spark is trying to get to that point too?)
>>
>> So then if we have that can we achieve the goals? Spark runner users that do not use Avro in their own code get Spark's version, Spark runner users that do use Avro have to choose anyhow, and we make Dataflow worker work with 1.8 and 1.9.
>>
>> We can probably achieve the same goals by just making the core compatible with both 1.8 and 1.9. Users who don't want the dep can exclude it, too. It doesn't have a bunch of transitive deps so there isn't a lot of value in trying to exclude it though. So core vs extensions is more of a clean engineering thing, but having compatibility with 1.9 is a user-driven thing.
>>
>> Kenn
>>
>> On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>
>>> > The concern here is that Avro 1.9 is not backwards compatible with Avro 1.8, so would the future world would not be a simple "bring your own avro" but might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I mistaken here? Maybe we could solve this with vending?)
>>>
>>> Thinking a bit about it looks similar to what I mentioned with Spark
>>> runner save that we cannot control those targets so for that reason I
>>> talked about source code compatibility.
>>> Avro is really hard to shade correctly because of the way the code
>>> generation works, otherwise it could have been the best solution.
>>>
>>> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com> wrote:
>>> >
>>> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org> wrote:
>>> >>
>>> >> Top-post: I'm generally in favor of moving Avro out of core specifically because it is something where different users (and dep chains) want different versions. The pain caused by having it in core has come up a lot to me. I don't think backwards-compatibility absolutism helps our users in this case. I do think gradual migration to ease pain is important.
>>> >
>>> >
>>> > Agree. Backwards compatibility is not the absolute goal; whatever is best for existing and new users is what we should go for. That being said, this whole issue is caused by one of our dependencies not being backwards compatible itself...
>>> >
>>> >>
>>> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com> wrote:
>>> >>>
>>> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com> wrote:
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com> wrote:
>>> >>>>>
>>> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>> >>>>> this break if we just removed it?
>>> >>>>
>>> >>>>
>>> >>>> I think Dataflow would just need to declare a dependency on the new extension.
>>> >>>
>>> >>>
>>> >>> I'm not sure this would solve the underlying problem (it just pushes it onto users and makes it more obscure). Maybe my reasoning is incorrect, but from what I see
>>> >>>
>>> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka, parquet, ...) depend on Avro.
>>> >>> * Using Avro 1.9 with the above modules doesn't work.
>>> >>
>>> >>
>>> >> I suggest taking these on case-by-case.
>>> >>
>>> >>  - Dataflow: implementation detail, probably not a major problem (we can just upgrade the pre-portability worker while for portability it is a non-issue)
>>> >>  - Spark: probably need to use whatever version of Avro works for each version of Spark (portability mitigates)
>>> >>  - SQL: happy to upgrade lib version, just needs to be able to read the data, Avro version not user-facing
>>> >>  - IOs: I'm guessing that we have a diamond dep getting resolved by clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's Avro serde is a separate thing distributed by Confluent with Avro version obfuscated by use of parent poms and properties, but their examples use Avro 1.9.1.
>>> >
>>> >
>>> > The concern here is that Avro 1.9 is not backwards compatible with Avro 1.8, so would the future world would not be a simple "bring your own avro" but might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I mistaken here? Maybe we could solve this with vending?)
>>> >
>>> >>> Doesn't this mean that, even if we remove avro from Beam core, a user that uses Beam + Avro 1.9 will have issues with any of the above (fairly fundamental) modules?
>>> >>>
>>> >>>>  We could mitigate this by first adding the new extension module and deprecating the core Beam counterpart for a release (or multiple releases).
>>> >>>
>>> >>>
>>> >>> +1 to Reuven's concerns here.
>>> >>
>>> >>
>>> >> Agree we should add the module and release it for at least one release, probably a few because users tend to hop a few releases. We have some precedent for breaking changes with the Python/Flink version dropping after asking users on user@ and polling on Twitter, etc.
>>> >>
>>> >> Kenn

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Luke Cwik <lc...@google.com>.
In the Kafka module we reflectively figure out which version of Kafka
exists[1] on the classpath and then reflectively invoke some APIs to work
around differences in Kafka, allowing our users to bring whichever version
they want.

We could do something similar here and reflectively figure out which Avro
is on the classpath and invoke the appropriate methods. If we are worried
about performance of using reflection, we can write and compile two
different versions of an Avro adapter class and choose which one to use
(using reflection only once during class loading).

e.g.
abstract class AvroAdapter {
  static final AvroAdapter INSTANCE;
  static {
    if (isAvro19OnClasspath()) { // placeholder: detect which Avro is on the classpath
      INSTANCE = new Avro19Adapter();
    } else {
      INSTANCE = new Avro18Adapter();
    }
  }

  // ... methods needed for AvroAdapter implementations ...
}

Using reflection allows the user to choose which version they use and
pushes down the incompatibility issue from Apache Beam to our deps (e.g.
Spark).

1:
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
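
(A minimal sketch of the one-time classpath check used by the adapter above.
The marker relies on an assumption that should be verified: that the
Jackson-based Schema.Field#defaultValue() accessor exists in Avro 1.8 but was
removed in 1.9; a real check would pick whatever marker turns out to be
reliable across the versions we care about.)

import org.apache.avro.Schema;

final class AvroVersionDetector {
  // True if the Avro on the classpath looks like 1.9+.
  static boolean isAvro19OnClasspath() {
    try {
      // Present in Avro 1.8, removed in 1.9 (assumed; verify).
      Schema.Field.class.getMethod("defaultValue");
      return false; // the 1.8-era API is still there
    } catch (NoSuchMethodException e) {
      return true; // the 1.8-era API is gone, assume 1.9+
    }
  }
}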

On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles <ke...@apache.org> wrote:

> I am not deep on the details myself but have reviewed various Avro upgrade
> changes such as https://github.com/apache/beam/pull/9779 and also some
> internal that I cannot link to. I believe the changes are small and quite
> possibly we can create sdks/java/extensions/avro that works with both Avro
> 1.8 and 1.9 and make Dataflow worker compatible with whatever the user
> chooses. (I would expect Spark is trying to get to that point too?)
>
> So then if we have that can we achieve the goals? Spark runner users that
> do not use Avro in their own code get Spark's version, Spark runner users
> that do use Avro have to choose anyhow, and we make Dataflow worker work
> with 1.8 and 1.9.
>
> We can probably achieve the same goals by just making the core compatible
> with both 1.8 and 1.9. Users who don't want the dep can exclude it, too. It
> doesn't have a bunch of transitive deps so there isn't a lot of value in
> trying to exclude it though. So core vs extensions is more of a clean
> engineering thing, but having compatibility with 1.9 is a user-driven thing.
>
> Kenn
>
> On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> > The concern here is that Avro 1.9 is not backwards compatible with Avro
>> 1.8, so would the future world would not be a simple "bring your own avro"
>> but might require separate dataflow-with-avro-1.8 and
>> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
>> mistaken here? Maybe we could solve this with vending?)
>>
>> Thinking a bit about it looks similar to what I mentioned with Spark
>> runner save that we cannot control those targets so for that reason I
>> talked about source code compatibility.
>> Avro is really hard to shade correctly because of the way the code
>> generation works, otherwise it could have been the best solution.
>>
>> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >
>> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >>
>> >> Top-post: I'm generally in favor of moving Avro out of core
>> specifically because it is something where different users (and dep chains)
>> want different versions. The pain caused by having it in core has come up a
>> lot to me. I don't think backwards-compatibility absolutism helps our users
>> in this case. I do think gradual migration to ease pain is important.
>> >
>> >
>> > Agree. Backwards compatibility is not the absolute goal; whatever is
>> best for existing and new users is what we should go for. That being said,
>> this whole issue is caused by one of our dependencies not being backwards
>> compatible itself...
>> >
>> >>
>> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>>
>> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com>
>> wrote:
>> >>>>
>> >>>>
>> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>>>>
>> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to
>> write
>> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>> >>>>> this break if we just removed it?
>> >>>>
>> >>>>
>> >>>> I think Dataflow would just need to declare a dependency on the new
>> extension.
>> >>>
>> >>>
>> >>> I'm not sure this would solve the underlying problem (it just pushes
>> it onto users and makes it more obscure). Maybe my reasoning is incorrect,
>> but from what I see
>> >>>
>> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
>> parquet, ...) depend on Avro.
>> >>> * Using Avro 1.9 with the above modules doesn't work.
>> >>
>> >>
>> >> I suggest taking these on case-by-case.
>> >>
>> >>  - Dataflow: implementation detail, probably not a major problem (we
>> can just upgrade the pre-portability worker while for portability it is a
>> non-issue)
>> >>  - Spark: probably need to use whatever version of Avro works for each
>> version of Spark (portability mitigates)
>> >>  - SQL: happy to upgrade lib version, just needs to be able to read
>> the data, Avro version not user-facing
>> >>  - IOs: I'm guessing that we have a diamond dep getting resolved by
>> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
>> Avro serde is a separate thing distributed by Confluent with Avro version
>> obfuscated by use of parent poms and properties, but their examples use
>> Avro 1.9.1.
>> >
>> >
>> > The concern here is that Avro 1.9 is not backwards compatible with Avro
>> 1.8, so would the future world would not be a simple "bring your own avro"
>> but might require separate dataflow-with-avro-1.8 and
>> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
>> mistaken here? Maybe we could solve this with vending?)
>> >
>> >>> Doesn't this mean that, even if we remove avro from Beam core, a user
>> that uses Beam + Avro 1.9 will have issues with any of the above (fairly
>> fundamental) modules?
>> >>>
>> >>>>  We could mitigate this by first adding the new extension module and
>> deprecating the core Beam counterpart for a release (or multiple releases).
>> >>>
>> >>>
>> >>> +1 to Reuven's concerns here.
>> >>
>> >>
>> >> Agree we should add the module and release it for at least one
>> release, probably a few because users tend to hop a few releases. We have
>> some precedent for breaking changes with the Python/Flink version dropping
>> after asking users on user@ and polling on Twitter, etc.
>> >>
>> >> Kenn
>>
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Kenneth Knowles <ke...@apache.org>.
I am not deep on the details myself but have reviewed various Avro upgrade
changes such as https://github.com/apache/beam/pull/9779 and also some
internal ones that I cannot link to. I believe the changes are small, and
quite possibly we can create sdks/java/extensions/avro that works with both
Avro 1.8 and 1.9 and make the Dataflow worker compatible with whichever
version the user chooses. (I would expect Spark is trying to get to that
point too?)

So then, if we have that, can we achieve the goals? Spark runner users that
do not use Avro in their own code get Spark's version, Spark runner users
that do use Avro have to choose anyhow, and we make the Dataflow worker work
with both 1.8 and 1.9.

We can probably achieve the same goals by just making the core compatible
with both 1.8 and 1.9. Users who don't want the dep can exclude it, too,
though it doesn't have a bunch of transitive deps, so there isn't a lot of
value in trying to exclude it. So core vs. extensions is more of a
clean-engineering thing, but compatibility with 1.9 is a user-driven thing.
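
For completeness, "exclude it" on the user side would look roughly like the
following in a Gradle build. The coordinates are the usual
org.apache.beam:beam-sdks-java-core and org.apache.avro:avro ones; the
version placeholders and the choice of Avro 1.9.x are only illustrative,
and of course this only helps once core actually works against whichever
Avro the user supplies.

dependencies {
  implementation("org.apache.beam:beam-sdks-java-core:${beamVersion}") {
    // Drop the Avro 1.8 that core currently pulls in ...
    exclude group: "org.apache.avro", module: "avro"
  }
  // ... and bring your own.
  implementation "org.apache.avro:avro:1.9.2"
}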

Kenn

On Fri, Sep 11, 2020 at 10:49 AM Ismaël Mejía <ie...@gmail.com> wrote:

> > The concern here is that Avro 1.9 is not backwards compatible with Avro
> 1.8, so would the future world would not be a simple "bring your own avro"
> but might require separate dataflow-with-avro-1.8 and
> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
> mistaken here? Maybe we could solve this with vending?)
>
> Thinking a bit about it looks similar to what I mentioned with Spark
> runner save that we cannot control those targets so for that reason I
> talked about source code compatibility.
> Avro is really hard to shade correctly because of the way the code
> generation works, otherwise it could have been the best solution.
>
> On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >
> > On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>
> >> Top-post: I'm generally in favor of moving Avro out of core
> specifically because it is something where different users (and dep chains)
> want different versions. The pain caused by having it in core has come up a
> lot to me. I don't think backwards-compatibility absolutism helps our users
> in this case. I do think gradual migration to ease pain is important.
> >
> >
> > Agree. Backwards compatibility is not the absolute goal; whatever is
> best for existing and new users is what we should go for. That being said,
> this whole issue is caused by one of our dependencies not being backwards
> compatible itself...
> >
> >>
> >> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>
> >>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com>
> wrote:
> >>>>
> >>>>
> >>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>>>
> >>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
> >>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
> >>>>> this break if we just removed it?
> >>>>
> >>>>
> >>>> I think Dataflow would just need to declare a dependency on the new
> extension.
> >>>
> >>>
> >>> I'm not sure this would solve the underlying problem (it just pushes
> it onto users and makes it more obscure). Maybe my reasoning is incorrect,
> but from what I see
> >>>
> >>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
> parquet, ...) depend on Avro.
> >>> * Using Avro 1.9 with the above modules doesn't work.
> >>
> >>
> >> I suggest taking these on case-by-case.
> >>
> >>  - Dataflow: implementation detail, probably not a major problem (we
> can just upgrade the pre-portability worker while for portability it is a
> non-issue)
> >>  - Spark: probably need to use whatever version of Avro works for each
> version of Spark (portability mitigates)
> >>  - SQL: happy to upgrade lib version, just needs to be able to read the
> data, Avro version not user-facing
> >>  - IOs: I'm guessing that we have a diamond dep getting resolved by
> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
> Avro serde is a separate thing distributed by Confluent with Avro version
> obfuscated by use of parent poms and properties, but their examples use
> Avro 1.9.1.
> >
> >
> > The concern here is that Avro 1.9 is not backwards compatible with Avro
> 1.8, so would the future world would not be a simple "bring your own avro"
> but might require separate dataflow-with-avro-1.8 and
> dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I
> mistaken here? Maybe we could solve this with vending?)
> >
> >>> Doesn't this mean that, even if we remove avro from Beam core, a user
> that uses Beam + Avro 1.9 will have issues with any of the above (fairly
> fundamental) modules?
> >>>
> >>>>  We could mitigate this by first adding the new extension module and
> deprecating the core Beam counterpart for a release (or multiple releases).
> >>>
> >>>
> >>> +1 to Reuven's concerns here.
> >>
> >>
> >> Agree we should add the module and release it for at least one release,
> probably a few because users tend to hop a few releases. We have some
> precedent for breaking changes with the Python/Flink version dropping after
> asking users on user@ and polling on Twitter, etc.
> >>
> >> Kenn
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
> The concern here is that Avro 1.9 is not backwards compatible with Avro 1.8, so would the future world would not be a simple "bring your own avro" but might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I mistaken here? Maybe we could solve this with vending?)

Thinking about it a bit, this looks similar to what I mentioned with the
Spark runner, except that we cannot control those targets, which is why I
talked about source code compatibility.
Avro is really hard to shade correctly because of the way its code
generation works; otherwise shading could have been the best solution.

On Fri, Sep 11, 2020 at 7:28 PM Robert Bradshaw <ro...@google.com> wrote:
>
> On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>> Top-post: I'm generally in favor of moving Avro out of core specifically because it is something where different users (and dep chains) want different versions. The pain caused by having it in core has come up a lot to me. I don't think backwards-compatibility absolutism helps our users in this case. I do think gradual migration to ease pain is important.
>
>
> Agree. Backwards compatibility is not the absolute goal; whatever is best for existing and new users is what we should go for. That being said, this whole issue is caused by one of our dependencies not being backwards compatible itself...
>
>>
>> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com> wrote:
>>>
>>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com> wrote:
>>>>
>>>>
>>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>
>>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>>>> this break if we just removed it?
>>>>
>>>>
>>>> I think Dataflow would just need to declare a dependency on the new extension.
>>>
>>>
>>> I'm not sure this would solve the underlying problem (it just pushes it onto users and makes it more obscure). Maybe my reasoning is incorrect, but from what I see
>>>
>>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka, parquet, ...) depend on Avro.
>>> * Using Avro 1.9 with the above modules doesn't work.
>>
>>
>> I suggest taking these on case-by-case.
>>
>>  - Dataflow: implementation detail, probably not a major problem (we can just upgrade the pre-portability worker while for portability it is a non-issue)
>>  - Spark: probably need to use whatever version of Avro works for each version of Spark (portability mitigates)
>>  - SQL: happy to upgrade lib version, just needs to be able to read the data, Avro version not user-facing
>>  - IOs: I'm guessing that we have a diamond dep getting resolved by clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's Avro serde is a separate thing distributed by Confluent with Avro version obfuscated by use of parent poms and properties, but their examples use Avro 1.9.1.
>
>
> The concern here is that Avro 1.9 is not backwards compatible with Avro 1.8, so would the future world would not be a simple "bring your own avro" but might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9 targets which certainly isn't scalable. (Or am I mistaken here? Maybe we could solve this with vending?)
>
>>> Doesn't this mean that, even if we remove avro from Beam core, a user that uses Beam + Avro 1.9 will have issues with any of the above (fairly fundamental) modules?
>>>
>>>>  We could mitigate this by first adding the new extension module and deprecating the core Beam counterpart for a release (or multiple releases).
>>>
>>>
>>> +1 to Reuven's concerns here.
>>
>>
>> Agree we should add the module and release it for at least one release, probably a few because users tend to hop a few releases. We have some precedent for breaking changes with the Python/Flink version dropping after asking users on user@ and polling on Twitter, etc.
>>
>> Kenn

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Robert Bradshaw <ro...@google.com>.
On Fri, Sep 11, 2020 at 10:05 AM Kenneth Knowles <ke...@apache.org> wrote:

> Top-post: I'm generally in favor of moving Avro out of core specifically
> because it is something where different users (and dep chains) want
> different versions. The pain caused by having it in core has come up a lot
> to me. I don't think backwards-compatibility absolutism helps our users in
> this case. I do think gradual migration to ease pain is important.
>

Agree. Backwards compatibility is not the absolute goal; whatever is best
for existing and new users is what we should go for. That being said, this
whole issue is caused by one of our dependencies not being backwards
compatible itself...


> On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com>
>> wrote:
>>
>>>
>>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>>> this break if we just removed it?
>>>>
>>>
>>> I think Dataflow would just need to declare a dependency on the new
>>> extension.
>>>
>>
>> I'm not sure this would solve the underlying problem (it just pushes it
>> onto users and makes it more obscure). Maybe my reasoning is incorrect, but
>> from what I see
>>
>> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
>> parquet, ...) depend on Avro.
>> * Using Avro 1.9 with the above modules doesn't work.
>>
>
> I suggest taking these on case-by-case.
>
>  - Dataflow: implementation detail, probably not a major problem (we can
> just upgrade the pre-portability worker while for portability it is a
> non-issue)
>  - Spark: probably need to use whatever version of Avro works for each
> version of Spark (portability mitigates)
>  - SQL: happy to upgrade lib version, just needs to be able to read the
> data, Avro version not user-facing
>  - IOs: I'm guessing that we have a diamond dep getting resolved by
> clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
> Avro serde is a separate thing distributed by Confluent with Avro version
> obfuscated by use of parent poms and properties, but their examples use
> Avro 1.9.1.
>

The concern here is that Avro 1.9 is not backwards compatible with Avro
1.8, so the future world would not be a simple "bring your own avro" but
might require separate dataflow-with-avro-1.8 and dataflow-with-avro-1.9
targets, which certainly isn't scalable. (Or am I mistaken here? Maybe we
could solve this with vendoring?)

>> Doesn't this mean that, even if we remove avro from Beam core, a user that
>> uses Beam + Avro 1.9 will have issues with any of the above (fairly
>> fundamental) modules?
>>
>>> We could mitigate this by first adding the new extension module and
>>> deprecating the core Beam counterpart for a release (or multiple releases).
>>>
>>
>> +1 to Reuven's concerns here.
>>
>
> Agree we should add the module and release it for at least one release,
> probably a few because users tend to hop a few releases. We have some
> precedent for breaking changes with the Python/Flink version dropping after
> asking users on user@ and polling on Twitter, etc.
>
> Kenn
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Kenneth Knowles <ke...@apache.org>.
Top-post: I'm generally in favor of moving Avro out of core specifically
because it is something where different users (and dep chains) want
different versions. The pain caused by having it in core has come up a lot
to me. I don't think backwards-compatibility absolutism helps our users in
this case. I do think gradual migration to ease pain is important.

On Fri, Sep 11, 2020 at 9:30 AM Robert Bradshaw <ro...@google.com> wrote:

> On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com> wrote:
>
>>
>> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>>> this break if we just removed it?
>>>
>>
>> I think Dataflow would just need to declare a dependency on the new
>> extension.
>>
>
> I'm not sure this would solve the underlying problem (it just pushes it
> onto users and makes it more obscure). Maybe my reasoning is incorrect, but
> from what I see
>
> * Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
> parquet, ...) depend on Avro.
> * Using Avro 1.9 with the above modules doesn't work.
>

I suggest taking these on case-by-case.

 - Dataflow: implementation detail, probably not a major problem (we can
just upgrade the pre-portability worker while for portability it is a
non-issue)
 - Spark: probably need to use whatever version of Avro works for each
version of Spark (portability mitigates)
 - SQL: happy to upgrade lib version, just needs to be able to read the
data, Avro version not user-facing
 - IOs: I'm guessing that we have a diamond dep getting resolved by
clobbering. A quick glance seems like Parquet is on avro 1.10.0, Kafka's
Avro serde is a separate thing distributed by Confluent with Avro version
obfuscated by use of parent poms and properties, but their examples use
Avro 1.9.1.

> Doesn't this mean that, even if we remove avro from Beam core, a user that
> uses Beam + Avro 1.9 will have issues with any of the above (fairly
> fundamental) modules?
>
>> We could mitigate this by first adding the new extension module and
>> deprecating the core Beam counterpart for a release (or multiple releases).
>>
>
> +1 to Reuven's concerns here.
>

Agree we should add the module and release it for at least one release,
probably a few because users tend to hop a few releases. We have some
precedent for breaking changes with the Python/Flink version dropping after
asking users on user@ and polling on Twitter, etc.

Kenn

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Robert Bradshaw <ro...@google.com>.
On Thu, Sep 10, 2020 at 2:48 PM Brian Hulette <bh...@google.com> wrote:

>
> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>> this break if we just removed it?
>>
>
> I think Dataflow would just need to declare a dependency on the new
> extension.
>

I'm not sure this would solve the underlying problem (it just pushes it
onto users and makes it more obscure). Maybe my reasoning is incorrect, but
from what I see

* Many Beam modules (e.g. dataflow, spark, file-based-io, sql, kafka,
parquet, ...) depend on Avro.
* Using Avro 1.9 with the above modules doesn't work.

Doesn't this mean that, even if we remove avro from Beam core, a user that
uses Beam + Avro 1.9 will have issues with any of the above (fairly
fundamental) modules?

> We could mitigate this by first adding the new extension module and
> deprecating the core Beam counterpart for a release (or multiple releases).
>

+1 to Reuven's concerns here.

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Cristian Constantinescu <ze...@gmail.com>.
Hi everyone,

PR https://github.com/apache/beam/pull/12748 now passes all the checks, and
could potentially be merged (not advocating this, just saying). I've
rebased on the latest master as of today. I've also left a comment in the
PR with the high-level changes for ALL the modules. I encourage all the
interested parties to skim through that and raise any concerns they might
have.

Also note that while I'm pretty good at refactoring things, Java isn't my
strongest language. Please keep that in mind as you review the code changes.

That being said, my main goal is to get Beam to play nicely with the new
Confluent schema libraries that add support for Protobuf and JSON schemas.
But the Confluent libs depend on Avro 1.9, and Beam is on 1.8. Upgrading
Beam to use Avro 1.9 has proven difficult (see
https://github.com/apache/beam/pull/9779), hence the proposal to take Avro
out of core.

If you have any concerns or any particular tests I should run, please let
me know.

Thank you!

On Fri, Sep 11, 2020 at 5:48 AM Brian Hulette <bh...@google.com> wrote:

>
>
> On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
>> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
>> this break if we just removed it?
>>
>
> I think Dataflow would just need to declare a dependency on the new
> extension.
>
>
>>
>> On Thu, Sep 3, 2020 at 10:51 PM Reuven Lax <re...@google.com> wrote:
>> >
>> > As for 2, maybe it's time to remove @Experimental from SchemaCoder?
>> >
>>
>
> Probably worth a separate thread about dropping `@Experimental` on
> SchemaCoder. I'd be ok with that, the only breaking change I have in mind
> is that I think we should deprecate and remove the DATETIME primitive type,
> replacing it with a logical type.
>
>
>> > 1 is tricky though. Changes like this have caused a lot of trouble for
>> users in the past, and I think some users still have unpleasant memories of
>> being told "you just have to change some package names and imports."
>> >
>>
>
> We could mitigate this by first adding the new extension module and
> deprecating the core Beam counterpart for a release (or multiple releases).
>
>
>> > On Thu, Sep 3, 2020 at 6:18 PM Brian Hulette <bh...@google.com>
>> wrote:
>> >>
>> >> Hi everyone,
>> >> The fact that core Beam has a dependency on Avro has led to a lot of
>> headaches when users (or runners) are using a different version. zeidoo [1]
>> was generous enough to put up a WIP PR [2] that moves everything that
>> depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe)
>> out of core Beam and into a separate extensions module. This way we could
>> have multiple extensions for different versions of Avro in the future.
>> >>
>> >> As I understand it, the downsides to making this change are:
>> >> 1) It's a breaking change, users with AvroCoder in their pipeline will
>> need to change their build dependencies and import statements.
>> >> 2) AvroCoder is the only (non-experimental) coder in core Beam that
>> can encode complex user types. So new users will need to dabble with the
>> Experimental SchemaCoder or add a second dependency to build a pipeline
>> with their own types.
>> >>
>> >> I think these costs are outweighed by the benefit of removing the
>> dependency in core Beam, but I wanted to reach out to the community to see
>> if there are any objections.
>> >>
>> >> Brian
>> >>
>> >> [1] github.com/zeidoo
>> >> [2] https://github.com/apache/beam/pull/12748
>>
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Brian Hulette <bh...@google.com>.
On Tue, Sep 8, 2020 at 9:18 AM Robert Bradshaw <ro...@google.com> wrote:

> IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
> out intermediate files (e.g. for non-shuffle Fusion breaks). Would
> this break if we just removed it?
>

I think Dataflow would just need to declare a dependency on the new
extension.


>
> On Thu, Sep 3, 2020 at 10:51 PM Reuven Lax <re...@google.com> wrote:
> >
> > As for 2, maybe it's time to remove @Experimental from SchemaCoder?
> >
>

Probably worth a separate thread about dropping `@Experimental` on
SchemaCoder. I'd be ok with that, the only breaking change I have in mind
is that I think we should deprecate and remove the DATETIME primitive type,
replacing it with a logical type.


> > 1 is tricky though. Changes like this have caused a lot of trouble for
> users in the past, and I think some users still have unpleasant memories of
> being told "you just have to change some package names and imports."
> >
>

We could mitigate this by first adding the new extension module and
deprecating the core Beam counterpart for a release (or multiple releases).


> > On Thu, Sep 3, 2020 at 6:18 PM Brian Hulette <bh...@google.com>
> wrote:
> >>
> >> Hi everyone,
> >> The fact that core Beam has a dependency on Avro has led to a lot of
> headaches when users (or runners) are using a different version. zeidoo [1]
> was generous enough to put up a WIP PR [2] that moves everything that
> depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe)
> out of core Beam and into a separate extensions module. This way we could
> have multiple extensions for different versions of Avro in the future.
> >>
> >> As I understand it, the downsides to making this change are:
> >> 1) It's a breaking change, users with AvroCoder in their pipeline will
> need to change their build dependencies and import statements.
> >> 2) AvroCoder is the only (non-experimental) coder in core Beam that can
> encode complex user types. So new users will need to dabble with the
> Experimental SchemaCoder or add a second dependency to build a pipeline
> with their own types.
> >>
> >> I think these costs are outweighed by the benefit of removing the
> dependency in core Beam, but I wanted to reach out to the community to see
> if there are any objections.
> >>
> >> Brian
> >>
> >> [1] github.com/zeidoo
> >> [2] https://github.com/apache/beam/pull/12748
>

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Robert Bradshaw <ro...@google.com>.
IIRC Dataflow (and perhaps others) implicitly depend on Avro to write
out intermediate files (e.g. for non-shuffle Fusion breaks). Would
this break if we just removed it?

On Thu, Sep 3, 2020 at 10:51 PM Reuven Lax <re...@google.com> wrote:
>
> As for 2, maybe it's time to remove @Experimental from SchemaCoder?
>
> 1 is tricky though. Changes like this have caused a lot of trouble for users in the past, and I think some users still have unpleasant memories of being told "you just have to change some package names and imports."
>
> On Thu, Sep 3, 2020 at 6:18 PM Brian Hulette <bh...@google.com> wrote:
>>
>> Hi everyone,
>> The fact that core Beam has a dependency on Avro has led to a lot of headaches when users (or runners) are using a different version. zeidoo [1] was generous enough to put up a WIP PR [2] that moves everything that depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe) out of core Beam and into a separate extensions module. This way we could have multiple extensions for different versions of Avro in the future.
>>
>> As I understand it, the downsides to making this change are:
>> 1) It's a breaking change, users with AvroCoder in their pipeline will need to change their build dependencies and import statements.
>> 2) AvroCoder is the only (non-experimental) coder in core Beam that can encode complex user types. So new users will need to dabble with the Experimental SchemaCoder or add a second dependency to build a pipeline with their own types.
>>
>> I think these costs are outweighed by the benefit of removing the dependency in core Beam, but I wanted to reach out to the community to see if there are any objections.
>>
>> Brian
>>
>> [1] github.com/zeidoo
>> [2] https://github.com/apache/beam/pull/12748

Re: [DISCUSS] Move Avro dependency out of core Beam

Posted by Reuven Lax <re...@google.com>.
As for 2, maybe it's time to remove @Experimental from SchemaCoder?

1 is tricky though. Changes like this have caused a lot of trouble for
users in the past, and I think some users still have unpleasant memories of
being told "you just have to change some package names and imports."

On Thu, Sep 3, 2020 at 6:18 PM Brian Hulette <bh...@google.com> wrote:

> Hi everyone,
> The fact that core Beam has a dependency on Avro has led to a lot of
> headaches when users (or runners) are using a different version. zeidoo [1]
> was generous enough to put up a WIP PR [2] that moves everything that
> depends on Avro (primarily AvroCoder and the Avro SchemaProvider I believe)
> out of core Beam and into a separate extensions module. This way we could
> have multiple extensions for different versions of Avro in the future.
>
> As I understand it, the downsides to making this change are:
> 1) It's a breaking change, users with AvroCoder in their pipeline will
> need to change their build dependencies and import statements.
> 2) AvroCoder is the only (non-experimental) coder in core Beam that can
> encode complex user types. So new users will need to dabble with the
> Experimental SchemaCoder or add a second dependency to build a pipeline
> with their own types.
>
> I think these costs are outweighed by the benefit of removing the
> dependency in core Beam, but I wanted to reach out to the community to see
> if there are any objections.
>
> Brian
>
> [1] github.com/zeidoo
> [2] https://github.com/apache/beam/pull/12748
>