Posted to dev@beam.apache.org by Robbe Sneyders <ro...@ml6.eu> on 2019/04/02 12:12:25 UTC

Re: Deprecating Avro for fastavro on Python 3

Hi all,

Thank you for the feedback. Looking at the responses, it seems like there
is a consensus to move forward with fastavro as the default implementation
on Python 3.

There are two questions left, however:
- Should fastavro also become the default implementation on Python 2?
This is a trade-off between having a consistent API across Python versions
and keeping the current behavior on Python 2.

- Should we keep the avro-python3 dependency?
With the proposed solution, we could remove the avro-python3 dependency,
but it might have to be re-added if we want to support Avro again on Python
3 in a future version.
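
To make the first question concrete, here is a minimal sketch of a
version-dependent default. The helper name `default_use_fastavro` and its
`consistent` parameter are hypothetical and not part of Beam; they only
model the two options being weighed.

```python
import sys


def default_use_fastavro(version_info=sys.version_info, consistent=False):
    """Hypothetical helper: choose the default for a use_fastavro-style flag.

    consistent=True models "fastavro is the default everywhere".
    consistent=False keeps the current behavior (flag off) on Python 2.
    """
    if consistent:
        return True
    # Only Python 3 switches to fastavro by default.
    return version_info[0] >= 3
```

With `consistent=False`, Python 2 users see no change, at the cost of a
default that differs between Python versions.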

Kind regards,
Robbe


On Thu, 28 Mar 2019 at 18:28, Ahmet Altay <al...@google.com> wrote:

> Hi Ismaël,
>
> It is great to hear that Avro is planning to make a release soon.
>
> To answer your concerns, fastavro has a set of tests using regular avro
> files[1] and it also has a large set of users (with 675470 package
> downloads). This is in addition to it being a py2 & py3 compatible package
> and offering ~7x performance improvements [2]. Another data point, we were
> testing fastavro for a while behind an experimental flag and have not seen
> issues related to compatibility.
>
> pyavro-rs sounds promising; however, I could not find a released version of
> it on PyPI. The source code does not appear to be maintained either, with
> the last commit on Jul 2, 2018 (for comparison, the last change to fastavro
> was on Mar 19, 2019).
>
> I think given the state of things, it makes sense to switch to fastavro as
> the default implementation to unblock python 3 changes. When avro offers a
> similar level of performance we could switch back without any visible user
> impact.
>
> Ahmet
>
> [1] https://github.com/fastavro/fastavro/tree/master/tests
> [2] https://pypi.org/project/fastavro/
>
> On Thu, Mar 28, 2019 at 7:53 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> Hello,
>>
>> The problem with switching implementations is the risk of losing
>> interoperability, and this is more important than performance. Does
>> fastavro have tests that guarantee that it is fully compatible with
>> Avro’s Java version? (given that it is the de-facto implementation
>> used everywhere).
>>
>> If performance is the more important criterion, maybe it is worth taking
>> a look at pyavro-rs [1]; you can see its performance in the great talk
>> from last year [2].
>>
>> I have been actively involved in the Avro community in recent months
>> and I am now a committer there. Also, Dan Kulp, who has made multiple
>> contributions to Beam, is now a PMC member too. We are at this point
>> working hard to get the next release of Avro out; in fact, the branch
>> cut of Avro 1.9.0 is happening this week, and we plan to improve the
>> release cadence. Please understand that the issue with Avro is that it
>> is a really specific and ‘old’ project (~10 years), so part of the
>> active community moved to other areas because it is stable, but we are
>> still there working on it and we are eager to improve it for everyone’s
>> needs (and of course Beam's needs).
>>
>> I know that Python 3’s Avro implementation is still lacking and could
>> be improved (views expressed here are clearly valid), but maybe this
>> is a chance to contribute there too. Remember, Apache projects are a
>> family and we have a history of cross-collaboration with other
>> communities, e.g. Flink and Calcite, so why not give Avro a chance
>> too?
>>
>> Regards,
>> Ismaël
>>
>> [1] https://github.com/flavray/pyavro-rs
>> [2]
>> https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf
>>
>> On Wed, Mar 27, 2019 at 11:42 PM Chamikara Jayalath
>> <ch...@google.com> wrote:
>> >
>> > +1 for making use_fastavro the default for Python3. I don't see any
>> > significant drawbacks in doing this from Beam's point of view. One concern
>> > is whether avro and fastavro can safely co-exist in the same environment so
>> > that Beam continues to work for users who already have the avro library
>> > installed.
>> >
>> > Note that there are two use_fastavro flags (confusingly enough):
>> > (1) one for the Avro file source [1]
>> > (2) an experiment flag [2] with the same name that makes the Dataflow
>> > runner use the fastavro library for reading/writing intermediate files
>> > and for reading Avro files exported by BigQuery.
>> >
>> > I can help with the latter.
>> >
>> > [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/avroio.py#L81
>> > [2] https://lists.apache.org/thread.html/94bd362a3a041654e6ef9003fb3fa797e25274fdb4766065481a0796@%3Cuser.beam.apache.org%3E
>> >
>> > Thanks,
>> > Cham
>> >
>> > On Wed, Mar 27, 2019 at 3:27 PM Valentyn Tymofieiev <valentyn@google.com> wrote:
>> >>
>> >> Thanks, Robbe and Frederik, for raising this.
>> >>
>> >> Over the course of making Beam Python 3 compatible, this is at least
>> >> the second time [1] we have had to deal with an error in the avro-python3
>> >> package. The release cadence of Apache Avro (1 release a year) is
>> >> concerning to me [2]. Even if we have a new release with Python 3 fixes
>> >> soon, as Beam users start using Beam more actively on Python 3, we may
>> >> encounter more issues in avro-python3. If this happens, Beam will have to
>> >> monkey-patch its way around the avro-python3 issues, because waiting for
>> >> the next Avro release may not be practical.
>> >>
>> >> So, I agree that it is a good time to start transitioning off of the
>> >> avro/avro-python3 dependency, given that fastavro is known to be a faster
>> >> alternative [3] and is released monthly [4].
>> >>
>> >> There are a couple of ways to make this transition, depending on how
>> >> careful we want to be. We should:
>> >>
>> >> 1. Remove the dependency on avro in the current codepath whenever
>> >>    fastavro is used, as you propose.
>> >> 2. Remove the Beam dependency on avro-python3 now, OR, if we want to be
>> >>    safer, set use_fastavro=True as the default option on Python 3, but
>> >>    keep the dependency on avro-python3 and keep that codepath, even
>> >>    though it may not work right now on Py3 but might work after the
>> >>    next Avro release.
>> >> 3. Set use_fastavro=True as the default option on Python 2.
>> >> 4. Remove the Beam dependency on avro and avro-python3 after several
>> >>    releases.
>> >>
>> >> Adding +Chamikara Jayalath and +Udi Meiri, who have been working on
>> >> Beam IOs and may have some thoughts here. Do you think that it is safe to
>> >> make use_fastavro=True the default option for both Py2 and Py3 now? If we
>> >> make use_fastavro the default option on Py3, do you think there is a
>> >> benefit to still keeping the Avro codepath on Py3, or can we remove it?
>> >>
>> >> Thanks,
>> >> Valentyn
>> >>
>> >> [1] https://github.com/apache/avro/pull/436
>> >> [2] https://avro.apache.org/releases.html
>> >> [3] https://medium.com/@abrarsheikh/benchmarking-avro-and-fastavro-using-pytest-benchmark-tox-and-matplotlib-bd7a83964453
>> >> [4] https://pypi.org/project/fastavro/#history
>> >>
>> >> On Wed, Mar 27, 2019 at 10:49 AM Robbe Sneyders <ro...@ml6.eu> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> We're looking at fixing avroio on Python 3, which still fails due to
>> >>> a non-picklable schema class in Avro [1]. This is fixed when using the
>> >>> latest Avro master, but the last release dates back to May 2017.
>> >>>
>> >>> Fastavro does not have the same problem, but is currently also
>> >>> failing due to a dependency of avroio on Avro for schema parsing.
>> >>>
>> >>> We would therefore propose to (temporarily?) deprecate Avro on Python
>> >>> 3, and implement a pure fastavro solution instead. +Frederik Bode already
>> >>> submitted a PR for this [2].
>> >>>
>> >>> Use of fastavro is currently activated with the `use_fastavro` flag,
>> >>> which defaults to False. Since this flag would not make sense anymore on
>> >>> Python 3, we would like to switch the default value to True. The
>> >>> documentation already mentions that this will probably become the default
>> >>> in the long term, but this change would also impact Python 2. Is this a
>> >>> problem?
>> >>>
>> >>> Also, looking at the performance gain of fastavro, is there any
>> >>> reason to not deprecate Avro in favor of fastavro on Python 3 indefinitely?
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/BEAM-6522#comment-16784499
>> >>> [2] https://github.com/apache/beam/pull/8130
>> >>>
>> >>> Kind regards,
>> >>> Robbe
>>
>

Re: Deprecating Avro for fastavro on Python 3

Posted by Valentyn Tymofieiev <va...@google.com>.
I would suggest making fastavro the default option on Python 3, for lack of
an alternative at the moment, but keeping the current default on Python 2.

I would also keep both avro and avro-python3 dependencies and associated
codepaths.

This way, we will gradually increase the usage of fastavro, but keep an
alternative available in case users encounter issues. Hopefully, with the
next Avro release, the non-default alternative on Python 3 will be viable;
we can keep the tests around to check that.

As usage of Python 3 ramps up, we will have more confidence in making
fastavro the default implementation on both Python 2 and Python 3,
if we so choose.
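
As a rough sketch of what keeping both codepaths could look like: the
function names below (`read_with_avro`, `read_with_fastavro`,
`read_records`) are placeholders invented for illustration, not Beam's
actual avroio code, and the readers only return a tag instead of parsing
a file.

```python
import sys


def read_with_avro(path):
    # Placeholder for the avro / avro-python3 codepath.
    return ("avro", path)


def read_with_fastavro(path):
    # Placeholder for the fastavro codepath.
    return ("fastavro", path)


def read_records(path, use_fastavro=None, py_major=sys.version_info[0]):
    """Dispatch to one of the two codepaths; None means 'use the default'."""
    if use_fastavro is None:
        # fastavro by default on Python 3, current default on Python 2.
        use_fastavro = py_major >= 3
    reader = read_with_fastavro if use_fastavro else read_with_avro
    return reader(path)
```

A user who hits a fastavro issue on Python 3 could then pass
`use_fastavro=False` explicitly, provided the Avro codepath works there.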

Thanks,
Valentyn

Re: Deprecating Avro for fastavro on Python 3

Posted by Robert Bradshaw <ro...@google.com>.
I agree with Ahmet.

Fastavro seems to be well maintained and has good, tested
compatibility. Unless we expect significant performance improvements
in the standard Avro Python package (a significant undertaking, likely
not one we have the bandwidth to take on, and my impression is that
it's historically not been a priority) it's hard to justify using it
instead. Python 3 issues are just the trigger to consider finally
moving over, as I think that was the long-term intent back when
fastavro was added as an option. (Possibly if there are features
missing from fastavro, that could be a reason as well, at least to
keep the option around even if it's not the default.)

That being said, we should definitely not change the default and
remove the old version in the same release.
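
One way to stage this, as a sketch only: flip the default first and have
the old path emit a deprecation warning for a few releases before removing
it. The function `resolve_use_fastavro` and the warning text are
hypothetical, not existing Beam code.

```python
import warnings


def resolve_use_fastavro(use_fastavro=True):
    """Hypothetical: the default is already True, the old path still works."""
    if not use_fastavro:
        # Warn instead of failing, so existing pipelines keep running.
        warnings.warn(
            "The avro codepath is deprecated and may be removed in a later "
            "release; consider use_fastavro=True.",
            DeprecationWarning,
            stacklevel=2,
        )
    return use_fastavro
```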

- Robert
