Posted to dev@avro.apache.org by Ryan Blue <bl...@cloudera.com> on 2015/10/28 19:14:37 UTC

Python Avro implementations

Hi everyone,

Right now, we have two Python implementations: py and py3. There is 
also fastavro [1], which is popular because it is fast and more 
Pythonic. It works with Python 2.7, Python 3.x, and PyPy, and can be 
sped up with Cython.

I had a recent e-mail exchange with Miki Tebeka, the creator and 
maintainer of fastavro, about the current python Avro implementations 
and he's interested in working with the Apache community to merge the 
existing implementations into one. I'm really excited about it, since 
this is a great opportunity to grow the Avro community and consolidate 
the python implementations.

I'd like to start a discussion from this thread about next steps. I 
think the best way forward is to bring fastavro in, and then work on 
building compatibility with the current APIs where we need to so that we 
can deprecate the existing py and py3 projects.

Does that sound reasonable?

rb


[1]: https://github.com/tebeka/fastavro

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Python Avro implementations

Posted by Marius Dieckmann <m....@googlemail.com>.
Hi,

I recently evaluated the performance of various Python Avro 
implementations. Besides the official Python implementation and fastavro, 
there is a fourth implementation called pyavroc [1]. pyavroc seems to be 
even faster than fastavro in terms of parsing performance, but it uses 
the Avro C library with Python bindings rather than pure Python. I am 
not sure if this is desired, but it could be a good option to develop 
fastavro so that the Avro C library can be integrated into the code to 
improve performance. (I am also not sure whether optimizing the code 
for Cython might bring the performance to a similar level.) In addition, 
pyavroc does not seem to have much API compatibility, so I am not sure 
what the focus should be: API compatibility or performance.

In terms of parsing performance I found the following (parse time 
normalized against the standard Python avro 1.7.7; lower is faster):

avro_python:  1
fastavro:     0.2717  (I am not sure if I used Cython correctly)
pyavroc:      0.0285  (only built-in Python functions were used, i.e. 
                       no numpy or anything similar)

The results were more or less stable with various tests and files.
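
For illustration, ratios like the ones above could be computed along the 
following lines. This is only a minimal sketch and not the script used for 
these measurements: the helper names time_parser and normalize are 
hypothetical, and the two lambdas are placeholders that would need to be 
replaced by calls that parse the same Avro file with each library.

```python
import timeit


def time_parser(parse, repeat=5, number=1):
    """Return the best wall-clock time in seconds over `repeat` runs of parse()."""
    return min(timeit.repeat(parse, repeat=repeat, number=number))


def normalize(timings, baseline):
    """Divide every timing by the baseline's timing, so the baseline becomes 1.0."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items()}


if __name__ == "__main__":
    # Hypothetical stand-ins (assumption): replace each lambda with a call
    # that fully parses the same Avro file using the library being measured.
    parsers = {
        "avro_python": lambda: sum(range(200_000)),
        "other_impl": lambda: sum(range(20_000)),
    }
    timings = {name: time_parser(fn) for name, fn in parsers.items()}
    for name, ratio in normalize(timings, "avro_python").items():
        print(f"{name}: {ratio:.4f}")
```

Taking the minimum over several repeats is one common way to reduce noise 
from other processes; averaging over several input files would be another 
reasonable choice.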


Cheers

[1] https://github.com/Byhiras/pyavroc

On 29.10.2015 at 14:23, Sean Busbey wrote:
> sounds great to me.


Re: Python Avro implementations

Posted by Ryan Blue <bl...@cloudera.com>.
Hey everyone,

Sorry it took so long; I forgot I had promised to open an issue for 
this. It is here:

   https://issues.apache.org/jira/browse/AVRO-1756

Next step is to get a patch together!

rb

On 11/05/2015 09:06 AM, Ryan Blue wrote:
> Thanks, Miki! This sounds great. I'll open up an issue in Avro's tracker
> for this.
>
> You might also want to have a look at the ongoing import of Matthieu's
> js implementation for an idea about the steps:
>
>    https://issues.apache.org/jira/browse/AVRO-1747
>
> Please let us know what we can do to help the process along. If you want
> to put together a patch that adds fastavro as lang/python that would be
> a great start so we can start looking at it. It sounds like another item
> for us to follow up on is the repository structure and release policies
> in the other thread.
>
> rb


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Python Avro implementations

Posted by Ryan Blue <bl...@cloudera.com>.
Thanks, Miki! This sounds great. I'll open up an issue in Avro's tracker 
for this.

You might also want to have a look at the ongoing import of Matthieu's 
js implementation for an idea about the steps:

   https://issues.apache.org/jira/browse/AVRO-1747

Please let us know what we can do to help the process along. If you want 
to put together a patch that adds fastavro as lang/python, that would be 
a great start so we can begin looking at it. It sounds like another item 
for us to follow up on is the repository structure and release policies 
in the other thread.

rb

On 10/31/2015 01:15 AM, Miki Tebeka wrote:
> I'd love to have the project hosted under the official avro repository
> and gain help from people who know Avro far better than me.
>
> I'll take some time to re-learn the existing avro API and try to guesstimate
> the effort involved in wrapping the current fastavro codebase with it.
> However I have a hunch we won't be 100% backward compatible and will need
> some phase-out period (of course - I might be wrong :)


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Python Avro implementations

Posted by Miki Tebeka <mi...@gmail.com>.
I'd love to have the project hosted under the official Avro repository
and gain help from people who know Avro far better than me.

I'll take some time to re-learn the existing Avro API and try to guesstimate
the effort involved in wrapping the current fastavro codebase with it.
However, I have a hunch we won't be 100% backward compatible and will need
some phase-out period (of course - I might be wrong :)
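
As a rough illustration of what wrapping plus a phase-out period could look
like, here is a minimal sketch. All names in it (read_records, LegacyReader)
are hypothetical and do not correspond to the real avro or fastavro APIs; the
stand-in reader treats each line of a text file as one record purely to keep
the example self-contained.

```python
import warnings


def read_records(fo):
    """Hypothetical stand-in for the new implementation's reader.

    A real reader would decode Avro records from the file object; here
    each line is treated as one record (assumption for illustration).
    """
    for line in fo:
        yield line.rstrip("\n")


class LegacyReader:
    """Hypothetical legacy-style wrapper kept alive during a phase-out period."""

    def __init__(self, fo):
        # Warn callers that this entry point is going away, then delegate.
        warnings.warn(
            "LegacyReader is deprecated; iterate over read_records(fo) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._records = read_records(fo)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._records)

    def close(self):
        # Closing the underlying generator stops further iteration.
        self._records.close()
```

The idea is that existing callers keep working but see a DeprecationWarning,
while new code uses the underlying reader directly. Where the old and new
behavior cannot be reconciled, the wrapper is the place to document the gap.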

On Thu, Oct 29, 2015 at 3:23 PM, Sean Busbey <bu...@cloudera.com> wrote:

> sounds great to me.

Re: Python Avro implementations

Posted by Sean Busbey <bu...@cloudera.com>.
sounds great to me.

On Wed, Oct 28, 2015 at 1:14 PM, Ryan Blue <bl...@cloudera.com> wrote:
> Hi everyone,
>
> Right now, we have two python implementations: py and py3. And there is also
> fastavro [1], which is popular because it is fast and more pythonic. It also
> works with python 2.7, python 3.x, pypy, and can be sped up by cython.
>
> I had a recent e-mail exchange with Miki Tebeka, the creator and maintainer
> of fastavro, about the current python Avro implementations and he's
> interested in working with the Apache community to merge the existing
> implementations into one. I'm really excited about it, since this is a great
> opportunity to grow the Avro community and consolidate the python
> implementations.
>
> I'd like to start a discussion from this thread about next steps. I think
> the best way forward is to bring fastavro in, and then work on building
> compatibility with the current APIs where we need to so that we can
> deprecate the existing py and py3 projects.
>
> Does that sound reasonable?
>
> rb
>
>
> [1]: https://github.com/tebeka/fastavro
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.



-- 
Sean