Posted to user@avro.apache.org by Han JU <ju...@gmail.com> on 2015/01/09 14:32:42 UTC

Python avro performance

Hi,

I'm evaluating Avro to replace our CSV-based datasets, and I've noticed a
performance problem in the Avro Python bindings.
I tested on a 1.8GB dataset with 5 columns. With Scala (the Avro
Java bindings), reads and writes are fast (18s, 44s), but in Python, for the
same file, it took nearly one hour to write and 50 minutes to read ...

My code is based on the Avro documentation examples, and the schema is
relatively simple. My questions:
  - Is this performance difference a known issue?
  - Is there something I'm missing (say, a special configuration)?

I've seen the fastavro project, which is much faster at reading but has no
write support. This will prevent us from using Avro, since we have a lot of
Python-based programs that need to persist data.

Thanks!
-- 
*JU Han*

Data Engineer @ Botify.com

+33 0619608888

RE: Python avro performance

Posted by Ben Walsh <be...@byhiras.com>.
We have a Python C module which calls Avro's C implementation.

It can be about 70 times faster at reading than the standard Python implementation.

https://github.com/Byhiras/pyavroc

Currently it requires Avro-C to be built with some patches (available at https://github.com/Byhiras/avro). Some of these (e.g. AVRO-1528 and AVRO-1572) are already in JIRA. I will try to get the rest attached to JIRA so they can be added to the upstream code.

Cheers

Ben


From: Doug Cutting <cu...@apache.org>
Sent: 10 January 2015 00:01
To: user@avro.apache.org
Subject: Re: Python avro performance
    
On Fri, Jan 9, 2015 at 6:05 AM, Bruce Mitchener
<br...@gmail.com> wrote:
> Has anyone profiled the Python code or otherwise looked at the performance?

Not that I've heard of.  Optimizing Avro Python would be a great project.

Doug

    

Re: Python avro performance

Posted by Philip Zeyliger <ph...@cloudera.com>.
If I recall correctly, the Python write implementation does a recursive
pass to check the data against the schema.  This is sometimes necessary to
choose which branch of a union to take when you're faced with typeless
dicts, but it's done more often than necessary in the python
implementation, and is very slow.

I think the right approach is to have a way for the user to tag the various
dicts to indicate which branch of a union it'll represent.
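
To make the difference concrete, here is a minimal sketch of the two ways of picking a union branch. It is not the avro package's code: the toy schemas and the names validate, resolve_by_validation and resolve_by_tag are all hypothetical, and the "tag" is assumed to be a simple (branch name, datum) pair.

```python
# Hypothetical sketch; none of these names come from the avro package.

def validate(schema, datum):
    """Recursively check that datum conforms to a (simplified) schema."""
    if isinstance(schema, list):  # a union: any branch may match
        return any(validate(branch, datum) for branch in schema)
    kind = schema["type"] if isinstance(schema, dict) else schema
    if kind == "null":
        return datum is None
    if kind == "string":
        return isinstance(datum, str)
    if kind == "long":
        return isinstance(datum, int) and not isinstance(datum, bool)
    if kind == "record":
        # Walks every field, and recurses into nested records and unions.
        return isinstance(datum, dict) and all(
            validate(field["type"], datum.get(field["name"]))
            for field in schema["fields"]
        )
    raise ValueError("unsupported schema type: %r" % (kind,))


def branch_name(schema):
    """Name used to tag a branch: the record name, else the type name."""
    if isinstance(schema, dict):
        return schema.get("name", schema["type"])
    return schema


def resolve_by_validation(union, datum):
    """Slow path: validate the datum against each branch in turn."""
    for index, branch in enumerate(union):
        if validate(branch, datum):
            return index
    raise ValueError("datum matches no branch of the union")


def resolve_by_tag(union, tagged):
    """Fast path: the caller says which branch it means, no recursion."""
    name, _datum = tagged
    for index, branch in enumerate(union):
        if branch_name(branch) == name:
            return index
    raise ValueError("no branch named %r in the union" % (name,))


CAT = {"type": "record", "name": "Cat", "fields": [
    {"name": "name", "type": "string"},
    {"name": "lives", "type": "long"},
]}
DOG = {"type": "record", "name": "Dog", "fields": [
    {"name": "name", "type": "string"},
    {"name": "owner", "type": "string"},
]}
PET_UNION = ["null", CAT, DOG]

rex = {"name": "Rex", "owner": "Han"}
print(resolve_by_validation(PET_UNION, rex))   # tries null, Cat, then Dog
print(resolve_by_tag(PET_UNION, ("Dog", rex)))  # looks up the name only
```

With the tagged form the writer never has to walk the dict to find out what it is, which is the cost that adds up when it happens for every record.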

-- Philip


On Fri, Jan 9, 2015 at 4:01 PM, Doug Cutting <cu...@apache.org> wrote:

> On Fri, Jan 9, 2015 at 6:05 AM, Bruce Mitchener
> <br...@gmail.com> wrote:
> > Has anyone profiled the Python code or otherwise looked at the
> performance?
>
> Not that I've heard of.  Optimizing Avro Python would be a great project.
>
> Doug
>

Re: Python avro performance

Posted by Doug Cutting <cu...@apache.org>.
On Fri, Jan 9, 2015 at 6:05 AM, Bruce Mitchener
<br...@gmail.com> wrote:
> Has anyone profiled the Python code or otherwise looked at the performance?

Not that I've heard of.  Optimizing Avro Python would be a great project.

Doug

Re: Python avro performance

Posted by Wai Yip Tung <wy...@tungwaiyip.info>.
Python Avro is super slow. I have built a C module that is about 30 
times faster. It does both encoding and decoding. I intend to open 
source it soon. More testers would be helpful then.

Wai Yip

> Bruce Mitchener <ma...@gmail.com>
> Friday, January 09, 2015 6:05 AM
> Has anyone profiled the Python code or otherwise looked at the 
> performance?
>
>  - Bruce
>
> Sent from my iPhone
>
> On Jan 9, 2015, at 8:56 PM, Han JU <ju.han.felix@gmail.com 
> <ma...@gmail.com>> wrote:
>
> Han JU <ma...@gmail.com>
> Friday, January 09, 2015 5:56 AM
> Hi,
>
> Thanks. I've tried this project and its performance approaches 
> java/scala. But it seems that it has only read support. We have indeed 
> lots of use cases where python program need to persist datasets.
>
>
>
>
> -- 
> *JU Han*
>
> Data Engineer @ Botify.com
>
> +33 0619608888
> Mika Ristimaki <ma...@gmail.com>
> Friday, January 09, 2015 5:39 AM
> Hi,
>
> I can’t really comment why Python Avro is slow but you could try fastavro.
>
> https://pypi.python.org/pypi/fastavro
>
> -Mika
>
>
> Han JU <ma...@gmail.com>
> Friday, January 09, 2015 5:32 AM
> Hi,
>
> I'm evaluating Avro to replace our csv based datasets and I notice a 
> performance problem in avro python bindings.
> Basically I've tested on a 1.8GB dataset with 5 columns. With scala 
> (avro java bindings), reads and writes are fast (18s, 44s) but in 
> python, for the same file, it took nearly one hour to write, and 50 
> minutes to read ...
>
> My code is based on the avro documentation examples, and the schema is 
> relatively simple. My question:
>   - Is this performance difference a known issue?
>   - Is there something I miss (say a special configuration or something)?
>
> I've seen a fastavro project and that is much faster in reading, but 
> not write support. This will prevent us from using Avro since we've 
> lot of python based programs that need to persist data.
>
> Thanks!
> -- 
> *JU Han*
>
> Data Engineer @ Botify.com
>
> +33 0619608888

Re: Python avro performance

Posted by Bruce Mitchener <br...@gmail.com>.
Has anyone profiled the Python code or otherwise looked at the performance?
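
For anyone who wants to try, a rough sketch of what such a profiling run could look like using only the standard library's cProfile and pstats. The functions check_record and write_all are hypothetical stand-ins for the real write loop, not Avro code; one would swap in the actual read or write call.

```python
import cProfile
import io
import pstats


def check_record(record):
    """Stand-in for per-record work such as schema validation (hypothetical)."""
    return all(isinstance(value, (str, int)) for value in record.values())


def write_all(records):
    """Stand-in for a write loop; counts the records that pass the check."""
    return sum(1 for record in records if check_record(record))


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


records = [{"id": i, "name": "row%d" % i} for i in range(10000)]
count, report = profile_call(write_all, records)
print(report)
```

The report lists call counts and cumulative time per function, which should show quickly whether the time goes into validation, encoding, or I/O.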

 - Bruce

Sent from my iPhone

> On Jan 9, 2015, at 8:56 PM, Han JU <ju...@gmail.com> wrote:
> 
> Hi, 
> 
> Thanks. I've tried this project and its performance approaches java/scala. But it seems that it has only read support. We have indeed lots of use cases where python program need to persist datasets. 
> 
> 2015-01-09 14:39 GMT+01:00 Mika Ristimaki <mi...@gmail.com>:
>> Hi,
>> 
>> I can’t really comment why Python Avro is slow but you could try fastavro.
>> 
>> https://pypi.python.org/pypi/fastavro
>> 
>> -Mika
>> 
>>> On 09 Jan 2015, at 15:32, Han JU <ju...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm evaluating Avro to replace our csv based datasets and I notice a performance problem in avro python bindings.
>>> Basically I've tested on a 1.8GB dataset with 5 columns. With scala (avro java bindings), reads and writes are fast (18s, 44s) but in python, for the same file, it took nearly one hour to write, and 50 minutes to read ...
>>> 
>>> My code is based on the avro documentation examples, and the schema is relatively simple. My question: 
>>>   - Is this performance difference a known issue? 
>>>   - Is there something I miss (say a special configuration or something)?
>>> 
>>> I've seen a fastavro project and that is much faster in reading, but not write support. This will prevent us from using Avro since we've lot of python based programs that need to persist data.
>>> 
>>> Thanks!
>>> -- 
>>> JU Han
>>> 
>>> Data Engineer @ Botify.com
>>> 
>>> +33 0619608888
> 
> 
> 
> -- 
> JU Han
> 
> Data Engineer @ Botify.com
> 
> +33 0619608888

Re: Python avro performance

Posted by Han JU <ju...@gmail.com>.
Hi,

Thanks. I've tried this project and its performance approaches Java/Scala.
But it seems that it has only read support. We do have lots of use
cases where Python programs need to persist datasets.

2015-01-09 14:39 GMT+01:00 Mika Ristimaki <mi...@gmail.com>:

> Hi,
>
> I can’t really comment why Python Avro is slow but you could try fastavro.
>
> https://pypi.python.org/pypi/fastavro
>
> -Mika
>
> On 09 Jan 2015, at 15:32, Han JU <ju...@gmail.com> wrote:
>
> Hi,
>
> I'm evaluating Avro to replace our csv based datasets and I notice a
> performance problem in avro python bindings.
> Basically I've tested on a 1.8GB dataset with 5 columns. With scala (avro
> java bindings), reads and writes are fast (18s, 44s) but in python, for the
> same file, it took nearly one hour to write, and 50 minutes to read ...
>
> My code is based on the avro documentation examples, and the schema is
> relatively simple. My question:
>   - Is this performance difference a known issue?
>   - Is there something I miss (say a special configuration or something)?
>
> I've seen a fastavro project and that is much faster in reading, but not
> write support. This will prevent us from using Avro since we've lot of
> python based programs that need to persist data.
>
> Thanks!
> --
> *JU Han*
>
> Data Engineer @ Botify.com
>
> +33 0619608888
>
>
>


-- 
*JU Han*

Data Engineer @ Botify.com

+33 0619608888

Re: Python avro performance

Posted by Mika Ristimaki <mi...@gmail.com>.
Hi,

I can't really comment on why Python Avro is slow, but you could try fastavro.

https://pypi.python.org/pypi/fastavro

-Mika

> On 09 Jan 2015, at 15:32, Han JU <ju...@gmail.com> wrote:
> 
> Hi,
> 
> I'm evaluating Avro to replace our csv based datasets and I notice a performance problem in avro python bindings.
> Basically I've tested on a 1.8GB dataset with 5 columns. With scala (avro java bindings), reads and writes are fast (18s, 44s) but in python, for the same file, it took nearly one hour to write, and 50 minutes to read ...
> 
> My code is based on the avro documentation examples, and the schema is relatively simple. My question: 
>   - Is this performance difference a known issue? 
>   - Is there something I miss (say a special configuration or something)?
> 
> I've seen a fastavro project and that is much faster in reading, but not write support. This will prevent us from using Avro since we've lot of python based programs that need to persist data.
> 
> Thanks!
> -- 
> JU Han
> 
> Data Engineer @ Botify.com
> 
> +33 0619608888