Posted to dev@avro.apache.org by Uri Laserson <la...@cloudera.com> on 2013/04/29 07:24:33 UTC

3x faster python reader

Hi all,

I rewrote some of the Python code that reads Avro files.  I was able to
achieve a ~3x speedup over the current implementation, and could probably do
better if the code were cleaned up more.  The main changes are:
* Eliminated the object-oriented nature of the reader.  It's just functions
now.  Presumably this can be changed back, but there didn't really seem to
be any reason for it.
* Given a reader and writer schema, it precomputes as much helpful info as
it can upfront and caches this in a dictionary that the read functions use.
* The code is compiled with Cython for speedup.
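
Resolving the reader and writer schemas once and caching the result, as the
second point describes, might look roughly like the sketch below.  This is
not the code from the rewrite: it is a minimal illustration that assumes flat
records with only long, int, and string fields, and the names make_plan,
read_record, and _PLAN_CACHE are hypothetical.  Type promotion, aliases, and
nested types are left out.

```python
import io
import json


def read_long(buf):
    """Decode an Avro int/long: variable-length, zigzag-encoded."""
    b = buf.read(1)[0]
    n = b & 0x7F
    shift = 7
    while b & 0x80:  # high bit set means another byte follows
        b = buf.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zigzag


def read_string(buf):
    """Decode an Avro string: a long byte length, then UTF-8 bytes."""
    return buf.read(read_long(buf)).decode("utf-8")


PRIMITIVE_READERS = {"int": read_long, "long": read_long, "string": read_string}

# Hypothetical cache: (writer schema JSON, reader schema JSON) -> plan
_PLAN_CACHE = {}


def make_plan(writer_schema, reader_schema):
    """Resolve two flat record schemas once, before any data is read."""
    key = (json.dumps(writer_schema, sort_keys=True),
           json.dumps(reader_schema, sort_keys=True))
    if key not in _PLAN_CACHE:
        wanted = {f["name"] for f in reader_schema["fields"]}
        written = {f["name"] for f in writer_schema["fields"]}
        # One step per writer field, in wire order: (name, read function, keep?)
        steps = [(f["name"], PRIMITIVE_READERS[f["type"]], f["name"] in wanted)
                 for f in writer_schema["fields"]]
        # Reader fields the writer never wrote take their declared default.
        defaults = {f["name"]: f["default"] for f in reader_schema["fields"]
                    if f["name"] not in written}
        _PLAN_CACHE[key] = {"steps": steps, "defaults": defaults}
    return _PLAN_CACHE[key]


def read_record(buf, plan):
    """Decode one record by walking the precomputed plan."""
    record = dict(plan["defaults"])
    for name, read, keep in plan["steps"]:
        value = read(buf)  # always consume the bytes, even for dropped fields
        if keep:
            record[name] = value
    return record


writer = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}]}
reader = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "country", "type": "string", "default": "unknown"}]}

plan = make_plan(writer, reader)
# id=42 -> 0x54, "uri" -> length byte 0x06 plus the bytes, age=30 -> 0x3C
data = bytes([0x54, 0x06]) + b"uri" + bytes([0x3C])
record = read_record(io.BytesIO(data), plan)
print(record)  # {'country': 'unknown', 'name': 'uri'}
```

Building the plan costs a few set lookups per schema pair.  After that, each
record is decoded by walking a precomputed list rather than re-inspecting
both schemas for every field.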

How can this be used to improve the current Python API?  Let me know how I
can be helpful...

Uri

-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com

Re: 3x faster python reader

Posted by Uri Laserson <la...@cloudera.com>.

Hi Miki,

Yes, I followed your model in remaking the Avro reader, but I performed the
schema resolution so that you could still specify separate writer/reader
schemas.  Your code is still 2.5x faster than mine when using the C
extensions.

I personally find the current API somewhat confusing, so I'd be into
changing it.

Uri


On Mon, Apr 29, 2013 at 2:32 PM, Miki Tebeka <mi...@gmail.com> wrote:

> Hi,
>
> I did the same for fastavro <https://bitbucket.org/tebeka/fastavro>. I
> found changing the current code while keeping the same API very hard.
>
> Another option we can take is to leave the current code as version 1 and
> add the new code either as a new module under avro or as avro2.
>
> All the best,
> --
> Miki

Re: 3x faster python reader

Posted by Russell Jurney <ru...@gmail.com>.

I'm very interested in getting these changes into trunk. Moral support +1 :)

Russell Jurney http://datasyndrome.com

On Apr 29, 2013, at 2:32 PM, Miki Tebeka <mi...@gmail.com> wrote:

> Hi,
>
> I did the same for fastavro <https://bitbucket.org/tebeka/fastavro>. I
> found changing the current code while keeping the same API very hard.
>
> Another option we can take is to leave the current code as version 1 and
> add the new code either as a new module under avro or as avro2.
>
> All the best,
> --
> Miki

Re: 3x faster python reader

Posted by Miki Tebeka <mi...@gmail.com>.

Hi,

I did the same for fastavro <https://bitbucket.org/tebeka/fastavro>. I
found changing the current code while keeping the same API very hard.

Another option we can take is to leave the current code as version 1 and
add the new code either as a new module under avro or as avro2.

All the best,
--
Miki



Re: 3x faster python reader

Posted by Uri Laserson <la...@cloudera.com>.

It's probably too messy to go into a patch at this point.  I just put the
code up on a fork:

https://github.com/laserson/avro/tree/perf

Phil, perhaps we could sit down at some point and go through it briefly?


On Mon, Apr 29, 2013 at 10:56 AM, Philip Zeyliger <ph...@cloudera.com> wrote:

> Hi Uri,
>
> Once you post to the JIRA, I'd be happy to review it.
>
> -- Philip

Re: 3x faster python reader

Posted by Philip Zeyliger <ph...@cloudera.com>.

Hi Uri,

Once you post to the JIRA, I'd be happy to review it.

-- Philip


On Mon, Apr 29, 2013 at 9:22 AM, Doug Cutting <cu...@apache.org> wrote:

> Uri,
>
> This sounds awesome!  Is the API compatible with the existing API?  If
> it's incompatible and cannot easily be made compatible then perhaps we
> can add it as the 'new' API and deprecate the old one.  Regardless,
> please file an issue in Jira (issues.apache.org/jira/browse/AVRO) and
> attach your patch there.
>
> Thanks,
>
> Doug

Re: 3x faster python reader

Posted by Doug Cutting <cu...@apache.org>.

Uri,

This sounds awesome!  Is the API compatible with the existing API?  If
it's incompatible and cannot easily be made compatible then perhaps we
can add it as the 'new' API and deprecate the old one.  Regardless,
please file an issue in Jira (issues.apache.org/jira/browse/AVRO) and
attach your patch there.

Thanks,

Doug
