Posted to dev@avro.apache.org by Spencer Nelson <s...@spencerwnelson.com> on 2021/07/26 23:57:12 UTC

New library for JIT-compiling Avro schemas in Python

Hi all,

I'd like to share a library I wrote to encode and decode Avro data in
Python. It's called 'avroc', at https://github.com/spenczar/avroc,
MIT-licensed. I wrote a bit about it here:
https://journal.spencerwnelson.com/entries/avro.html.

Most runtime Avro decoders decode data by walking through a schema
definition, traversing it in memory to work out, step by step, how to
interpret the next bytes. They repeat this traversal for every message
they decode, which can involve a lot of overhead. For example, there is
often a big switch on the "type" of a schema, which must be evaluated
for every node in the definition. Encoders work similarly.
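
To make the overhead concrete, here is a minimal sketch of that
interpretive style. It is illustrative only and is not taken from avroc
or any particular library; the names read_long, decode, and SCHEMA are
made up for the example, and it handles just a few Avro types.

```python
import io
import struct


def read_long(buf):
    """Read an Avro int/long: a zigzag-encoded variable-length integer."""
    shift = 0
    result = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo zigzag encoding to recover the signed value.
    return (result >> 1) ^ -(result & 1)


def decode(schema, buf):
    """Walk the schema node by node; this dispatch reruns for every message."""
    kind = schema["type"] if isinstance(schema, dict) else schema
    if kind == "record":
        # Fields are encoded back to back, in schema order.
        return {f["name"]: decode(f["type"], buf) for f in schema["fields"]}
    if kind in ("int", "long"):
        return read_long(buf)
    if kind == "string":
        return buf.read(read_long(buf)).decode("utf-8")
    if kind == "double":
        return struct.unpack("<d", buf.read(8))[0]
    raise NotImplementedError(kind)


# Hypothetical schema and a hand-encoded message for it.
SCHEMA = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "count", "type": "long"},
    ],
}
payload = b"\x08avro\x06"  # string of length 4 ("avro"), then the long 3
record = decode(SCHEMA, io.BytesIO(payload))
```

Every call to decode re-examines the same schema nodes and re-runs the
same chain of type checks, even though the schema never changes between
messages.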

One way around this is code generation, which can produce efficient
routines - but only if you know the schema in advance. And code generation
doesn't work for schema resolution, where you want to read with a different
schema than the writer used (perhaps to drop fields you don't care about).

To get around those limitations, avroc uses a different strategy: it
translates Avro schemas into a Python AST at runtime. That AST is then
compiled into Python bytecode, which runs directly on the Python
interpreter's virtual machine.
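
The general shape of that technique can be sketched with the standard
library's ast module. This is a toy illustration under simplifying
assumptions (a flat record of int, long, and string fields), not avroc's
actual code; compile_decoder, READERS, and the helper functions are
names invented for the example.

```python
import ast
import io


def read_long(buf):
    """Read an Avro int/long: a zigzag-encoded variable-length integer."""
    shift = 0
    result = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(buf):
    """Read an Avro string: a long byte count followed by UTF-8 data."""
    return buf.read(read_long(buf)).decode("utf-8")


# Maps an Avro primitive type to the name of the helper that reads it.
READERS = {"int": "read_long", "long": "read_long", "string": "read_string"}


def compile_decoder(schema):
    """Build a decoder function for a flat record of primitive fields."""
    # Start from an empty function skeleton, then fill in what it returns.
    module = ast.parse("def decode(buf):\n    return None")
    func = module.body[0]
    keys, values = [], []
    for field in schema["fields"]:
        keys.append(ast.Constant(value=field["name"]))
        # The type lookup happens here, once, while building the AST.
        # The generated code contains only a direct call: reader(buf).
        values.append(
            ast.Call(
                func=ast.Name(id=READERS[field["type"]], ctx=ast.Load()),
                args=[ast.Name(id="buf", ctx=ast.Load())],
                keywords=[],
            )
        )
    func.body[0].value = ast.Dict(keys=keys, values=values)
    ast.fix_missing_locations(module)
    namespace = {"read_long": read_long, "read_string": read_string}
    exec(compile(module, "<avro-decoder>", "exec"), namespace)
    return namespace["decode"]


# Hypothetical schema; the decoder is compiled once and reused per message.
SCHEMA = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "count", "type": "long"},
    ],
}
decode = compile_decoder(SCHEMA)
record = decode(io.BytesIO(b"\x08avro\x06"))
```

The schema is walked only once, inside compile_decoder. The function it
returns has no type dispatch left in it, just one reader call per field.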

The results have been very promising. I'm able to outperform the Java
implementation on some real-world test data (admittedly, my case uses a
very deep and complicated schema), as measured by encoding and decoding
throughput.

I think this approach may be generally useful in other languages, so I
wanted to present it as a possible design option for others. I'm probably
not the first to think of this; "Avro schemas as LL(1) CFG
definitions" (
http://avro.apache.org/docs/1.10.2/api/java/org/apache/avro/io/parsing/doc-files/parsing.html)
is in the official docs and seems to be driving at a similar idea.

If you're using Python and are interested in a high-performance,
pure-Python implementation, give avroc a shot. I'd love to hear from you.

-Spencer