Posted to user@avro.apache.org by Eelco Hillenius <ee...@gmail.com> on 2009/08/26 00:52:15 UTC

user experience

Hi,

I'd like to share some results I'm having with using Avro. Just fyi :-)

We are using Avro to log 'audit events'. Audit events are basically
simple Java objects with a few properties that describe the audit
event. An example is class SiteNodeDeletedEvent with properties
timeStamp, userId and siteNodeId. Most event classes have between 3
and 8 properties. What I like about doing audit logging like this,
rather than just logging string messages, is that it forces us to use
data structures that will be easier to analyze later, and that it
makes it much easier to go through our code to find what kinds of
audit events we have (all events must extend the AuditEvent base
class). We basically just use Avro to serialize these objects to
rolling log files locally, which are put into HDFS separately by a
daemon. We use Avro's reflection API so that we don't have to deal
with code generation and can keep our development model as simple as
possible.
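
To give an idea, here is a stripped-down sketch of the write path
(not our actual framework: one hypothetical event class, no rolling
or gzip logic, and written against a recent Avro API):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class AuditEventLogSketch {

  // Reflect-friendly event class: public fields of simple types.
  public static class SiteNodeDeletedEvent {
    public long timeStamp;
    public long userId;
    public long siteNodeId;
  }

  public static void main(String[] args) throws IOException {
    // Derive the Avro schema from the Java class via reflection;
    // no code generation involved.
    Schema schema = ReflectData.get().getSchema(SiteNodeDeletedEvent.class);

    DataFileWriter<SiteNodeDeletedEvent> writer =
        new DataFileWriter<SiteNodeDeletedEvent>(
            new ReflectDatumWriter<SiteNodeDeletedEvent>(schema));
    writer.create(schema, new File("audit-000001.avro"));

    SiteNodeDeletedEvent event = new SiteNodeDeletedEvent();
    event.timeStamp = System.currentTimeMillis();
    event.userId = 42L;
    event.siteNodeId = 7L;
    writer.append(event);

    writer.close();
  }
}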

Currently we write only eight different events to a database, and
this so far has resulted in a bit over 12 million records. However, I
hope to ramp up what we log, so I expect we will soon have trillions
of records. I'd much rather buy more disk space than have to worry
about scaling our database, and I think audit logging is kind of a
natural case for HDFS/MR. But while I'm at it, why not make the
logging itself efficient as well, which is where Avro comes into play.

I wrote a little framework for logging these events, and tested it
with our current records. In that test, I roll over each file after a
million records, so I end up with 13 files (the last file holds only
a quarter million), totaling 121 MB unpacked / 36 MB gzipped (the
framework typically gzips right after rolling over). So that's 10 MB
unpacked / 3 MB packed per million records. It writes those files,
including reading the records from a local MySQL database and
instantiating the event objects, in 4.5 minutes on my MBP. Reading in
and instantiating those events from the log files again takes 1.3
minutes.
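
And the read side, roughly (again just a sketch: it assumes files are
gzipped whole after rollover, and reuses the hypothetical
SiteNodeDeletedEvent class from the write sketch above):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.reflect.ReflectDatumReader;

public class AuditEventReadSketch {
  public static void main(String[] args) throws IOException {
    // SiteNodeDeletedEvent as defined in the write sketch above.
    ReflectDatumReader<SiteNodeDeletedEvent> datumReader =
        new ReflectDatumReader<SiteNodeDeletedEvent>(
            SiteNodeDeletedEvent.class);

    // DataFileStream reads from any InputStream, so gzip
    // decompression can be layered underneath.
    DataFileStream<SiteNodeDeletedEvent> stream =
        new DataFileStream<SiteNodeDeletedEvent>(
            new GZIPInputStream(
                new FileInputStream("audit-000001.avro.gz")),
            datumReader);

    long count = 0;
    for (SiteNodeDeletedEvent event : stream) {
      count++;  // instantiate-only pass, mirroring the timing test
    }
    stream.close();
    System.out.println(count + " events read");
  }
}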

In my book, those are pretty good figures for my humble laptop! And
keep in mind that I am using the reflection API; using specific
records could probably shave quite a bit off the processing time, at
least when it comes to writing. Anyway, I'm sure I won't have any
trouble selling Avro to my colleagues, and I just wanted to share my
experiences in case anyone is interested. It'd be awesome to read
others' experiences as well. Now on to playing with MR and Pig
etc. :-)

Cheers,

Eelco

Re: user experience

Posted by Eelco Hillenius <ee...@gmail.com>.
> Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
> objects in ~40 secs).  For comparison, creating tuple objects from a Hadoop
> SequenceFile is ~5x faster.  Granted I'm comparing apples to oranges (my
> objects in SequenceFile to Eelco's test in Avro).

Maybe I got carried away a bit, but I was looking at 300,000
objects/sec (a bit over 12 million records in ~42 seconds), all
populated through reflection, and was pretty happy with that.

> This would depend a lot on the objects themselves, the schema, and generic vs. specific, etc.

Sure. It was late in the evening, and I just wanted to share that
Avro seems to work great for my purposes, rather than bring any
serious analysis to the table.

Cheers,

Eelco

Re: user experience

Posted by Eelco Hillenius <ee...@gmail.com>.
> I've found that generating classes with the specific API is both simpler and
> faster.  In particular, if you have a set of related classes, use a
> method-free protocol file (.avpr) to define them.  The Java classes are
> generated by an Ant task.

The reason we don't want to go for the specific API is that we like
our development process to be as minimal as possible, and that
includes avoiding code generation where we can. We're fine with
paying a performance penalty for that, especially since for our
purposes Avro with reflection is definitely fast (and compact)
enough. And the nice thing about Avro is that if certain records turn
out to be bottlenecks, we can always turn them into specific ones.

My 2c,

Eelco

Re: user experience

Posted by Doug Cutting <cu...@apache.org>.
Scott Carey wrote:
> Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
> objects in ~40 secs).  For comparison, creating tuple objects from a Hadoop
> SequenceFile is ~5x faster.  Granted I'm comparing apples to oranges (my
> objects in SequenceFile to Eelco's test in Avro).
> 
> This would depend a lot on the objects themselves, the schema, and
> generic vs. specific, etc.

FWIW, in microbenchmarks, accessing fields via reflection is around 100x 
slower than normal field access!  That makes the reflect API generally 
much slower than generic and specific.
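
A toy illustration of that gap (not the benchmark behind the 100x
figure, just a self-contained sketch comparing direct and reflective
reads of one field):

import java.lang.reflect.Field;

public class FieldAccessBench {
  public long value;

  public static void main(String[] args) throws Exception {
    FieldAccessBench o = new FieldAccessBench();
    o.value = 1;
    Field f = FieldAccessBench.class.getField("value");

    long sum = 0;
    long start = System.nanoTime();
    for (int i = 0; i < 10000000; i++) sum += o.value;  // direct access
    long direct = System.nanoTime() - start;

    start = System.nanoTime();
    for (int i = 0; i < 10000000; i++) sum += f.getLong(o);  // reflective
    long reflective = System.nanoTime() - start;

    // Print sum so the JIT can't discard the loops entirely.
    System.out.println("direct=" + direct + "ns, reflective="
        + reflective + "ns, sum=" + sum);
  }
}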

Reflect is also a bit tricky to use, since you need to define classes 
whose fields Avro knows how to serialize: the reflect API cannot infer 
an Avro schema for every Java class, but rather only for a stylized 
subset of classes (which needs to be better documented, AVRO-35).

I've found that generating classes with the specific API is both simpler 
and faster.  In particular, if you have a set of related classes, use a 
method-free protocol file (.avpr) to define them.  The Java classes are 
generated by an Ant task.  For example, see the patch I attached to the 
following issue:

https://issues.apache.org/jira/browse/MAPREDUCE-157

The "schemata" Ant target generates a file under build/src named 
Events.java that contains nested classes for each type defined in 
Events.avpr. (That target would better be named "generate-avro-classes".)
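
For illustration, a method-free protocol file along those lines might 
look like this (names invented for the example; the real Events.avpr 
is in the patch above):

{"protocol": "Events",
 "namespace": "org.example.events",
 "types": [
     {"name": "SiteNodeDeletedEvent", "type": "record",
      "fields": [
          {"name": "timeStamp",  "type": "long"},
          {"name": "userId",     "type": "long"},
          {"name": "siteNodeId", "type": "long"}
      ]}
 ],
 "messages": {}
}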

Note that specific's generated code does not currently have constructors 
or accessor methods.  Instead all fields are public, so, to build an 
instance you create it with something like 'Foo foo = new Foo();' then 
set all its fields with statements like 'foo.a = ...;'.  If this proves too 
cumbersome, we could generate a constructor that includes all fields.  I 
don't see a big need for accessor methods: a public setter and getter 
pair is equivalent to a public field.  The only advantage accessors would add is 
if you might someday wish to replace the class with a non-Avro-generated 
implementation, change the fields, keep the accessor methods and 
serialize it manually or with reflection.  This does not seem like a 
likely scenario to me, and it's nice to keep the generated code small.
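
Concretely, with classes generated from something like the invented 
Events.avpr above, building an instance would look roughly like:

// Hypothetical generated nested class: public fields, no-arg constructor.
Events.SiteNodeDeletedEvent event = new Events.SiteNodeDeletedEvent();
event.timeStamp = System.currentTimeMillis();
event.userId = 42L;
event.siteNodeId = 7L;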

The primary downside of using the specific API is that you can't add 
extra methods, etc. to the generated classes.  You need to treat them 
just as dumb structs, and keep all application logic external to them. 
In practice I don't think this is a big limitation, however.

Doug

Re: user experience

Posted by Scott Carey <sc...@richrelevance.com>.


On 8/27/09 8:42 AM, "Doug Cutting" <cu...@apache.org> wrote:

> Eelco Hillenius wrote:
>>> reading the records from a local MySQL database and instantiating the
>>> event objects, in 4.5 minutes on my MBP. Reading in and instantiating
>>> those events from the log files again takes 1.3 minutes.
>> 
>> Last time I'll bug you guys with this, but after some optimization on
>> my part, I cut it back to 2.6 minutes write and 42 seconds read time.
> 
> Thanks for this data!
> 
> It would be interesting to see how much using generic or specific
> representations would change these times.
> 
> Doug
> 

It would definitely be nice to set up some tests to compare various usage
patterns of the API.  Comparing to things like ProtocolBuffers and Thrift is
useful, but perhaps more interesting is comparing to SequenceFile or other
core Hadoop formats.

Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
objects in ~40 secs).  For comparison, creating tuple objects from a Hadoop
SequenceFile is ~5x faster.  Granted I'm comparing apples to oranges (my
objects in SequenceFile to Eelco's test in Avro).

This would depend a lot on the objects themselves, the schema, and
generic vs. specific, etc.


Re: user experience

Posted by Doug Cutting <cu...@apache.org>.
Eelco Hillenius wrote:
>> reading the records from a local MySQL database and instantiating the
>> event objects, in 4.5 minutes on my MBP. Reading in and instantiating
>> those events from the log files again takes 1.3 minutes.
> 
> Last time I'll bug you guys with this, but after some optimization on
> my part, I cut it back to 2.6 minutes write and 42 seconds read time.

Thanks for this data!

It would be interesting to see how much using generic or specific 
representations would change these times.

Doug

Re: user experience

Posted by Eelco Hillenius <ee...@gmail.com>.
> reading the records from a local MySQL database and instantiating the
> event objects, in 4.5 minutes on my MBP. Reading in and instantiating
> those events from the log files again takes 1.3 minutes.

Last time I'll bug you guys with this, but after some optimization on
my part, I cut it back to 2.6 minutes write and 42 seconds read time.

Eelco