Posted to dev@avro.apache.org by Eric Evans <ee...@rackspace.com> on 2010/11/27 19:59:02 UTC

(de)serialization comparisons

There is some work being done to compare the performance of Thrift,
Avro, msgpack, and Cassandra's own hand-rolled message serialization.

https://issues.apache.org/jira/browse/CASSANDRA-1765

Any input from the folks on this list would be appreciated.

In particular, I'm curious about the erratic serialization times I'm
seeing from Avro, min vs. max across a series of test runs (see my
comment on CASSANDRA-1765 from 27/Nov/10 01:39 PM).

-- 
Eric Evans
eevans@rackspace.com


Re: (de)serialization comparisons

Posted by Scott Carey <sc...@richrelevance.com>.
I suspect that Cassandra has some needs similar to HAvroBase:

(http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/).

Note, much of that project can apply to other key/value stores; it's fairly abstracted from HBase in the key components, like schema management.

-Scott

On Nov 27, 2010, at 10:59 AM, Eric Evans wrote:

> 
> There is some work being done to compare the performance of Thrift,
> Avro, msgpack, and Cassandra's own hand-rolled message serialization.
> 
> https://issues.apache.org/jira/browse/CASSANDRA-1765
> 
> Any input from the folks on this list would be appreciated.
> 
> In particular, I'm curious about the erratic serialization times I'm
> seeing from Avro, min vs. max across a series of test runs (see my
> comment on CASSANDRA-1765 from 27/Nov/10 01:39 PM).
> 
> -- 
> Eric Evans
> eevans@rackspace.com
> 


Re: (de)serialization comparisons

Posted by Scott Carey <sc...@richrelevance.com>.
Well, the code looks very inefficient and is not a good test of how Avro is recommended to be used.

My hunch is that the test is not measuring serialization/deserialization; it's measuring the cost of instantiating and initializing various objects.  Many of these are very expensive, since they parse the string schema.  Much like a regular expression in Java, where you want to compile the expression once and run it many times, in Avro you want to initialize the DatumWriters and DatumReaders for a Schema once and use them many times.

In the benchmark I recommend using System.nanoTime(), since currentTimeMillis() is not accurate (it's essentially +/- 7ms on most OSes).
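
A minimal timing harness along these lines, for example ('iterations' and doSerializationPass() are placeholders for whatever is actually being measured):

// warm up first, so the JIT has compiled the hot path before anything is timed
for (int i = 0; i < 10000; i++)
  doSerializationPass(); // placeholder for the serialization code under test

long start = System.nanoTime();
for (int i = 0; i < iterations; i++)
  doSerializationPass();
long elapsed = System.nanoTime() - start;
System.out.println("avg per pass: " + (elapsed / 1e6 / iterations) + " ms");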

The encoding loop in the patch looks like:

+        for (int j = 0; j < mainLoop; ++j) {
+        	SpecificDatumWriter writer = new SpecificDatumWriter(ARangeSliceCommand.SCHEMA$);
+        	ByteArrayOutputStream bout = new ByteArrayOutputStream();
+        	BinaryEncoder enc = new BinaryEncoder(bout);
+        	for (ARangeSliceCommand cmd : acmds)
+        	{
+        		enc.writeString(new Utf8(cmd.getSchema().toString()));
+        		writer.write(cmd, enc);
+        	}
+            bout.close();
+            buf = bout.toByteArray();
+            writer = null;
+            bout = null;
+            enc = null;


Honestly, without looking too far, I think this is mostly benchmarking the toString() method of Schema.  I think all of the objects in acmds are the same, with the same schema, so that whole Utf8 instantiation can move out of the loop and be done once.  Furthermore, why is the schema being written for each record?  Avro is designed to avoid that: store it once somewhere and then write multiple records with that schema.
The patch doesn't make it easy to see how big 'mainLoop' is compared to the size of 'acmds'; if it is large and 'acmds' is small, then DatumWriter creation is measured heavily too.
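
A corrected encoding loop would look more like this (a sketch, assuming -- as the patch appears to -- that every command in acmds shares the one generated schema):

SpecificDatumWriter<ARangeSliceCommand> writer =
    new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$); // created once, outside the timed loop
Utf8 schemaString = new Utf8(ARangeSliceCommand.SCHEMA$.toString()); // built once, not once per record

byte[] buf = null;
for (int j = 0; j < mainLoop; ++j) {
  ByteArrayOutputStream bout = new ByteArrayOutputStream();
  BinaryEncoder enc = new BinaryEncoder(bout);
  enc.writeString(schemaString); // schema stored once per buffer, not once per record
  for (ARangeSliceCommand cmd : acmds)
    writer.write(cmd, enc);
  bout.close();
  buf = bout.toByteArray();
}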

You need to cache the SpecificDatumWriter and SpecificDatumReader and re-use them, one per schema.  These are meant to be re-used and kept per record type.
Caching the BinaryEncoder and BinaryDecoder will help too, if 'mainLoop' is not small.  "new BinaryDecoder()" is deprecated; use the DecoderFactory, and ideally re-use the instance when possible.
A ThreadLocal WeakHashMap (or WeakIdentityHashMap) of Class >> SpecificDatumWriter would work if you are concerned about thread safety or object lifetimes.
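
A sketch of that cache, for the writer side (the reader variant is analogous; the helper name writerFor() is just illustrative):

private static final ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>> WRITERS =
    new ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>>() {
      @Override protected Map<Class<?>, SpecificDatumWriter<?>> initialValue() {
        return new WeakHashMap<Class<?>, SpecificDatumWriter<?>>(); // weak keys let classes unload normally
      }
    };

@SuppressWarnings("unchecked")
static <T extends SpecificRecord> SpecificDatumWriter<T> writerFor(Class<T> c) {
  Map<Class<?>, SpecificDatumWriter<?>> cache = WRITERS.get(); // per-thread, so no locking needed
  SpecificDatumWriter<T> w = (SpecificDatumWriter<T>) cache.get(c);
  if (w == null) {
    w = new SpecificDatumWriter<T>(SpecificData.get().getSchema(c)); // parses the schema once per class
    cache.put(c, w);
  }
  return w;
}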

The decoder loop is:
+        for (int j = 0; j < mainLoop; ++j) {
+        	BinaryDecoder dec = new BinaryDecoder(new ByteArrayInputStream(buf));
+        	SpecificDatumReader<ARangeSliceCommand> reader = new SpecificDatumReader<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
+        	for (int i = 0; i < acmds.size(); i++)
+        	{
+        		reader.setSchema(Schema.parse(dec.readString(new Utf8()).toString()));
+        		reader.read(new ARangeSliceCommand(), dec);
+        	}

First of all, dec.readString() happily takes null; either pass in a Utf8 you intend to reuse (one created outside of the loop, for example) or pass in null.
Next, this test is testing Schema.parse() performance.  That should be outside the loop.  The schema of the written data should be kept once, not stored once per record.  Storing the schema with every serialized record is an Avro anti-pattern: if you are doing that, you are either using Avro wrong or have a use case that is not suitable for Avro at all.  Avro has a verbose JSON schema and a very compact binary record representation.  When a record type is serialized repeatedly, the schema need only be persisted once, and the records with unique data persisted compactly.  There are many ways to do this, and they differ depending on the use case.

Think about an RDBMS: the equivalent of a table definition is not stored with each row, and storing an Avro schema with each Avro record is like storing a table definition with each row.  If you have changing schemas over time, you need to store only one schema per change somewhere, and have a mechanism to know which groups of records have which schemas.  The canonical Avro example is the Avro Data File -- all records in it have the same schema, and the schema is stored once in the header of the file.
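
For example, writing records to an Avro Data File stores the schema exactly once, in the header (a sketch; the file name is arbitrary):

DatumWriter<ARangeSliceCommand> dw =
    new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
DataFileWriter<ARangeSliceCommand> out = new DataFileWriter<ARangeSliceCommand>(dw);
out.create(ARangeSliceCommand.SCHEMA$, new File("commands.avro")); // schema written once, in the file header
for (ARangeSliceCommand cmd : acmds)
  out.append(cmd); // each record is only its compact binary body
out.close();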

The decoder bit should look more like:

SpecificDatumReader<ARangeSliceCommand> reader =
    threadLocalReaders.get(ARangeSliceCommand.class); // a Class >> SpecificDatumReader mapping
ARangeSliceCommand record = new ARangeSliceCommand();
BinaryDecoder dec = null;
Schema written = null;
for (int j = 0; j < mainLoop; ++j) {
  dec = DecoderFactory.defaultFactory().createBinaryDecoder(buf, dec); // re-uses the decoder instance on each pass
  String schemaJson = dec.readString(null).toString(); // schema stored once, at the head of the buffer
  if (written == null) {
    written = Schema.parse(schemaJson); // parsed once, not once per record
    reader.setSchema(written);
  }
  for (int i = 0; i < acmds.size(); i++) {
    record = reader.read(record, dec); // re-using the record is faster; pass null if there is nothing to reuse
  }
}


And that would be very fast, and would additionally keep very few bytes per record on average, assuming mainLoop*acmds.size() is large (at least a thousand).


On Nov 27, 2010, at 10:59 AM, Eric Evans wrote:

> https://issues.apache.org/jira/browse/CASSANDRA-1765