Posted to dev@avro.apache.org by "Niels Basjes (JIRA)" <ji...@apache.org> on 2016/03/10 16:10:40 UTC

[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

    [ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189402#comment-15189402 ] 

Niels Basjes commented on AVRO-1704:
------------------------------------

I've been looking into what kind of solution would work here, since I'm working on a project where we need data structures to go into Kafka and be available to multiple consumers.

The fundamental problem we need to solve is that of "Schema Evolution" in a streaming environment (let's assume Kafka with its built-in persistence of records).
We need three things to make this happen:
# A way to recognize that a 'blob' is a serialized AVRO record.
#* We could simply assume it is always an AVRO record.
#* I think we should instead let such a record start with "AVRO", to ensure we can cleanly catch problems like STORM-512 (summary: timer ticks were written into Kafka, which caused a lot of deserialization errors when reading the AVRO records).
# A way to determine the schema this was written with.
#* As indicated above, I vote for using the CRC-64-AVRO fingerprint.
#** I noticed that a simple typo fix in the documentation of a Schema causes a new fingerprint to be generated.
#** Proposal: I think we should 'clean' the schema before calculating the fingerprint, i.e. remove the things that do not impact the binary form of the record (like the doc field).
# A place where we can find the schemas, using the fingerprint as the key.
#* Here, looking at AVRO-1124 and the fact that there are ready-to-run implementations like this [Schema Registry|http://docs.confluent.io/current/schema-registry/docs/index.html], I think we should limit what we keep inside Avro to something like a "SchemaFactory" interface (the storage/retrieval interface to get a Schema) and a very basic implementation that simply reads the available schemas from a (set of) property file(s). Using this, others can write additional implementations that read/write to things like databases or the above-mentioned Schema Registry. A sketch of what I mean follows right after this list.
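
Roughly like this ({{SchemaFactory}} and {{PropertiesSchemaFactory}} are placeholder names, and in practice these would be separate files):
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

/** The storage/retrieval interface used to get a Schema by fingerprint. */
interface SchemaFactory {
    /** @return the schema registered under this CRC-64-AVRO fingerprint, or null if unknown. */
    Schema getSchema(long fingerprint);
}

/** Very basic implementation: loads schemas (as JSON values) from a property file. */
public class PropertiesSchemaFactory implements SchemaFactory {
    private final Map<Long, Schema> schemas = new HashMap<>();

    public PropertiesSchemaFactory(InputStream propertyFile) throws IOException {
        Properties props = new Properties();
        props.load(propertyFile);
        for (String name : props.stringPropertyNames()) {
            Schema schema = new Schema.Parser().parse(props.getProperty(name));
            // Index by the fingerprint of the normalized schema (see point 2).
            schemas.put(SchemaNormalization.parsingFingerprint64(schema), schema);
        }
    }

    @Override
    public Schema getSchema(long fingerprint) {
        return schemas.get(fingerprint);
    }
}
{code}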

So, to summarize: my proposal for the standard {{Single record serialization format}} can be written as:
{code}"AVRO"<CRC-64-AVRO(Normalized Schema)><regular binary form of the actual record>{code}

[~rdblue], I'm seeking feedback from you guys on this proposal. 


> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.
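
A minimal sketch of the {{MessageWriter}} side described above could look like this (the version and type-id values are invented for illustration, and the optional metadata map is left empty):
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.NoSuchAlgorithmException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

/** Sketch of a MessageWriter emitting the five pieces listed in the description. */
public class MessageWriter {
    private static final byte FORMAT_VERSION = 1;    // piece 1 (value invented here)
    private static final byte FINGERPRINT_RABIN = 1; // piece 2: one id per fingerprint type

    public byte[] write(GenericRecord datum) throws IOException, NoSuchAlgorithmException {
        Schema schema = datum.getSchema();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(FORMAT_VERSION);
        out.write(FINGERPRINT_RABIN);
        out.write(SchemaNormalization.parsingFingerprint("CRC-64-AVRO", schema)); // piece 3
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        enc.writeMapStart();   // piece 4: optional metadata map, empty in this sketch
        enc.setItemCount(0);
        enc.writeMapEnd();
        new GenericDatumWriter<GenericRecord>(schema).write(datum, enc); // piece 5
        enc.flush();
        return out.toByteArray();
    }
}
{code}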


