Posted to dev@avro.apache.org by "Niels Basjes (JIRA)" <ji...@apache.org> on 2016/03/10 16:10:40 UTC
[jira] [Commented] (AVRO-1704) Standardized format for encoding
messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189402#comment-15189402 ]
Niels Basjes commented on AVRO-1704:
------------------------------------
I've been looking into what kind of solution would work here, since I'm working on a project where we need data structures to go into Kafka and be available to multiple consumers.
The fundamental problem we need to solve is that of "Schema Evolution" in a streaming environment (let's assume Kafka, with its built-in persistence of records).
We need three things to make this happen:
# A way to recognize that a 'blob' is a serialized AVRO record.
#* We can simply assume it is always an AVRO record.
#* I think we should simply let such a record start with "AVRO" to ensure we can cleanly catch problems like STORM-512 (summary: timer ticks were written into Kafka, which caused a lot of deserialization errors when reading the AVRO records).
# A way to determine the schema this was written with.
#* As indicated above, I vote for using the CRC-64-AVRO fingerprint.
#** I noticed that a simple typo fix in the documentation of a Schema causes a new fingerprint to be generated.
#** Proposal: I think we should 'clean' the schema before calculating the fingerprint, i.e. remove the things that do not impact the binary form of the record (like the doc field).
# Have a place where we can find the schemas using the fingerprint as the key.
#* Here I think (looking at AVRO-1124 and the fact that there are ready-to-run implementations like this [Schema Registry|http://docs.confluent.io/current/schema-registry/docs/index.html]) we should limit what we keep inside Avro to something like a "SchemaFactory" interface (as the storage/retrieval interface to get a Schema) and a very basic implementation that simply reads the available schemas from a (set of) property file(s). Using this, others can write additional implementations that read from and write to things like databases or the above-mentioned Schema Registry.
So, to summarize, my proposal for the standard {{Single record serialization format}} can be written as:
{code}"AVRO"<CRC-64-AVRO(Normalized Schema)><regular binary form of the actual record>{code}
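A rough Python sketch of how a writer and reader could handle this layout is shown below. Several details are assumptions, since the proposal does not fix them: the fingerprint is written as 8 bytes in little-endian order, the function names {{frame}} and {{unframe}} are hypothetical, the schema lookup is a plain dict standing in for the "SchemaFactory", and the record body is treated as opaque bytes rather than decoded with an Avro library.

```python
import struct

MAGIC = b"AVRO"            # marker proposed in the format above
HEADER_LEN = len(MAGIC) + 8  # magic plus 64-bit fingerprint


def frame(fingerprint: int, payload: bytes) -> bytes:
    """Prefix an already-encoded record with the magic marker and fingerprint."""
    # "<Q" = unsigned 64-bit integer, little-endian (byte order is an assumption).
    return MAGIC + struct.pack("<Q", fingerprint) + payload


def unframe(blob: bytes, schemas: dict):
    """Return (writer schema, record bytes) or raise if the blob is not recognized.

    `schemas` maps fingerprint -> schema text; it stands in for a schema store.
    """
    if len(blob) < HEADER_LEN or not blob.startswith(MAGIC):
        raise ValueError("not an AVRO-framed record")
    (fingerprint,) = struct.unpack_from("<Q", blob, len(MAGIC))
    if fingerprint not in schemas:
        raise LookupError("unknown schema fingerprint: %016x" % fingerprint)
    return schemas[fingerprint], blob[HEADER_LEN:]
```

A stray non-Avro message, such as the timer ticks from STORM-512, is then rejected at the magic check with a clear error instead of surfacing as a deserialization failure further down.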
[~rdblue], I'm seeking feedback from you guys on this proposal.
> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)