Posted to user@avro.apache.org by Daniel Schierbeck <da...@gmail.com> on 2015/07/09 10:36:20 UTC

Using Avro for encoding messages

I'm working on a system that will store Avro-encoded messages in Kafka. The
system will have both producers and consumers in different languages,
including Ruby (not JRuby) and Java.

At the moment I'm encoding each message as a data file, which means that
the full schema is included in each encoded message. This is obviously
suboptimal, but it doesn't seem like there's a standardized format for
single-message Avro encodings.

I've reviewed Confluent's schema-registry offering, but that seems to be
overkill for my needs, and would require me to run and maintain yet another
piece of infrastructure. Ideally, I wouldn't have to use anything besides
Kafka.

Is this something that other people have experience with?

I've come up with a scheme that would seem to work well independently of
what kind of infrastructure you're using: whenever a writer process is
asked to encode a message m with schema s for the first time, it broadcasts
(s', s) to a schema registry, where s' is the fingerprint of s. The schema
registry in this case can be pluggable, and can be any mechanism that
allows different processes to access the schemas. The writer then encodes
the message as (s', m), i.e. only includes the schema fingerprint. A
reader, when first encountering a message with a schema fingerprint s',
looks up s from the schema registry and uses s to decode the message.
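
To make that concrete, here is a rough Java sketch of the writer/reader flow,
using Avro's SchemaNormalization and binary encoding. The SchemaRegistry
interface, the class name, and the choice of MD5 (16 bytes) followed directly
by the record body are illustrative assumptions, not an established format:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    // Pluggable registry: any store that maps fingerprint -> schema will do.
    interface SchemaRegistry {
        void register(byte[] fingerprint, Schema schema); // idempotent
        Schema lookup(byte[] fingerprint);
    }

    class FingerprintedMessages {
        private static final int FP_LEN = 16; // MD5 fingerprint length

        // Writer: publish (s', s) once, then emit s' followed by the Avro binary body.
        static byte[] encode(GenericRecord record, Schema schema, SchemaRegistry registry)
                throws IOException, NoSuchAlgorithmException {
            byte[] fp = SchemaNormalization.parsingFingerprint("MD5", schema);
            registry.register(fp, schema);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(fp);
            BinaryEncoder enc = EncoderFactory.get().directBinaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, enc);
            enc.flush();
            return out.toByteArray();
        }

        // Reader: peel off s', resolve s via the registry, then decode the body.
        static GenericRecord decode(byte[] message, SchemaRegistry registry) throws IOException {
            byte[] fp = Arrays.copyOfRange(message, 0, FP_LEN);
            Schema schema = registry.lookup(fp);
            Decoder dec = DecoderFactory.get()
                    .binaryDecoder(message, FP_LEN, message.length - FP_LEN, null);
            return new GenericDatumReader<GenericRecord>(schema).read(null, dec);
        }
    }

A real reader would cache schemas per fingerprint, since the same handful of
schemas will show up on every message.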

Here, the concept of a schema registry has been abstracted away and is not
tied to the concept of "schema ids" and versions. Furthermore, there are
some desirable traits:

1. Schemas are identified by their fingerprints, so there's no need for an
external system to issue schema ids.
2. Writing (s', s) pairs is idempotent, so there's no need to coordinate
that task. If you've got a system with many writers, you can let all of
them broadcast their schemas when they boot or when they need to encode
data using the schemas.
3. It would work with a range of different backends for the schema
registry. Simple key-value stores would obviously work, but for my case I'd
probably want to use Kafka itself: if the schemas are written to a topic
with key-based compaction, where s' is the message key and s is the message
value, then Kafka would automatically clean up duplicates over time (see
the sketch after this list). This would save me from having to add more
pieces to my infrastructure.
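
As a rough sketch of point 3, a writer could publish every (s', s) pair it
uses to a compacted Kafka topic, keyed by fingerprint, so the broker keeps
exactly one copy of each schema. The class and topic names, serializers and
MD5 fingerprint below are illustrative, and the topic would have to be
created with cleanup.policy=compact:

    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SchemaTopicWriter {
        private final KafkaProducer<byte[], String> producer;
        private final String topic; // e.g. "schemas", created with cleanup.policy=compact

        public SchemaTopicWriter(String bootstrapServers, String topic) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("key.serializer", ByteArraySerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            this.producer = new KafkaProducer<>(props);
            this.topic = topic;
        }

        // Re-sending the same (fingerprint, schema) pair just overwrites the
        // previous value for that key; compaction later drops the duplicates.
        public void register(Schema schema) throws Exception {
            byte[] fingerprint = SchemaNormalization.parsingFingerprint("MD5", schema);
            producer.send(new ProducerRecord<byte[], String>(topic, fingerprint, schema.toString()));
        }
    }

A reader would consume the topic from the beginning at startup and build an
in-memory fingerprint-to-schema map from it.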

Has this problem been solved already? If not, would it make sense to define
a common "message format" that defined the structure of (s', m) pairs?

Cheers,
Daniel Schierbeck

Re: Using Avro for encoding messages

Posted by Daniel Schierbeck <da...@gmail.com>.
The Confluent tools seem to be very oriented towards a Java-heavy
infrastructure, and I'd rather not have to re-implement all their somewhat
complex tooling in Ruby and Go. I'd much prefer a simplified model that can
be implemented more easily.
As an aside, Confluent *could* support such a standard by using a custom
"fingerprint type" that's just their id number.


Re: Using Avro for encoding messages

Posted by Svante Karlsson <sv...@csi.se>.
>> What causes the schema normalization to be incomplete?
Bad implementation. I use the C++ Avro library; its normalization is
incomplete, and the library is not very actively developed.

> And is that a problem? As long as the reader can get the schema, it
> shouldn't matter that there are duplicates – as long as the differences
> between the duplicates do not affect decoding.
Not really a problem; we tend to use machine-generated schemas, and they are
always identical.

If I remember correctly, there are holes in the simplification of types:
namespaces should be collapsed, {"type" : "string"} should reduce to
"string", and so on.

The current implementation can't reliably decide whether two types are
identical. If you correct the problem later, a registered schema would
actually change its hash, since it can then be simplified. Whether this is a
problem depends on your application.

We currently encode this as you suggest: <schema_type (byte)><schema_id
(32/128 bits)><avro (binary)>. The binary fields should probably have a
defined endianness as well.
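
A small Java sketch of that framing; the type-byte values, the 32-bit
variant and the big-endian choice below are purely illustrative:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    final class Frame {
        // Hypothetical values for the leading schema_type byte (the "registry namespace").
        static final byte TYPE_REGISTRY_ID_32 = 0;   // 32-bit registry-issued id follows
        static final byte TYPE_MD5_FINGERPRINT = 1;  // 128-bit MD5 fingerprint follows

        // <schema_type (byte)><schema_id (32 bits, big-endian)><avro (binary)>
        static byte[] withRegistryId(int schemaId, byte[] avroBody) {
            ByteBuffer buf = ByteBuffer.allocate(1 + 4 + avroBody.length)
                    .order(ByteOrder.BIG_ENDIAN); // endianness fixed by convention
            buf.put(TYPE_REGISTRY_ID_32).putInt(schemaId).put(avroBody);
            return buf.array();
        }

        // <schema_type (byte)><schema_id (128 bits)><avro (binary)>
        static byte[] withFingerprint(byte[] md5Fingerprint, byte[] avroBody) {
            ByteBuffer buf = ByteBuffer.allocate(1 + md5Fingerprint.length + avroBody.length);
            buf.put(TYPE_MD5_FINGERPRINT).put(md5Fingerprint).put(avroBody);
            return buf.array();
        }
    }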

I agree that a de facto way of encoding this would be nice. Currently I
would say that the Confluent/LinkedIn way is the norm.

Re: Using Avro for encoding messages

Posted by Daniel Schierbeck <da...@gmail.com>.
Thanks for the reply, Svante!

What causes the schema normalization to be incomplete? And is that a
problem? As long as the reader can get the schema, it shouldn't matter that
there are duplicates – as long as the differences between the duplicates do
not affect decoding.

Would it make sense to have a spec for how to encode these messages? Maybe
<<fingerprint_type>> <<fingerprint>> <<data>> ? Maybe leave room for a
metadata map as well?


Re: Using Avro for encoding messages

Posted by Svante Karlsson <sv...@csi.se>.
I had the same problem a while ago, and for the same reasons you mention we
decided to use fingerprints (an MD5 hash of the schema). However, there are
some catches here.

First, I believe the normalisation of the schema is incomplete, so you might
end up with different hashes for the same schema (see the sketch after the
third point).

Second, a 128-bit integer prepended to both keys and values takes more space
than a 32-bit one. Not a big issue for values, but for keys this doubles our
size.

Third, we have already started to use Confluent's registry as well, because
of its existing integration with other pieces of infrastructure (Camus,
Bottled Water, etc.).
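
On the first point: the Java library exposes normalisation as
SchemaNormalization, which fingerprints the Parsing Canonical Form of a
schema rather than its raw JSON. A sketch (whether the C++ implementation
produces the same canonical form is exactly the uncertainty above):

    import java.security.NoSuchAlgorithmException;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;

    class Fingerprints {
        // MD5 over the Parsing Canonical Form (names fully qualified, ignorable
        // attributes such as "doc" stripped), so equivalent spellings hash the same.
        static byte[] md5(Schema schema) throws NoSuchAlgorithmException {
            return SchemaNormalization.parsingFingerprint("MD5", schema); // 16 bytes
        }

        // The canonical text itself, handy for debugging why two schemas differ.
        static String canonicalForm(Schema schema) {
            return SchemaNormalization.toParsingForm(schema);
        }
    }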

What would be useful from this perspective is a byte or two prepended to the
schema id, defining the registry namespace.

I've added the fingerprint schema registry as an example in the C++ Kafka
library at
https://github.com/bitbouncer/csi-kafka/tree/master/examples/schema-registry


We run a couple of those in a Mesos cluster and use HAProxy to find them.


/svante

