Posted to user@avro.apache.org by Mark <st...@gmail.com> on 2013/08/20 18:19:23 UTC

Schema Registry?

Can someone break down how message serialization would work with Avro and a schema registry? We are planning to use Avro with Kafka, and I've read that instead of adding a schema to every single event it would be wise to add some sort of fingerprint to each message to identify which schema should be used. What I'm having trouble understanding is: how do we read the fingerprint without a schema? Don't we need the schema to deserialize? The same question goes for working with Hadoop: how does the input format know which schema to use?

Thanks

Re: Schema Registry?

Posted by Eric Wasserman <ew...@247-inc.com>.
Yes, we have a Kafka event consumer that creates the files in HDFS. There are other non-Hadoop consumers as well.
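
A rough sketch of what such a bridge consumer might look like, assuming the MD5-prefixed message format Eric describes later in this thread and the Avro 1.7 Java generic API. The SchemaRegistryClient and the message iterable are hypothetical stand-ins, not his actual code:

    import java.io.File;
    import java.util.Arrays;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public class KafkaToAvroFile {
        static final int MD5_LENGTH = 16;

        // Consumes raw Kafka message payloads and lands them in an Avro
        // container file (which you would then move or stream into HDFS).
        static void drain(Iterable<byte[]> messages, Schema fileSchema,
                          SchemaRegistryClient registry, File out) throws Exception {
            DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(fileSchema));
            writer.create(fileSchema, out); // schema goes into the file header once
            for (byte[] message : messages) {
                byte[] md5 = Arrays.copyOfRange(message, 0, MD5_LENGTH);
                Schema writerSchema = registry.lookup(md5); // hypothetical lookup
                BinaryDecoder decoder = DecoderFactory.get()
                    .binaryDecoder(message, MD5_LENGTH, message.length - MD5_LENGTH, null);
                // Resolve each message's writer schema against the file's schema.
                GenericRecord record = new GenericDatumReader<GenericRecord>(
                    writerSchema, fileSchema).read(null, decoder);
                writer.append(record);
            }
            writer.close();
        }
    }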

On Aug 21, 2013, at 2:23 PM, "Mark" <st...@gmail.com> wrote:

> Some final questions.
> 
> Since there is no need for the schema in each Kafka event, do you just output the message without the container file format (file header, metadata, sync markers)? If so, how do you get this working with the Kafka Hadoop consumers? Does doing it this way require you to write your own consumer to write to Hadoop?
> 
> Thanks
> 
> On Aug 20, 2013, at 11:01 AM, Eric Wasserman <ew...@247-inc.com> wrote:
> 
>> You may want to check out this Avro feature request: https://issues.apache.org/jira/browse/AVRO-1124
>> which has a lot of nice motivation and usage patterns. Unfortunately, it's not yet resolved.
>> 
>> There are really two broad use cases.
>> 
>> 1) The data are "small" compared to the schema (perhaps because it's a collection or stream of records encoded with different schemas).
>> 2) The data are "big" compared to the schema (very big records, or lots of records that share a schema).
>> 
>> Case (1) is often a candidate for a schema registry. Case (2), not as much.
>> 
>> Examples from my own usage:
>> 
>> For Kafka we include an MD5 digest of the writer's schema with each Message. It is serialized as a concatenation of the fixed-length MD5 and the binary Avro-encoded data. To decode, we read off the MD5, look up the schema, and use it to decode the remainder of the Message.
>> [You could also segregate data written with different schemas into different Kafka topics. By knowing which topic a message came from, you can then arrange a way to look up the writer's schema. That lets you avoid even the cost of including the MD5 in the Messages.]
>> 
>> In either case, consumer code needs to look up the full schema from a "registry" in order to actually decode the Avro-encoded data. The registry serves the full schema that corresponds to the specified MD5 digest.
>> 
>> We use a similar technique for storing MD5-tagged Avro data in "columns" of Cassandra and so on.
>> 
>> Case (2) is pretty well handled by just embedding the full schema itself.
>> 
>> For example, for Hadoop you can just use Avro data files, which include the actual schema in a header. All the records in the file then adhere to that same schema. In this case using a registry to get the writer's schema is not necessary.
>> 
>> Note: As described in the feature request linked above, some people use a schema registry as a way of coordinating schema evolution rather than just as a way of making schema access "economical".
>> 
>> 
>> 
>> On Aug 20, 2013, at 9:19 AM, Mark wrote:
>> 
>>> Can someone break down how message serialization would work with Avro and a schema registry? We are planning to use Avro with Kafka, and I've read that instead of adding a schema to every single event it would be wise to add some sort of fingerprint to each message to identify which schema should be used. What I'm having trouble understanding is: how do we read the fingerprint without a schema? Don't we need the schema to deserialize? The same question goes for working with Hadoop: how does the input format know which schema to use?
>>> 
>>> Thanks
>> 
>> 
> 
> 


Re: Schema Registry?

Posted by Mark <st...@gmail.com>.
Some final questions.

Since there is no need for the schema in each Kafka event, do you just output the message without the container file format (file header, metadata, sync markers)? If so, how do you get this working with the Kafka Hadoop consumers? Does doing it this way require you to write your own consumer to write to Hadoop?

Thanks

On Aug 20, 2013, at 11:01 AM, Eric Wasserman <ew...@247-inc.com> wrote:

> You may want to check out this Avro feature request: https://issues.apache.org/jira/browse/AVRO-1124
> which has a lot of nice motivation and usage patterns. Unfortunately, it's not yet resolved.
> 
> There are really two broad use cases.
> 
> 1) The data are "small" compared to the schema (perhaps because it's a collection or stream of records encoded with different schemas).
> 2) The data are "big" compared to the schema (very big records, or lots of records that share a schema).
> 
> Case (1) is often a candidate for a schema registry. Case (2), not as much.
> 
> Examples from my own usage:
> 
> For Kafka we include an MD5 digest of the writer's schema with each Message. It is serialized as a concatenation of the fixed-length MD5 and the binary Avro-encoded data. To decode, we read off the MD5, look up the schema, and use it to decode the remainder of the Message.
> [You could also segregate data written with different schemas into different Kafka topics. By knowing which topic a message came from, you can then arrange a way to look up the writer's schema. That lets you avoid even the cost of including the MD5 in the Messages.]
> 
> In either case, consumer code needs to look up the full schema from a "registry" in order to actually decode the Avro-encoded data. The registry serves the full schema that corresponds to the specified MD5 digest.
> 
> We use a similar technique for storing MD5-tagged Avro data in "columns" of Cassandra and so on.
> 
> Case (2) is pretty well handled by just embedding the full schema itself.
> 
> For example, for Hadoop you can just use Avro data files, which include the actual schema in a header. All the records in the file then adhere to that same schema. In this case using a registry to get the writer's schema is not necessary.
> 
> Note: As described in the feature request linked above, some people use a schema registry as a way of coordinating schema evolution rather than just as a way of making schema access "economical".
> 
> 
> 
> On Aug 20, 2013, at 9:19 AM, Mark wrote:
> 
>> Can someone break down how message serialization would work with Avro and a schema registry? We are planning to use Avro with Kafka, and I've read that instead of adding a schema to every single event it would be wise to add some sort of fingerprint to each message to identify which schema should be used. What I'm having trouble understanding is: how do we read the fingerprint without a schema? Don't we need the schema to deserialize? The same question goes for working with Hadoop: how does the input format know which schema to use?
>> 
>> Thanks
> 
> 


Re: Schema Registry?

Posted by Eric Wasserman <ew...@247-inc.com>.
That is correct.

On Aug 20, 2013, at 11:35 AM, Mark wrote:

> Great response. From your last response, I understand that you are actually sending a wrapped Avro message to Kafka, and all of your consumers know how to decode this wrapped message into two parts: a unique identifier (the MD5) and the actual Avro message. Is that correct? If so, that answers question #1.
> 



Re: Schema Registry?

Posted by Mark <st...@gmail.com>.
Great response. From your last response, I understand that you are actually sending a wrapped Avro message to Kafka, and all of your consumers know how to decode this wrapped message into two parts: a unique identifier (the MD5) and the actual Avro message. Is that correct? If so, that answers question #1.



On Aug 20, 2013, at 11:01 AM, Eric Wasserman <ew...@247-inc.com> wrote:

> You may want to check out this Avro feature request: https://issues.apache.org/jira/browse/AVRO-1124
> which has a lot of nice motivation and usage patterns. Unfortunately, it's not yet resolved.
> 
> There are really two broad use cases.
> 
> 1) The data are "small" compared to the schema (perhaps because it's a collection or stream of records encoded with different schemas).
> 2) The data are "big" compared to the schema (very big records, or lots of records that share a schema).
> 
> Case (1) is often a candidate for a schema registry. Case (2), not as much.
> 
> Examples from my own usage:
> 
> For Kafka we include an MD5 digest of the writer's schema with each Message. It is serialized as a concatenation of the fixed-length MD5 and the binary Avro-encoded data. To decode, we read off the MD5, look up the schema, and use it to decode the remainder of the Message.
> [You could also segregate data written with different schemas into different Kafka topics. By knowing which topic a message came from, you can then arrange a way to look up the writer's schema. That lets you avoid even the cost of including the MD5 in the Messages.]
> 
> In either case, consumer code needs to look up the full schema from a "registry" in order to actually decode the Avro-encoded data. The registry serves the full schema that corresponds to the specified MD5 digest.
> 
> We use a similar technique for storing MD5-tagged Avro data in "columns" of Cassandra and so on.
> 
> Case (2) is pretty well handled by just embedding the full schema itself.
> 
> For example, for Hadoop you can just use Avro data files, which include the actual schema in a header. All the records in the file then adhere to that same schema. In this case using a registry to get the writer's schema is not necessary.
> 
> Note: As described in the feature request linked above, some people use a schema registry as a way of coordinating schema evolution rather than just as a way of making schema access "economical".
> 
> 
> 
> On Aug 20, 2013, at 9:19 AM, Mark wrote:
> 
>> Can someone break down how message serialization would work with Avro and a schema registry? We are planning to use Avro with Kafka, and I've read that instead of adding a schema to every single event it would be wise to add some sort of fingerprint to each message to identify which schema should be used. What I'm having trouble understanding is: how do we read the fingerprint without a schema? Don't we need the schema to deserialize? The same question goes for working with Hadoop: how does the input format know which schema to use?
>> 
>> Thanks
> 
> 


Re: Schema Registry?

Posted by Eric Wasserman <ew...@247-inc.com>.
You may want to check out this Avro feature request: https://issues.apache.org/jira/browse/AVRO-1124
which has a lot of nice motivation and usage patterns. Unfortunately, it's not yet resolved.

There are really two broad use cases. 

1) The data are "small" compared to the schema (perhaps because it's a collection or stream of records encoded with different schemas).
2) The data are "big" compared to the schema (very big records, or lots of records that share a schema).

Case (1) is often a candidate for a schema registry. Case (2), not as much.

Examples from my own usage:

For Kafka we include an MD5 digest of the writer's schema with each Message. It is serialized as a concatenation of the fixed-length MD5 and the binary Avro-encoded data. To decode, we read off the MD5, look up the schema, and use it to decode the remainder of the Message.
[You could also segregate data written with different schemas into different Kafka topics. By knowing which topic a message came from, you can then arrange a way to look up the writer's schema. That lets you avoid even the cost of including the MD5 in the Messages.]
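
To make that wire format concrete, here is a minimal sketch in Java using the Avro 1.7 generic API. The digest is computed over the schema's JSON text purely for illustration (the exact bytes Eric fingerprints aren't specified; Avro's SchemaNormalization class is another option for canonical fingerprints), and SchemaRegistryClient is a hypothetical lookup service, not a real library class:

    import java.io.ByteArrayOutputStream;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class Md5WireFormat {
        static final int MD5_LENGTH = 16;

        // Producer side: [16-byte MD5 of writer schema][binary Avro body]
        static byte[] encode(GenericRecord record, Schema schema) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(schema.toString().getBytes("UTF-8"));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(md5);
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();
            return out.toByteArray();
        }

        // Consumer side: read off the MD5, fetch the schema, decode the rest.
        static GenericRecord decode(byte[] message, SchemaRegistryClient registry)
                throws Exception {
            byte[] md5 = Arrays.copyOfRange(message, 0, MD5_LENGTH);
            Schema writerSchema = registry.lookup(md5); // hypothetical registry call
            BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(message, MD5_LENGTH, message.length - MD5_LENGTH, null);
            return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
        }
    }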

In either case, consumer code needs to look up the full schema from a "registry" in order to actually decode the Avro-encoded data. The registry serves the full schema that corresponds to the specified MD5 digest.
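
The registry itself can be as simple as a service that maps a digest to a schema's JSON text, fronted by a local cache so consumers don't make a network call per message; since a digest permanently identifies one schema, cache entries never expire. A hypothetical sketch, with the transport deliberately left as a stub:

    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.avro.Schema;

    // Hypothetical client: resolves an MD5 digest to a full writer's schema.
    public class SchemaRegistryClient {
        private final ConcurrentHashMap<String, Schema> cache =
            new ConcurrentHashMap<String, Schema>();

        public Schema lookup(byte[] md5) {
            String key = toHex(md5);
            Schema cached = cache.get(key);
            if (cached != null) return cached;
            // fetchSchemaJson is a stand-in for whatever transport you use
            // (an HTTP GET, a database read, a file in HDFS, ...).
            Schema schema = new Schema.Parser().parse(fetchSchemaJson(key));
            cache.put(key, schema);
            return schema;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        private String fetchSchemaJson(String hexDigest) {
            throw new UnsupportedOperationException("wire up to your registry service");
        }
    }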

We use a similar technique for storing MD5-tagged Avro data in "columns" of Cassandra and so on.

Case (2) is pretty well handled by just embedding the full schema itself.

For example, for Hadoop you can just use Avro data files, which include the actual schema in a header. All the records in the file then adhere to that same schema. In this case using a registry to get the writer's schema is not necessary.
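
A sketch of that container-file round trip with the Avro 1.7 Java API; the point is that the reader never needs a registry, because the writer's schema comes straight out of the file header:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class DataFileExample {
        static void write(Schema schema, Iterable<GenericRecord> records, File file)
                throws Exception {
            DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, file); // schema is written once, into the header
            for (GenericRecord r : records) writer.append(r);
            writer.close();
        }

        static void read(File file) throws Exception {
            DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                file, new GenericDatumReader<GenericRecord>());
            Schema writerSchema = reader.getSchema(); // recovered from the header
            while (reader.hasNext()) {
                GenericRecord record = reader.next();
                // process record...
            }
            reader.close();
        }
    }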

Note: As described in the feature request linked above, some people use a schema registry as a way of coordinating schema evolution rather than just as a way of making schema access "economical".



On Aug 20, 2013, at 9:19 AM, Mark wrote:

> Can someone break down how message serialization would work with Avro and a schema registry? We are planning to use Avro with Kafka, and I've read that instead of adding a schema to every single event it would be wise to add some sort of fingerprint to each message to identify which schema should be used. What I'm having trouble understanding is: how do we read the fingerprint without a schema? Don't we need the schema to deserialize? The same question goes for working with Hadoop: how does the input format know which schema to use?
> 
> Thanks