Posted to users@kafka.apache.org by Mark <st...@gmail.com> on 2013/08/22 05:15:10 UTC

More questions on avro serialization

Does LinkedIn include the SHA of the schema in the header of each Avro message they write, or do they wrap the Avro message and prepend the SHA?

In either case, how does the Hadoop consumer know what schema to read?

Re: More questions on avro serialization

Posted by Mark <st...@gmail.com>.
… or is the payload of the message prepended with a magic byte followed by the SHA?

On Aug 22, 2013, at 9:49 AM, Mark <st...@gmail.com> wrote:

> Are you referring to the same message class as https://github.com/apache/kafka/blob/0.7/core/src/main/scala/kafka/message/Message.scala, or are you talking about a wrapper around this message class which has its own magic byte followed by the SHA of the schema? If it's the former, I'm confused.
> 
> 
> FYI, it looks like Camus gets a 4-byte identifier from a schema registry.
> 
> https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/coders/KafkaAvroMessageEncoder.java
> 
> 
> On Aug 22, 2013, at 9:37 AM, Neha Narkhede <ne...@gmail.com> wrote:
> 
>> The point of the magic byte is to indicate the current version of the
>> message format. One part of the format is the fact that it is Avro encoded.
>> I'm not sure how Camus gets a 4 byte id, but at LinkedIn we use the 16 byte
>> MD5 hash of the schema. Since AVRO-1124 is not resolved yet, I'm not sure
>> if I can comment on the compatibility just yet.
>> 
>> Thanks,
>> Neha
>> 
>> 
>> On Wed, Aug 21, 2013 at 9:00 PM, Mark <st...@gmail.com> wrote:
>> 
>>> Neha, thanks for the response.
>>> 
>>> So the only point of the magic byte is to indicate that the rest of the
>>> message is Avro encoded? I noticed that in Camus a 4 byte int id of the
>>> schema is written instead of the 16-byte SHA. Is this the new preferred
>>> way? Which one is compatible with
>>> https://issues.apache.org/jira/browse/AVRO-1124?
>>> 
>>> Thanks again
>>> 
>>> On Aug 21, 2013, at 8:38 PM, Neha Narkhede <ne...@gmail.com>
>>> wrote:
>>> 
>>>> We define the LinkedIn Kafka message to have a magic byte (indicating
>>>> Avro serialization) and an MD5 header, followed by the payload. The
>>>> Hadoop consumer reads the MD5, looks up the schema in the repository,
>>>> and deserializes the message.
>>>> 
>>>> Thanks,
>>>> Neha
>>>> 
>>>> 
>>>> On Wed, Aug 21, 2013 at 8:15 PM, Mark <st...@gmail.com> wrote:
>>>> 
>>>>> Does LinkedIn include the SHA of the schema in the header of each Avro
>>>>> message they write, or do they wrap the Avro message and prepend the SHA?
>>>>> 
>>>>> In either case, how does the Hadoop consumer know what schema to read?
>>> 
>>> 
> 

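The framing discussed in the thread above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not LinkedIn's or Camus's actual code: the magic-byte value, the big-endian encoding of the Camus-style 4-byte id, and the dict standing in for a schema repository are all assumptions, and raw bytes stand in for the Avro-encoded payload (a real consumer would feed the payload to Avro's binary decoder with the resolved schema).

```python
import hashlib
import struct

# Format version marker. The thread does not state the actual value
# LinkedIn used, so 0x00 here is an assumption for illustration.
MAGIC_BYTE = 0x00


def schema_md5(schema_json: str) -> bytes:
    """16-byte MD5 fingerprint of the writer's schema JSON."""
    return hashlib.md5(schema_json.encode("utf-8")).digest()


def encode_md5_framed(schema_json: str, avro_payload: bytes) -> bytes:
    """LinkedIn-style framing: [magic byte][16-byte MD5][Avro payload]."""
    return bytes([MAGIC_BYTE]) + schema_md5(schema_json) + avro_payload


def encode_id_framed(schema_id: int, avro_payload: bytes) -> bytes:
    """Camus-style framing: [magic byte][4-byte schema id][Avro payload].

    Big-endian encoding of the id is an assumption of this sketch.
    """
    return bytes([MAGIC_BYTE]) + struct.pack(">i", schema_id) + avro_payload


def decode_md5_framed(message: bytes, registry: dict):
    """Split a framed message and resolve its writer schema.

    `registry` maps the 16-byte MD5 digest to the schema JSON; a real
    Hadoop consumer would query a schema repository service instead of
    a local dict.
    """
    if message[0] != MAGIC_BYTE:
        raise ValueError("unrecognized message format version")
    digest, payload = message[1:17], message[17:]
    return registry[digest], payload
```

A round trip then looks like: register the schema under its MD5, encode, and have the consumer recover both schema and payload from the bytes alone, which is exactly what lets the Hadoop consumer deserialize without out-of-band schema information.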
