Posted to user@avro.apache.org by HILEM Youcef <yo...@laposte.fr> on 2015/12/07 23:15:06 UTC

add a type to a union

Hi,

At La Poste Pôle Colis, we use Avro in our new reactive architecture (Kafka, Spark Streaming, Cassandra, Elasticsearch, Play Framework).

In our data model, we use a union type on the body field to gather all the tracking events of a parcel (arrival, departure, transportation, ...) into a single schema.

Example:
{
  "namespace" : "fr.laposte.colis.schema.pivot.message",
  "name" : "Message",
  "type" : "record",
  "doc" : "This structure defines the basic characteristics of a message. It can (must) be specialized for a particular use",
  "fields" : [
    {
      "name" : "header",
      "type" : "fr.laposte.colis.schema.pivot.common.message.MessageHeader",
      "doc" : "Message header"
    },
    {
      "name" : "body",
      "type" : [
        "fr.laposte.colis.schema.pivot.announcement.AnnouncementEventMessageBody",
        "fr.laposte.colis.schema.pivot.delivery.DeliveryEventMessageBody",
        "fr.laposte.colis.schema.pivot.handling.HandlingEventMessageBody",
        "fr.laposte.colis.schema.pivot.crm.CrmEventMessageBody",
        "fr.laposte.colis.schema.pivot.customs.transport.CustomsTransportMessageBody",
        "fr.laposte.colis.schema.pivot.customs.consignment.CustomsContainerEventMessageBody",
        "fr.laposte.colis.schema.pivot.customs.consignment.CustomsParcelEventMessageBody",
        "fr.laposte.colis.schema.pivot.rest.common.Rest",
        "fr.laposte.colis.schema.pivot.reject.RejectMessageBody",
        "fr.laposte.colis.schema.pivot.dpmo.defectrequest.DefectRequestEventMessageBody",
        "fr.laposte.colis.schema.pivot.dpmo.defectresult.DefectResultEventMessageBody",
        "fr.laposte.colis.schema.timeout.TimeoutMessageBody",
        "fr.laposte.colis.schema.notification.Notification"
      ],
      "doc" : "Abstraction of the message body. Can be substituted by any type derived from MessageBody"
    }
  ]
}

However, as explained at https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html: "Union types are powerful, but you must take care when changing them. If you want to add a type to a union, you first need to update all readers with the new schema, so that they know what to expect. Only once all readers are updated, the writers may start putting this new type in the records they generate"

My question: is a default value for the "body" field sufficient, so that a reader that encounters a union branch it does not know about can substitute the default value (see http://grokbase.com/t/avro/user/11b3bn6r6z/does-extending-union-break-compatibility)?

Thank you in advance for your help.

RE: add a type to a union

Posted by HILEM Youcef <yo...@laposte.fr>.
Thank you, Martin, for your help and advice.
We use the Confluent Schema Registry for Avro schema versioning (http://docs.confluent.io/2.0.0/schema-registry/docs/intro.html).
Currently, our preference is to use the POJOs generated by the Avro compiler.
We will evaluate these different solutions.
A third option would be to extend the Schema Registry with the URL of the generated POJOs for each schema version, and then use the Java class loader mechanism to load the right classes during deserialization. That way, every consumer would deserialize with the correct version.
We would still need to check that the data pipelines of old consumers remain compatible at each schema evolution.
Best regards.
Youcef HILEM


From: Martin Kleppmann [mailto:martin@kleppmann.com]
Sent: Tuesday, 15 December 2015 22:11
To: user@avro.apache.org
Subject: Re: add a type to a union

One approach you could use: instead of a union, make a separate field for every possible type of message, and make every field a union with null (with default value null). Then only fill in the field for the corresponding message type. If you do this, a reader using an old version of the schema will simply see all fields as null (rather than an exception) if it encounters an unknown message type.
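
As a rough illustration of this layout, here is a toy model in plain Python (it does not use the Avro library). The field names delivery, handling and timeout are hypothetical; each stands for a field declared as ["null", SomeMessageBody] with default null. The resolve_record helper only mimics the part of Avro record resolution that matters here: writer fields unknown to the reader are skipped, and reader fields absent from the writer take their default.

```python
from typing import Any, Dict

# Hypothetical field names. Each one stands for an Avro field declared as
# ["null", <SomeMessageBody>] with "default": null.
OLD_READER_FIELDS: Dict[str, Any] = {"delivery": None, "handling": None}
NEW_WRITER_FIELDS: Dict[str, Any] = {"delivery": None, "handling": None, "timeout": None}


def make_message(kind: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """Build a writer-side record with only the field for `kind` filled in."""
    if kind not in NEW_WRITER_FIELDS:
        raise ValueError(f"unknown message kind: {kind}")
    record = dict(NEW_WRITER_FIELDS)  # start with every field set to null
    record[kind] = body
    return record


def resolve_record(written: Dict[str, Any], reader_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Project a written record onto the reader's fields.

    Fields the reader does not declare are dropped; fields the writer did
    not provide fall back to the reader's default.
    """
    return {name: written.get(name, default) for name, default in reader_fields.items()}


known = resolve_record(make_message("delivery", {"parcel": "X1"}), OLD_READER_FIELDS)
unknown = resolve_record(make_message("timeout", {"parcel": "X1"}), OLD_READER_FIELDS)
print(known)    # the delivery field is filled in
print(unknown)  # every field is null, and no exception is raised
```

The trade-off is that the schema no longer guarantees that exactly one field is set, so that check would have to live in application code.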

Another possibility: you can always use the writer schema to decode the data, and use the "generic" (dynamically typed) interface for accessing the data. In that case, schema evolution is handled by the application code.
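
A minimal sketch of what "handled by the application code" could look like, again in plain Python rather than with the Avro library. It assumes the generic decoder hands back the name of the union branch chosen by the writer plus the fields as a dict; the handler functions and the field names parcel and site are hypothetical.

```python
from typing import Any, Callable, Dict, Optional

Handler = Callable[[Dict[str, Any]], str]


def describe_delivery(body: Dict[str, Any]) -> str:
    return f"parcel {body['parcel']} delivered"


def describe_handling(body: Dict[str, Any]) -> str:
    return f"parcel {body['parcel']} handled at {body['site']}"


# Message types this consumer knows how to process, keyed by branch name.
HANDLERS: Dict[str, Handler] = {
    "DeliveryEventMessageBody": describe_delivery,
    "HandlingEventMessageBody": describe_handling,
}


def process(branch_name: str, body: Dict[str, Any]) -> Optional[str]:
    """Dispatch on the writer's branch name; skip types added later."""
    handler = HANDLERS.get(branch_name)
    if handler is None:
        # Decoded with the writer schema, so the data is readable, but this
        # consumer has no logic for it yet: ignore it instead of failing.
        return None
    return handler(body)


print(process("DeliveryEventMessageBody", {"parcel": "X1"}))
print(process("TimeoutMessageBody", {"parcel": "X1"}))
```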

Putting binary Avro blobs in the database is absolutely fine, as long as you attach a schema version to every blob (so that you know the writer schema with which it was encoded). You can keep the schemas in a separate database table.
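
One possible shape for this, sketched with Python's built-in sqlite3 module. The table and column names are assumptions made for the example, and the payload is an opaque placeholder rather than real Avro output; the point is only that every blob carries the version of the schema it was written with, so the writer schema can be looked up before decoding.

```python
import json
import sqlite3
from typing import Any, Dict, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE schemas (
        version     INTEGER PRIMARY KEY,
        schema_json TEXT NOT NULL
    );
    CREATE TABLE events (
        id             INTEGER PRIMARY KEY,
        schema_version INTEGER NOT NULL REFERENCES schemas(version),
        payload        BLOB NOT NULL
    );
    """
)


def register_schema(version: int, schema: Dict[str, Any]) -> None:
    conn.execute(
        "INSERT INTO schemas (version, schema_json) VALUES (?, ?)",
        (version, json.dumps(schema)),
    )


def store_event(schema_version: int, payload: bytes) -> int:
    cur = conn.execute(
        "INSERT INTO events (schema_version, payload) VALUES (?, ?)",
        (schema_version, payload),
    )
    return cur.lastrowid


def load_event(event_id: int) -> Tuple[Dict[str, Any], bytes]:
    """Return the writer schema and the raw blob for one stored event."""
    row = conn.execute(
        "SELECT s.schema_json, e.payload FROM events e "
        "JOIN schemas s ON s.version = e.schema_version WHERE e.id = ?",
        (event_id,),
    ).fetchone()
    if row is None:
        raise KeyError(event_id)
    return json.loads(row[0]), bytes(row[1])


register_schema(1, {"type": "record", "name": "Message", "fields": []})
event_id = store_event(1, b"\x00\x01opaque")  # placeholder bytes, not real Avro
writer_schema, payload = load_event(event_id)
print(writer_schema["name"], payload)
```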

Martin

On 15 Dec 2015, at 16:38, HILEM Youcef <yo...@laposte.fr> wrote:

Hi Martin,

Thank you for your clear answer.
I will test the example you provide.
In that case, storing binary Avro as a blob in a database is strongly discouraged.
It is very difficult, if not impossible, to deserialize all the rows with a single reader.
Best regards.
Youcef.

From: Martin Kleppmann [mailto:martin@kleppmann.com]
Sent: Monday, 14 December 2015 22:46
To: <us...@avro.apache.org>
Subject: Re: add a type to a union

Hi Youcef,

Glad you found my old blog post on Avro schema evolution :)

I encourage you to try a simple example, which will make it clearer: https://gist.github.com/ept/5fd7c625969248b31e73

In this example, the writer has a union of null, string and long, whereas the reader only has a union of null and string. A default value of null is set. If the record has a null or string value, it is correctly parsed by the reader. If the record has a long value, the reader throws an exception, because it is not one of the union datatypes it is expecting.

So the default value unfortunately doesn't help here. If you want to add a new branch to a union schema, you have to make sure that all the readers are updated with the new schema first, and only then should writers start generating data with the new schema.
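
The behaviour described above can be mimicked with a toy model in plain Python. This is not the Avro library and not the code from the gist; it matches branches by type name only (real Avro also allows some type promotions), and the function names are made up for the example. It shows why the field default never comes into play: a union value is written as a branch index plus the value, and resolution fails as soon as the writer's branch is missing from the reader's union.

```python
from typing import Any, List, Tuple


class UnknownBranchError(Exception):
    """Raised when the writer used a branch the reader's union lacks."""


def write_union(writer_union: List[str], branch: str, value: Any) -> Tuple[int, Any]:
    # A union value is encoded as the index of the chosen branch in the
    # writer's union, followed by the value itself.
    return writer_union.index(branch), value


def read_union(
    encoded: Tuple[int, Any],
    writer_union: List[str],
    reader_union: List[str],
    field_default: Any = None,
) -> Any:
    index, value = encoded
    branch = writer_union[index]
    if branch not in reader_union:
        # field_default is deliberately unused: a default only applies when
        # the whole field is missing from the writer's schema, not when the
        # field is present with a branch the reader cannot resolve.
        raise UnknownBranchError(f"reader union {reader_union} has no branch {branch!r}")
    return value


writer_union = ["null", "string", "long"]
reader_union = ["null", "string"]

print(read_union(write_union(writer_union, "string", "hello"), writer_union, reader_union))
try:
    read_union(write_union(writer_union, "long", 42), writer_union, reader_union)
except UnknownBranchError as exc:
    print("reader failed:", exc)
```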

Hope that helps.
Martin


La Poste postscript
This message is confidential. Subject to any agreement concluded in
writing between you and La Poste, its content does not in any way
constitute a commitment on the part of La Poste. Any publication, use or
distribution, even partial, must be authorized in advance. If you are
not the intended recipient of this message, please notify the sender
immediately.


