Posted to user@avro.apache.org by Martin Mucha <al...@gmail.com> on 2020/01/01 10:27:46 UTC

Re: Recommended naming of types to support schema evolution

Hi guys, thanks for the answers, I really appreciate it. There were a lot
of words typed just for me, thanks!

Foreword: there were some misunderstandings in my text, so I'll try to be
more specific and reference documentation/sources. And I have to apologize
beforehand: sometimes my "tone" is too confrontational, please don't take
it badly if you don't like the way I'm saying something. Let's go.
———

I think we're essentially talking about the same stuff in principle, but on
different levels, mine being more low-level Avro, while I think you are
talking about the Confluent "expansion pack". I'm not talking about the
Confluent platform, and actually cannot use it at the moment, but that's not
important, since all of this is possible in plain Avro.

(I) Let's begin with message encoding; you mentioned 3:

1) message format (5 byte prefix) — *my comment: I believe this is not
Avro, but a non-standardized expansion pack. The 5B prefix is an identifier
of the schema and works with the schema registry. OK.*
2) "stream format" — *my comment: again, not Avro but a Confluent expansion
pack. I know about its existence, however I did not see it anywhere and
don't know how it is actually produced.*
3) "send the schema before every message" — *my comment: not sure what that
is, but yes, I've heard about strategies of sending 2 kinds of messages,
where one is "broadcasting" new schemas. Not sure if this has trivial
support in pure Avro; in principle it should be possible, but ...*

OK, so now let's add 2 strategies from pure Avro:

4) sending just the bytes, schema does not change — well, this one is obvious
5) SINGLE OBJECT ENCODING (see [1] at the bottom for the link) *— comment:
this is what I've been talking about in the previous mail, and in principle
it's the Avro way of doing variant 1). The first 2 bytes are a header
identifying the version of the header format (yep, variant 1 with just a
bare identifier is incorrect, to be honest), followed by the schema
fingerprint relevant for the given header version (currently there is just
one, but there is room for future development, while variant 1 does not have
this possibility), which is 8B, 10B in total. But it's worth noting that
variants 1 and 5 are essentially the same, just 1 is incorrect by design,
while 5 is correct by design. In principle it's a binary identifier of the
schema.*
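
*To make 5) concrete, here is a minimal sketch of single object encoding
using Avro's own org.apache.avro.message classes. The Money schema is the
one discussed later in this mail; everything else is just illustrative:*

import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.message.SchemaStore;

public class SingleObjectEncodingSketch {
  public static void main(String[] args) throws Exception {
    Schema money = new Schema.Parser().parse(
        "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
        + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}");

    // Writer side: emits the 2-byte header marker (0xC3 0x01), the 8-byte
    // CRC-64-AVRO fingerprint of the writer schema, then the Avro body.
    BinaryMessageEncoder<GenericRecord> encoder =
        new BinaryMessageEncoder<>(GenericData.get(), money);
    GenericRecord record =
        new GenericRecordBuilder(money).set("value", "42.00").build();
    ByteBuffer bytes = encoder.encode(record);

    // Reader side: a SchemaStore maps fingerprints back to writer schemas,
    // i.e. exactly the local "registry without a service" discussed below.
    SchemaStore.Cache knownSchemas = new SchemaStore.Cache();
    knownSchemas.addSchema(money);
    BinaryMessageDecoder<GenericRecord> decoder =
        new BinaryMessageDecoder<>(GenericData.get(), money, knownSchemas);
    System.out.println(decoder.decode(bytes));
  }
}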

(II) quote: There's a concept of "canonicalization" in Avro, so if you have
"the same" schema, but with fields out of order, or adding new fields, the
canonical format is very good at keeping like-with-like.

I'm sorry, like-with-like is totally insufficient for me. I need 100%
same-with-same, otherwise there can be huge problems and financial loss.
Like-with-like is not acceptable. There must be a 100% guarantee that
evolution will do just what it should do. You asked what my evolution would
be: typically it will be just adding field(s), or renaming fields; let's say
that we will follow the recommendations of the Confluent platform [2]. In
more detail it is documented in the original Avro spec [3], which I believe
the Confluent platform just delegates to, but [3] is really harder to read
and understand in context.
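
*(For renames specifically, [3] matches fields via aliases in the reader
schema; as an illustration only (the new field name is invented), a reader
schema renaming Money.value to amount would look roughly like this:)*

{
  "namespace": "avroTest",
  "type": "record",
  "name": "Money",
  "fields": [
    {
      "name": "amount",
      "aliases": ["value"],
      "type": ["null", "string"]
    }
  ]
}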

(III) schema registry to the rescue(?) — OK, so what does a schema registry
do? IIUC, it's just a trivial app which holds all known schema instances and
provides a REST API to fetch schema data by the 5B ID which the deserializer
obtained from the message data. Right? But in (I) I already explained that
the `message format` is almost the same as `single object encoding`; actually
I think that the Avro team just took the Confluent serializer and retrofitted
it back into Avro, fixing the `message format` problems along the way. So in
principle, if you build a map ID -> Schema locally and keep it updated
somehow, you will end up having the same functionality, just without REST
calls. Because the Confluent SchemaRegistry is just that — an external map
ID -> schema, which you can read and write. Right? (Maybe with a potential
DoS, when a Kafka topic contains too many messages with schema IDs which the
schema registry does not know about, and it gets flooded by REST requests;
off topic.) So based on that, I don't really understand how a schema
registry could change the landscape for me, since it's essentially the same;
at most it might do some things a little bit differently in depth, which
allows it to side-step Avro gotchas. And that is what I'm searching for: a
pure Avro solution for schema evolution, and the recommended way to do that.
The problem I have with the pure Avro solution is that the Avro schema
parser (`org.apache.avro.Schema.Parser`) will NOT read 2 schemas with the
same "name" (`org.apache.avro.Schema.Name`), because "name" equality
(`org.apache.avro.Schema.Name#equals`) is defined using the fully qualified
type name, i.e. if you have a schema like:

{
  "namespace": "avroTest",
  "type": "record",
  "name": "Money",
  "fields": [
    {
      "name": "value",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}

the fully qualified name would be "avroTest.Money". And if you create a new
version and add a new field, but do not change the namespace or name, the
fully qualified name will stay the same, and one instance of
`org.apache.avro.Schema.Parser` will not be able to parse both of these
versions, producing: `throw new SchemaParseException("Can't redefine:
"+name);` This is why I asked about a recommended naming scheme: because
this limitation exists. So then you can only have 1 parser per schema, or a
different "name".
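
For illustration, the workaround I currently use is roughly the following:
one fresh Parser per schema version, keyed by the same CRC-64-AVRO
fingerprint that single object encoding puts on the wire (how the .avsc
contents are loaded is left out here):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class LocalSchemaMap {
  /** One new Parser per version, so two versions of avroTest.Money never
   *  meet in the same Parser and "Can't redefine" is never thrown. */
  static Map<Long, Schema> build(List<String> schemaJsonVersions) {
    Map<Long, Schema> byFingerprint = new HashMap<>();
    for (String json : schemaJsonVersions) {
      Schema schema = new Schema.Parser().parse(json);
      long fp = SchemaNormalization.parsingFingerprint64(schema); // same as SOE
      byFingerprint.put(fp, schema);
    }
    return byFingerprint;
  }
}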

The problem with schema identity is general: if you have a project with two
.avsc files with the same "name", the avro-maven-plugin will fail during
compilation. That's the second reason why I asked about a naming scheme: it
seems it is generally unsupported to have 2 versions of the same schema with
the same namespace.name in 1 project. Maybe this is a reason to have a
version ID somewhere in the namespace? If your app needs, for whatever
reason, to send data in 2 different versions?



(IV) the part where I lost you. Let me try to explain it then.

I really don't know how this works/should work, as there are close to no
complete, actual examples and the documentation does not help much. For
example, if an Avro schema evolves from v1 to v2,

*ok, so you have a schema, let's call it `v1`, and you add a field
respecting [2]/[3], and you have a second schema, v2.*

and the type names and name schemas aren't the same, how will the pairing
between fields be made? Completely puzzling.

*ok, so it's not that puzzling, it's explained in [3]. But as explained
above, schemas v1 and v2 won't be able to be parsed using the same
Schema.Parser because of the Avro implementation.*

I need no less than schema evolution with backward and forward
compatibility

*schema evolution is clear, I suppose, i.e. changing the original schema to
something different; and compatibility, backward and forward, is achieved
by Avro itself. The deserializer needs to somehow identify the writer schema
(trivially in strategies 1 or 5), then the data are deserialized using the
writer schema and evolved to the desired reader schema. Not sure where/if
this is documented, but you can check the sources:
org.apache.avro.specific.SpecificDatumReader#SpecificDatumReader(org.apache.avro.Schema,
org.apache.avro.Schema)*

/** Construct given writer's and reader's schema. */
public SpecificDatumReader(Schema writer, Schema reader) {
  this(writer, reader, SpecificData.get());
}
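
*A minimal sketch of how those two schemas meet on the consumer side, here
with GenericDatumReader rather than the generated-class variant; the method
and names are illustrative only:*

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;

public class EvolvingReaderSketch {
  /** Decodes the raw Avro body with the writer schema that produced it,
   *  resolving it into the reader schema the consumer was built against. */
  static GenericRecord read(byte[] body, Schema writer, Schema reader)
      throws IOException {
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(writer, reader);
    return datumReader.read(null, DecoderFactory.get().binaryDecoder(body, null));
  }
}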


with schema reuse (i.e. no hacks with a top-level union, but schema reuse
using schema imports).

*about "schema reuse": this is my term, as it's not documented sufficiently.
Sometimes you want to define a type which is referenced multiple times from
other schemas, potentially from different files. The typical, superbly ugly
and problematic hack (recommended by ... some guys) is to define the Avro
schema to have a top-level union [4] instead of a record, and cram
everything into 1 big file. But that is completely wrong. The correct way is
to define the types in separate files, and parse them in the correct order
or use `<imports>` in the avro-maven-plugin.*
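
*For example (file names invented), parsing in the correct order with one
shared Parser lets the second file reference types defined in the first.
This works for reuse across different types; two versions of the same type
still need separate parsers, as explained above:*

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;

public class ParseInOrderSketch {
  public static void main(String[] args) throws IOException {
    Schema.Parser parser = new Schema.Parser();
    // Money.avsc defines avroTest.Money; Invoice.avsc references it.
    Schema money = parser.parse(new File("src/main/avro/Money.avsc"));
    Schema invoice = parser.parse(new File("src/main/avro/Invoice.avsc"));
    System.out.println(invoice.toString(true));
  }
}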

I think I can hack my way through by using one parser per set of 1 schema of
a given version and all needed imports, which will make everything work
(well, I don't yet know about anything which will fail), but it completely
does not feel right. And I would like to know what the correct Avro way is.
And I suppose it should be possible without the Confluent schema registry,
just with single object encoding, as I cannot see any difference between
them, but please correct me if I'm wrong.

*I cannot see anything non-Avro being written here. I cannot see anything
Java-specific here either; these are all pure Avro constructs.*

(V) ad: Maybe it'd help to know what "evolution" you plan, and what type
names and name schemas you plan to be changing? The "schema evolution" is
mostly meant to make it easier to add and remove fields from the schemas
without having to coordinate deploys and juggle iron-clad contract
interchange formats. It's not meant for wild rewrites of the contract IDLs
on active running services!

about our usage: we have N running services which currently communicate
using Avro. We need to be able to redeploy services independently; that's
the reason for backward and forward compatibility: service X upgrades and
starts sending data using the upgraded schema, but old services must be able
to consume it! And while service X was down for the upgrade, myriads of
records were produced, which it must be able to process afterwards. So yes,
it's just adding/removing columns, mostly. This should be working and
possible using just Avro; well, that's how it's sold on their website. I
understand that maybe with the Confluent bonus-track code it might be
working correctly, but some of our services cannot use that, so we are stuck
with plain Avro. But that should not be a problem; single object encoding
and a schema registry should do the same thing, and Avro should work even
without the Confluent platform.

M.

Links:
[1]  https://avro.apache.org/docs/current/spec.html#single_object_encoding
[2] https://docs.confluent.io/current/schema-registry/avro.html
[3] https://avro.apache.org/docs/current/spec.html#Schema+Resolution
[4] https://avro.apache.org/docs/current/spec.html#Unions


On Tue, 31 Dec 2019 at 20:49, Lee Hambley <le...@gmail.com> wrote:

> Hi Martin,
>
> Vance already said it all, but let me see if I can elaborate a bit.
>
> I don't understand avro sufficiently and don't know schema registry at
>> all, actually. So maybe following questions will be dumb.
>>
>> a) how is schema registry with 5B header different from single object
>> encoding with 10B header?
>>
>
> Not sure what this 10b header is. Broadly speaking there are three ways to
> send header info with avro, plus the secret 4th way (don't; the schema
> doesn't change, reader and writer both have it).
>
> 1. Message format (5 byte prefix, with a schema registry, header carries
> just the schema/version lookup info for the registry)
> 2. Stream format (?) (naming is for sure wrong, this sends the schema
> before sending any records, then an arbitrary (unlimited?) number of
> records, useful for archiving homogeneous data)
> 3. send the schema before every message (might be flexible, but could
> negate any bandwidth savings)
>
> all three of these have names, and they're all recommended in certain
> circumstances, even without going deep, and with my weak executive
> summaries, I believe you could already imagine how they might be useful.
>
> b) will schema registry somehow relieve me from having to parse individual
>> schemas? What if I want to/have to send 2 different version of certain
>> schema?
>>
>
> There's a concept of "canonicalization" in Avro, so if you have "the same"
> schema, but with fields out of order, or adding new fields, the canonical
> format is very good at keeping like-with-like. Libraries for the registry
> will absolve you of doing any parsing.
>
> Usually you configure a writer (producer, whatever) with a registry URL,
> and a payload in a map/hash, and the "current" schemas, the library you use
> will canonicalize the schema, make sure it exists in the registry, and emit
> a binary avro payload referencing the schema.
>
> The reader needs no local schema files, it will receive a message with a
> 5b prefix and will look up that schema at that version in the registry, and
> will give you back a hash/map with the data. If you added a field to the
> producer before adding it to the consumer you may have an extra member in
> the map that you don't know how to handle yet, or you might have an empty
> value that you don't know how to deal with if the consumer "knows" more
> fields than the producer.
>
> You solve this "problem" with the regular approach you would in any code
> with untrusted data
>
>
>
>> c) actually what I have here is (seemingly) pretty similar setup (and
>> btw, which was recommended here as an alternative to confluent schema
>> registry): it's a registry without an extra service. Trivial map mapping
>> single object encoding long[data type] schema fingerprint, pairing schema
>> fingerprint to schema. So when the bytes "arrive" I can easily read header,
>> find out fingerprint, get hold onto schema and decode it. Trivial. But the
>> snag is, that single Schema.Names instance can contain just one Name of
>> given "identity", and equality is based on fully qualified type, ie.
>> namespace and name. Thus if you have schema in 2 versions, which does have
>> same namespace and name, they cannot be parsed using same Parser. Does
>> schema registry (from confluent platform, right?) work differently than
>> this? Does this "use it for decoding" process bypasses avros new
>> Schema.Parser().parse and everything beneath it?
>>
>
> It's not idiomatic to put "v2" or anything in the schema namespace, unless
> someone is coaching you to avoid the schema registry approach (which as a
> few of us have mentioned, is one principal reason to use avro). I've been
> in a company who have a v2 namespace in avro, but it's the last "v" we'll
> ever have. In v1 we didn't use a schema registry, in v2 we do, and the
> registry ensures readers and writers can always talk, and we just need to
> be mindful of
>
> FWIW we have one schema registry in each of our environments (prod,
> staging, qa), in retrospect we think this might have been a mistake, as for
> e.g the QA env doesn't keep any history, so we often fail to test older
> payloads in our test environments, but tbh it hasn't caused any _real_
> problems yet, but it's something I would consider approaching with a global
> registry (fed by my CI system?) in the future.
>
>
>> ~ I really don't know how this work/should work, as there are close to no
>> complete actual examples and documentation does not help much. For example
>> if avro schema evolves from v1 to v2, and the type names and nameschema
>> aren't the same, how will be the pairing between fields made ?? Completely
>> puzzling. I need no less then schema evolution with backward and forward
>> compatibility with schema reuse (ie. no hacks with top level union, but
>> schema reusing using schema imports). I think I can hack my way through, by
>> using one parser per set of 1 schema of given version and all needed
>> imports, which will make everything working (well I don't yet know about
>> anything which will fail), but it completely does not feel right. And I
>> would like to know, what is the corret avro way. And I suppose it should be
>> possible without confluent schema registry, just with single object
>> encoding as I cannot see any difference between them, but please correct me
>> if I'm wrong.
>>
>
> You lost me here, I think you're maybe crossing some vocabulary from your
> language stack, not from Avro per-se, but I'm coming at Avro from Ruby and
> Node (yikes.) and have never used any JVM language integration, so assume
> this is ignorance on my part.
>
> Maybe it'd help to know what "evolution" you plan, and what type names and
> name schemas you plan to be changing? The "schema evolution" is mostly
> meant to make it easier to add and remove fields from the schemas without
> having to coordinate deploys and juggle iron-clad contract interchange
> formats. It's not meant for wild rewrites of the contract IDLs on active
> running services!
>
> All the best for 2020, anyone else who happens to be reading mailing list
> emails this NYE!
>
>
>> thanks,
>> Mar.
>>
>> On Mon, 30 Dec 2019 at 20:32, Lee Hambley <le...@gmail.com> wrote:
>>
>>> Hi Martin,
>>>
>>> I believe the answer is "just use the schema registry". When you then
>>> encode for the network your library should give you a binary package with a
>>> 5 byte header that includes the schema version and name from the registry.
>>> The reader will then go to the registry and find that schema at that
>>> version and use it for decoding.
>>>
>>> In my experience the naming/etc doesn't matter, only things like
>>> defaults in enums and things need to be given a thought, but you'll see
>>> that for yourself with experience.
>>>
>>> HTH, Regards,
>>>
>>> Lee Hambley
>>> http://lee.hambley.name/
>>> +49 (0) 170 298 5667
>>>
>>>
>>> On Mon, 30 Dec 2019 at 17:26, Martin Mucha <al...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I'm relatively new to avro, and I'm still struggling with getting
>>>> schema evolution and related issues. But today it should be simple question.
>>>>
>>>> What is recommended naming of types if we want to use schema evolution?
>>>> Should namespace contain some information about version of schema? Or
>>>> should it be in type itself? Or neither? What is the best practice? Is
>>>> evolution even possible if namespace/type name is different?
>>>>
>>>> I thought that "neither" it's the case, built the app so that version
>>>> ID is nowhere except for the directory structure, only latest version is
>>>> compiled to java classes using maven plugin, and parsed all other avsc
>>>> files in code (to be able to build some sort of schema registry, identify
>>>> used writer schema using single object encoding and use schema evolution).
>>>> However I used separate Parser instance to parse each schema. But if one
>>>> would like to use schema imports, he cannot have separate parser for every
>>>> schema, and having global one in this setup is also not possible, as each
>>>> type can be registered just once in org.apache.avro.Schema.Names. Btw. I
>>>> favored this variant(ie. no ID in name/namespace) because in this setup,
>>>> after I introduce new schema version, I do not have to change imports in
>>>> whole project, but just one line in pom.xml saying which directory should
>>>> be compiled into java files.
>>>>
>>>> so what could be the suggestion to correct naming-versioning scheme?
>>>> thanks,
>>>> M.
>>>>
>>>

Re: Recommended naming of types to support schema evolution

Posted by Lee Hambley <le...@gmail.com>.
> I think we're essentially talking about the same stuff in principle, but
> on different levels, my being more low-level avro, while I think you talk
> about confluent "expasion-pack". I do not talk about confluent platform,
> and actually cannot use it at the moment, but that's not important, since
> all of that is possible in plain AVRO.
>

Well, Confluent or Hortonworks: there are two widely used registries, and I
have used both interchangeably (they are different, but not in basic usage).


> (I) Lets begin with message encoding, you mentioned 3:
>
> 1) message format (5 byte prefix) — *my comment: I believe this is not
> AVRO, but nonstandardized expansion pack. 5B prefix is some hash of schema,
> works with schema registry. OK.*
>

Correct, it's common enough that I'd be confident in calling it a de-facto
standard.
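
For what it's worth, the prefix itself is trivial to pull apart; roughly
this (a sketch, not taken from any particular client library):

import java.nio.ByteBuffer;

class RegistryFraming {
  /** The de-facto wire format: one magic byte (0x00) followed by a
   *  4-byte big-endian schema ID, then the plain Avro body. */
  static int schemaIdOf(byte[] message) {
    ByteBuffer buf = ByteBuffer.wrap(message);
    if (buf.get() != 0) {
      throw new IllegalArgumentException("not a registry-framed message");
    }
    return buf.getInt(); // look this up in the registry (or a local cache)
  }
}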


> *2) *"stream format" — *my comment: again, not avro but confluent
> expansion pack. I know about its existence however I did not see it
> anywhere and don't know how this is actually produced.*
>

https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files - I
totally goofed on the name, this is the object container file. One header,
multiple records.


> 3) "send the schema before every message" — *my comment: not sure what
> that is, but yes, I've heard about strategies of sending 2 kinds of
> messages, where one is "broadcasting" new schemas. Not sure if this has
> trivial support in pure AVRO, in principal it should be possible, but ...*
>

Send schema before message is also
https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files where
number of records = 1.

ok so now lets add 2 strategies from pure avro:
>
> 4) sending just bytes, schema does not change — well this one is obvious
>

yep



> 5) SINGLE OBJECT ENCODING (see [1] at bottom for link) *— comment: this
> is what I've been talking about in previous mail, and in principle it's
> AVRO way of 1) variant. So first 2 bytes are header identifying version of
> header (yep, variant 1 with just hash is incorrect to be honest), then
> followed by schema fingerprint relevant for given header version (currently
> there is just one, but there is possibility for future development, while
> variant 1 does not have this possibility), which is 8B, in total 10B.*
>



> * But it's worthy to see, that variants 1 and 5 are essencially the same,
> just 1 incorrect by design, while 5 is correct by desing. But in principle
> it's binary identifier of schema.*
>

You seem to have something strongly against using the schema registry, but
it solves one very real problem:

How to ship schemas to your readers and writers? With the SOE method you
know _which_ schema to use, but I'm not aware of any solution to help with
that, to help parse the files, run the signatures, etc. I'm sure something
exists, and it may be fair to say that 5 is a formalized version of 1, or
some other symbiotic relationship where two competing approaches are
optimized for different things.
When applying a registry, it can be possible to authenticate producers
(only trusted producers can upload new schemas). And you eliminate a lot of
trouble when deploying consumers, since they don't need to have the current
set of all schemas bundled with their deploy payload, they just need HTTP
access to the registry.

(I'm not trying to push you to a registry, at all, but I have only good
experiences)

(II) quote: There's a concept of "canonicalization" in Avro, so if you have
> "the same" schema, but with fields out of order, or adding new fields, the
> canonical format is very good at keeping like-with-like.
>
> I'm sorry, like-with-like is totally insufficient for me. I need 100%
> same-with-same, otherwise there can be huge problems and financial loss.
> Like-with-like is not acceptable. There must be 100% guarantee, that
> evolution will do just what it should do. You asked what my evolution would
> be: Typically it will be just adding field(s), or renaming fields, lets say
> that we will follow recommendations of confluent platform [2], in more
> detail it is documented in original avro[3], which I believe confluent
> platform just delegates to, but [3] is really harder to read and understand
> in context.
>

I'm not trying to sell you anything, and if Avro is insufficient, then
please, take whatever tech you need. You can't demand magical schema
resolution behaviour whilst also demanding absolute control.

If you change {namespace: "Foo", name: "Money", ...} then the combination
`Foo.Money` will be assigned an ID. The schema will be canonicalized
(which is deterministic and very completely specified in the spec) and
assigned a version. If you add/remove fields on `Foo.Money` you will
increment the version.

Producers will emit data with a `[s1, v1]` (just an example) header, and
consumers with a registry-supporting library will ALWAYS get that correct
schema.

I'll confess I don't know what happens if you add a field (version+1), and
then remove the same field, if that's technically a new version, or the
prior version, I might experiment with it sometime.


> (III) schema registry to the rescue(?) — ok, so what schema registry does?
> IIUC, it's just a trivial app, which holds all known schema instances, and
> provide REST api, to fetch Schema data by their 5B ID, which deserializer
> obtained from message data. Right?
>

Correct.


> But in (I) I already explained, that `message format` is almost the same
> as `single object encoding`,
>

SOE doesn't say how you should find the right schema given a fingerprint,
so SOE and the schema registry approach are not equivalent, and if you go
SOE you'll need to find or build something like a registry.


> actually I think that avro team just stole confluent serializer and
> retrofit in bact to avro, fixing `message format` problems along the way.
> So in principle if you build a map ID->Schema locally and keep it updated
> somehow, you will end-up having same functionality, just without rest calls.
>

YSK that the registry clients often do a pre-fetch on initialization, and
only actually make HTTP calls on an unknown incoming schema ID/Ver which
should be infrequent enough that you don't need to plan latency/time for
HTTP requests.

Because confluent SchemaRegistry is just that — external map id-> schema,
> which you can read and write. Right? (maybe with potential DoS, when kafka
> topic contains too much messages with schema IDs which schema registry does
> not know about, and it will be flooded by rest requests; off topic). So
> based on that, I don't really understand, how schema registry could change
> the landscape for me; since it's essencially the same, at most it might do
> some thinks in depth a little bit different, which allows to side-step AVRO
> gotchas. And that is what I'm searching for. Pure AVRO solution for schema
> evolution, and recommended way how to do that. The problem I have with pure
> avro solution of this is, that avro schema parse (`
> org.apache.avro.Schema.Parser`) will NOT read 2 schemas with same "name"
> (`org.apache.avro.Schema.Name`), because "name" equality (`
> org.apache.avro.Schema.Name#equals`) is defined using fully
> qualified type name, ie. if you have schema like:
>




> The problem with schema identity is general: if you have project, with two
> avsc with same "name" the avro-maven-plugin will fail during compilation.
> That's second reason why I asked about naming scheme: it seems that is
> generally unsupported to have 2 versions of same schema having same
> namespace.name in 1 project. Maybe this is the reason to have version ID
> somewhere in namespace? If you app needs, for whichever reason, to send
> data in 2 different versions?
>

Not familiar enough with Java/Maven to know anything about this, sorry.

However the usual approach is to always have the NEWEST schema in the
producer, and then have all the old schemas available for the consumers.
Use a registry, or build something yourself to extend SOE.


> (IV) the part where I lost you. Let me try to explain it then.
>
> I really don't know how this work/should work, as there are close to no
> complete actual examples and documentation does not help much. For example
> if avro schema evolves from v1 to v2,
>
> *ok so you have schema, lets call it `v1` and you add field respecting
> [2]/[3], and you have second schema v2.*
>
> and the type names and nameschema aren't the same, how will be the pairing
> between fields made ?? Completely puzzling.
>
> *ok, so it's not that puzzling, it's explained in [3]. But as explained
> above, schema v1 and v2 won't be able to be parsed using same Schema.Parser
> because of AVRO implementation.*
>

Sounds like a problem that exists because the interface expects you to
define the schema in code *before*. Any code using a schema registry just
needs a registry client, and then a binary payload, and you get back a
decoded object.


> I need no less then schema evolution with backward and forward
> compatibility
>
> *schema evolution is clear I suppose, ie changing original schema to
> something different, and compatibility, backward and forward, is achieved
> by AVRO itself. Deserializer needs to somehow identify writer schema
> (trivially in strategies 1 or 5), the the data are deserialized using
> writer schema, and evolved to desired reader schema. Not sure where/if this
> is documented, but you can check sources: *org.apache.avro.specific.SpecificDatumReader#SpecificDatumReader(org.apache.avro.Schema,
> org.apache.avro.Schema)
>
> /** Construct given writer's and reader's schema. */
> public SpecificDatumReader(Schema writer, Schema reader) {
>   this(writer, reader, SpecificData.get());
> }
>
>
> with schema reuse (ie. no hacks with top level union, but schema reusing
> using schema imports).
>
> *about "schema reuse": this is my term, as it's not documented
> sufficiently. Sometimes you want to define type, which is referenced
> multiple times from other schema, potentially from different files. Typical
> and superbly ugly and problematic hack (recomended by ... some guys) is to
> define avro schema to have top-level union[4] instead of record, and cram
> everyting into 1 big file. But that is completely wrong. The correct way is
> to define that in separate files, and parse them in correct order or use `*
> <imports>*` in avro-maven-plugin.*
>

Well, there's something else that maybe you should be aware of, the
"superbly ugly hack" is SUPER common. In Kafka it (to my knowledge) is only
really possible to define one message type per topic, so the solution
everyone uses is to define an "empty" type, with just one giant union of
all the other types that can exist on the topic.
FWIW this has worked flawlessly for years at my company with dozens of
schema changes per month across the company common libraries/schemas. But I
agree, it's a shitty hack.
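
In case it helps to see it, such a wrapper file is nothing more than a
top-level union of the concrete record types, along these lines (the type
names here are made up):

[
  {
    "namespace": "avroTest",
    "type": "record",
    "name": "Money",
    "fields": [
      {"name": "value", "type": ["null", "string"]}
    ]
  },
  {
    "namespace": "avroTest",
    "type": "record",
    "name": "Refund",
    "fields": [
      {"name": "amount", "type": "avroTest.Money"}
    ]
  }
]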



> I think I can hack my way through, by using one parser per set of 1 schema
> of given version and all needed imports, which will make everything working
> (well I don't yet know about anything which will fail), but it completely
> does not feel right. And I would like to know, what is the corret avro way.
> And I suppose it should be possible without confluent schema registry, just
> with single object encoding as I cannot see any difference between them,
> but please correct me if I'm wrong.
>
> *I cannot see anything non-avro being written here. I cannot see here
> anything java-specific, this is all pure avro constructs.*
>
> (V) ad: Maybe it'd help to know what "evolution" you plan, and what type
> names and name schemas you plan to be changing? The "schema evolution" is
> mostly meant to make it easier to add and remove fields from the schemas
> without having to coordinate deploys and juggle iron-clad contract
> interchange formats. It's not meant for wild rewrites of the contract IDLs
> on active running services!
>
> about our usage: we have N running services which currently communicates
> using avro. We need to be able to redeploy services independently, that's
> the reason for backward and forward compatibility: service X upgrades and
> start sending data using upgraded schema, but old services must be able to
> consume them! And after service X is upgraded, there are myriads of records
> produced while it was down for a while, which must be processed. So yes,
> it's just adding/removing column, mostly. This should be working and
> possible using just avro, well, as it's sold on their website. I
> understand, that maybe with confluent bonus-track code it might be working
> correctly, but some of our services cannot use that, so we are stuck with
> plain avro, but that should not be problem; single-object-encoding and
> schema registry should do the same thing and avro should be working even
> without confluent platform.
>

This makes perfect sense, so let me try and close this email out with the
following points that I believe are important:

- schema is defined by namespace+name for the rest of these bulletpoints
- if you rename the schema {'namespace': 'money.v1', name: "activity"}, to
{'namespace': 'money.v2', name: "activity"} this is a new schema, no
evolution of anything will help you
- typically producers (writers in Avro parlance) will always have the
newest schema
- for SOE or registry approaches (anything that doesn't send the whole
schema in a prefix header) the consumers (Readers in avro parlance) will
need to access prior schemas, I guess if you use SOE you need to find a way
to compile prior schemas in, and then have multiple instances of your Java
parser class, and put them all in a big map with their computed
fingerprint, so you can find the right one.
- with the registry the producer would put the newest schema in the
registry the first time they produce, consumers will use the ID+ver to look
up the correct version in the registry
- consumers don't need to be compiled with schemas from files
- schema resolution is DETERMINISTIC, but you still need to define what
happens when a consumer receives an unknown field in a "newer" schema that
maybe doesn't have defaults (or implement a policy in your coding workflow
to say "all evolved fields must have defaults" to simplify; a sketch of such
an evolved field follows below)
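
To make that last point concrete, a second version of the Money schema from
earlier in the thread could simply add a defaulted field (the field name
here is made up): old readers ignore the new field, and new readers fall
back to the default when reading old data.

{
  "namespace": "avroTest",
  "type": "record",
  "name": "Money",
  "fields": [
    {"name": "value", "type": ["null", "string"]},
    {"name": "currency", "type": "string", "default": "EUR"}
  ]
}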

I'd also say that you should _maybe_ avoid the Confluent docs. I've never
read them; I avoid anything Confluent like the plague after being burned by
some of their shitty practices in a previous job. The Hortonworks registry
is compatible with every lib I've ever tried (node.js, ruby, go, rust) and
I assume the same is true for Java. There's virtually no documentation here
because what it's doing is TRIVIAL.

Finally, if there were a registry+SOE solution (and maybe there is) that
relies on the 10b header and fingerprint, rather than the canonical-form
JSON and some "IDs", I agree that would be preferable, but it's still
essentially going to be an HTTP/REST "shared service" between producers and
consumers.

Your usecase makes sense, it's the same thing all of the rest of us are
doing with schema evolution, at this point I'd recommend that you just
experiment for a few hours one afternoon until you feel comfortable.

This really isn't terribly complicated, and your heightened concerns
about finances etc. are commendable, but unwarranted; a lot of our
multi-million EUR turnover in my current role is handled with Avro, and
we've never missed a beat.

People in our team occasionally wish for something simpler, and we look at
protobuf, or msgpack, and invariably arrive back at "ohh, right, avro makes
the most sense".

Not sure how much more I can help, since I seem to be advocating for a
solution you have decided to avoid, but I hope the specifics in this mail
about SOE and "message format" prefixes, and how they influence
producer/consumer design and deployment, help at least a little.

Regards,