Posted to user@avro.apache.org by Askar Safin <sa...@mail.ru> on 2022/02/12 23:22:04 UTC

Gigantic list of Avro spec issues

Hi. I'm writing my own Avro implementation in Rust for personal use. During this work I found a lot of issues in the
Avro spec (the list follows).

I am sending this mail not only to the user and dev mailing lists of Avro, but also to the Apache community list, the
Kafka list and 3 semi-randomly chosen Materialize employees, because I want to draw attention to these problems. I hope
this wider community helps Avro fix these issues and possibly provides the necessary resources.

As far as I understand, Avro is used in Kafka. And Kafka, according to its site, is used in "80% of all Fortune 100
companies". So Avro is a critical piece of humanity's infrastructure. It should be absolutely perfect (and so I list
even very small bugs). But it is not perfect.

Some items in this list are (small and big) bugs, some are typos, and some are my objections to the design. Some are
fixable while keeping compatibility, some are not. I don't want to spend my time reporting them as separate bugs, but
you can try to convince me to do so.

I created this list simply by reading the spec from end to end (I skipped the sections on RPC and logical types). And I
didn't even look at the implementations!

I write this in the hope of helping Avro.

I think a big audit of the spec and its implementations should be done.

All line numbers refer to spec.xml from the tag release-1.11.0 (i.e.
https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml ). As far as I understand, this tag
corresponds to the currently published version at https://avro.apache.org/docs/current/spec.html .

So, here we go.

* [Opinion] [No line]. In Avro one has to define named records inside each other, like so:

{ "type": "record", "name": "a", "fields": [{"name":"b","type":{"type":"record","name":"c",...}}] }

This is very unnatural. In popular programming languages one usually defines named records next to each other, not one
inside the other. Such a representation is not handy to deal with programmatically. In my implementation I have to
convert this representation to the usual form "root type + list of named types" right after reading the JSON and
convert it back just before writing (see the sketch below).
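Here is a minimal sketch, in Rust since that is what I use, of the "root type + list of named types" form I convert to
(all the names below are mine, not from the spec):

struct SchemaFile {
    named_types: Vec<NamedType>, // records, enums and fixeds, defined next to each other
    root: TypeRef,               // the root schema, referring to named types by name
}
struct NamedType { name: String, def: TypeDef }
enum TypeRef { Primitive(String), Named(String) } // unions, arrays and maps elided
enum TypeDef { Record(Vec<(String, TypeRef)>), Enum(Vec<String>), Fixed(usize) }

Reading turns the nested JSON form into this; writing inlines each named type back at the place of its first use.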

* [Opinion] [No line]. In this list you will see a lot of questions about the Avro schema format (encoded as JSON). A
good JSON Schema ( https://json-schema.org/ ) for it would resolve many of them.

* [Seems to be bug] [Line 49]. "derived type name" is a vague term. In fact, throughout the whole spec the phrase "type
name" is used very vaguely. Sometimes it means strings like "record" and sometimes it means the names of named types. I
propose defining, at the very beginning of the spec, terms for primitive types, terms for strings like "record" and
terms for names of defined types. Here is one possible way to do this (a sketch follows below): call strings like
"record" and "fixed" "type kinds" and never call them type names, thus reserving the term "type name" for named types
only (and possibly for primitive types).

This issue has already caused problems: look, for example, at this problem with {"type":"record","name":"record",...}:
https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .
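Sketched in Rust purely to illustrate the proposed terminology (the names below are mine, not the spec's):

enum TypeKind { Primitive, Record, Enum, Fixed, Array, Map, Union }
struct TypeName(String); // e.g. "org.example.record" -- with separate terms, a named type called "record" is no longer confusing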

* [Opinion] [Line 58]. There is no primitive type for unsigned 64-bit integers. Such a type is present in languages
such as C and Rust.

* [Very theoretical bug, possible even security-related] [Line 435]. "The float is converted into a 32-bit integer
using a method equivalent to Java's floatToIntBits and then encoded in little-endian format". If we click the provided
link, we will see that this Java function does NaN normalization. I think NaN normalization is a good thing. But I
think it is quite possible that implementers of the spec overlooked this NaN normalization requirement. So I propose:
state explicitly, directly in the Avro spec, that NaNs are normalized. Audit all Avro implementations to check whether
they actually implement this requirement. Create tests which actually test this requirement.

Also, I don't know whether the bit pattern provided in that Java doc (0x7fc00000) is a quiet NaN or a signaling one. If
it is signaling, this is very bad.

As far as I understand, if you configure your FPU in a particular way, then merely copying a signaling NaN from one
place to another will abort your program. So, if your FPU is configured a certain way, then feeding particular binary
Avro data to a program can crash it. I.e. this is a security problem. So a reader should be careful to check whether
the input data is a signaling NaN *before* storing it in floating-point registers.

I checked whether manipulating a signaling NaN can actually crash a program with default settings on Windows and Linux.
It turned out that the program does not crash. Still, I think signaling NaNs should be handled carefully.

Write in the spec that writers should normalize NaNs, that readers should reject non-normalized NaNs and that readers
should be careful not to store an incoming floating-point number in a floating-point variable before sanitizing it.
Write that this is a security issue.
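For illustration, here is a writer-side sketch in Rust of what I am asking for; this is my reading of the
floatToIntBits behaviour, not something the spec currently states:

fn write_float(f: f32, out: &mut Vec<u8>) {
    // Collapse every NaN to the single quiet-NaN bit pattern 0x7fc00000 before encoding.
    let bits = if f.is_nan() { 0x7fc0_0000u32 } else { f.to_bits() };
    out.extend_from_slice(&bits.to_le_bytes()); // 4 bytes, little-endian, as the spec says
}

(For what it is worth: 0x7fc00000 has the most significant mantissa bit set, which under the usual IEEE 754 convention
makes it a quiet NaN.)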

* [Opinion] [Line 68]. "unicode character sequence". As far as I understand, "Unicode character sequence" means a
sequence of Unicode scalar values. Note that a scalar value is not the same thing as a code point. Unfortunately, some
people don't know this, so please write explicitly: "this is a sequence of scalar values, not code points", to make
sure implementations will be correct.

* [Bug] [Line 71]. "Primitive types have no specified attributes". This is a lie. At line 1527 you specify a logical
type based on the primitive type int. Thus you specify a particular meaning of the attribute "logicalType" for the
primitive type "int". Be careful with your wording. The spec should be rock-solid.

* [Opinion] [Line 96]. "aliases: a JSON array of strings, providing alternate names for this record (optional)". Is an
empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may say this is nitpicking, but I don't think so.
Avro has an important place in our infrastructure, so everything is important. Think carefully about whether an empty
list (and duplicates) is allowed everywhere in the spec where some kind of list appears. I think empty arrays (and
duplicates) should be disallowed in this particular case, because the more things we allow, the bigger the attack
surface.

* [Bug] [Line 98]. "fields: a JSON array, listing fields". How many fields are allowed? Already reported by me at
https://issues.apache.org/jira/browse/AVRO-3279

* [Bug] [Line 235]. "Unions". How many variants are allowed in a union? Already reported by me at
https://issues.apache.org/jira/browse/AVRO-3280

* [Bug] [Line 101]. "name: a JSON string providing the name of the field (required), and". The word "and" is usually
placed immediately before the last item in a sequence. The text here looks as if the item "doc" was last, but then the
text was edited carelessly. This is a very stupid typographic issue which shows that the authors are not careful about
spec quality. Also, the spec is not consistent about punctuation after list items (this applies to the whole spec).
Sometimes I see nothing at the end of an item, sometimes "." and sometimes ";".

* [Opinion] [Line 106]. "default: A default value for..." What follows is essentially a description of the JSON
representation of an Avro datum (except for unions). So, you managed to put a very important part of your spec into one
paragraph in a second-level bullet point?!

* [Opinion] [Line 112]. "Default values for union fields correspond to the first schema in the union". This phrase
introduces a difference between the JSON encoding of Avro data and the JSON encoding of a default field: for a field of
type ["string", "null"] the default is written as a plain "abc", while the JSON encoding of the same union value would
be {"string": "abc"}. And, of course, the presence of this difference is a design bug.

* [Opinion] [Line 113]. "Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255
are mapped to unsigned 8-bit byte values 0-255". Wat? This is a very unnatural encoding. You misuse JSON strings: they
are for text, not for binary data. You should use an array of numbers instead, i.e. encode the bytes 0x0f 0x02 as
[15, 2]. Moreover, how will you encode null bytes? "\u0000", right? C programs have difficulties with such strings.
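Here is a sketch in Rust of what the current string form forces a reader to do (the helper name is mine):

fn default_string_to_bytes(s: &str) -> Option<Vec<u8>> {
    // Each code point 0-255 stands for one byte; anything above 255 is an error.
    s.chars()
        .map(|c| u8::try_from(u32::from(c)).ok())
        .collect()
}

With it, "\u000f\u0002" decodes to [15, 2] -- exactly what the plain JSON array [15, 2] would have said directly,
without abusing a string.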

* [Bug] [Line 123]. Okay, so longs are encoded as JSON integers. But what if a given long is not a JSON-safe integer?
As we know from https://datatracker.ietf.org/doc/html/rfc7159 , integers outside of the range [-(2**53)+1, (2**53)-1]
are not JSON-safe (not guaranteed to be interoperable).
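For reference, this is the check a JSON encoder would need in order to stay inside that range (a tiny sketch; the
constant name is mine):

const JSON_SAFE_MAX: i64 = (1i64 << 53) - 1; // 2**53 - 1
fn is_json_safe(v: i64) -> bool {
    (-JSON_SAFE_MAX..=JSON_SAFE_MAX).contains(&v)
}

The spec does not say what to do with a long outside this range (reject it? encode it as a string?).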

* [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON, even though they seem to be allowed by the Avro
spec. So, the JSON representation of Avro data is incomplete.

* [Bug] [Line 128]. "enum". What is an enum? :) This term is not yet defined by the spec at this point. For unknown
reasons you decided to insert essentially the whole description of the JSON representation of Avro data into one small
paragraph even before the type system is fully described. Please use terms only after their definition.

* [Stupid bug] [Line 168]. "namespace". The namespace attribute is not marked as optional or required. Do you ever read your spec?

* [Even more stupid bug] [Line 168]. "<em>namespace</em>". The word "namespace" is marked up using <em>, not <code>,
and we can see this in the rendered version. This is a very stupid typographic bug, which is immediately obvious to
anybody reading this document, even to non-technical people.

* [Bug] [Line 200]. "a single attribute". As we see in the provided example, "default" is allowed, too. What is the
meaning of this "default" attribute? And how does its meaning differ from the meaning of the "default" key in a field
description? (Same for maps)

* [Bug] [Line 238]. "declares a schema which may be either a null or string". Lie. The schema is ["null", "string"].
The *value* may be a null or a string. Please check that you don't confuse types (schemas) with values throughout your
whole spec.

* [Very opinionated opinion :)] [Line 235]. I don't like your unions at all. I come from languages like Haskell and
Rust, where true sum types are supported. They are similar to your unions, but their alternatives are named.
Alternatives are identified by their names, so there is no restriction on duplicate types, and so there is no need for
the very unnatural restriction "Unions may not contain more than one schema with the same type, except for the named
types record, fixed and enum" (see the sketch below).

* [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings, providing alternate names for this enum". You
mean "fixed", right? So, you copy-pasted the section on enums? Have you ever read your spec from end to end even once?

* [Bug] [Line 265]. "size: an integer, specifying the number of bytes per value". Is zero allowed?

* [Bug] [Line 292]. "The null namespace may not be used in a dot-separated sequence of names". You previously defined
the null namespace as an empty string in place of the *whole* namespace. I.e. the null namespace is the lack of a
namespace (i.e. the lack of the whole dot-separated sequence). So it makes no sense to speak of using the null
namespace in a dot-separated sequence. You probably mean that one should not use an empty string as a component of a
dot-separated sequence.

* [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished by specifying an additional schema, the
results of which are described in Schema Resolution". The term "additional schema" is vague here. I would put it like
this: "Deserializing data into a newer schema is accomplished by using the algorithm described in Schema Resolution".

* [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to read Avro data with a schema that does not..."
The whole paragraph is very vague. On first reading I thought it was about schema resolution. After several attempts to
understand it I finally realized that the paragraph is about attempts to read without the original writer schema being
available at all. I propose removing the whole paragraph or rewriting it completely.

* [Bug] [Line 385]. "For example, int and long are always serialized the same way". What does this mean? You probably
mean that the *same* int and long (i.e. an int and a long which are numerically identical) are serialized the same way.

* [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague. Do you mean no bytes at all? Or null bytes,
i.e. some undefined number of null bytes? (Of course, I understand that you mean the first variant, but I still don't
like the phrase.)

* [Bug] [Line 417]. "int and long values are written..." Is the canonical (i.e. smallest) encoding of numbers required?
Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307

I still think canonical representations should be required. The more forms of encoding you allow, the bigger the attack
surface.

Also, it would be a desirable property for binary representations to be equal when the data is equal. It would be good
if you guaranteed this property at least for some subset of schemas (and of course, you should state explicitly for
which schemas the property is guaranteed). Non-canonical representations break this property (see the sketch below).
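To make the point concrete, here is a sketch in Rust of a canonical (minimal-length) writer for long, following the
spec's zig-zag varint description (the function name is mine):

fn encode_long_canonical(n: i64, out: &mut Vec<u8>) {
    let mut zz = ((n << 1) ^ (n >> 63)) as u64; // zig-zag: small magnitudes become small codes
    loop {
        let byte = (zz & 0x7f) as u8;
        zz >>= 7;
        if zz == 0 {
            out.push(byte); // last byte: continuation bit clear
            break;
        }
        out.push(byte | 0x80); // continuation bit set: more bytes follow
    }
}

A writer could instead pad the value with redundant continuation bytes, and as far as I can tell a reader is not
required to reject that; this is exactly the freedom I would like the spec to remove.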

* [Bug] [Line 446]. "bytes of UTF-8". As far as I understand, UTF-8 is a sequence of encoded scalar values (not to be
confused with code points). Unfortunately, not everybody knows this, and thus we see WTF-8 (i.e. an encoding similar to
UTF-8, but with standalone surrogates) in places where proper UTF-8 should reside. So everywhere the spec says "UTF-8"
I propose to state explicitly that standalone surrogates are not allowed and that readers should fail if they find them
(I would prefer to place this sentence in the introduction of the spec).
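As a data point, a reader written in Rust gets this check essentially for free, because str::from_utf8 already rejects
surrogate encodings (WTF-8/CESU-8 sequences such as ED A0 80) as invalid UTF-8:

fn read_avro_string(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes) // fails on any byte sequence that is not well-formed UTF-8
}

The spec could simply require this behaviour from every implementation.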

* [Bug] [Line 572]. "Currently for C/C++ implementations, the positions are practically an int, but theoretically a
long". Wat? So, other implementations use int (as per the spec), but C++ uses long, right? So, go fix the C++
implementation to match the spec and the other implementations.

* [Opinion] [Line 596]. "if its type is null, then it is encoded as a JSON null". There is no reason to special-case
nulls. This is an additional requirement which adds complexity to implementations for no reason.

* [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." It seems I have found a real bug. :) Consider
this schema:

[{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]

As far as I understand, such a schema is fully allowed. Now consider this encoded value: {"map":{"b":0}}. What is it? A
map, or the record named "map"?

* [Bug] [Line 677]. "data is ordered by ascending numeric value". What about NaNs?

* [Bug] [Line 682]. "compared lexicographically by Unicode code point". Replace "code point" with "scalar value". UTF-8
consists of encoded scalar values.

* [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for this file". I don't like this point. It
implies that container files with identical contents are usually not byte-for-byte equal, so it is not possible to
compare them bitwise to determine equality. So, in my Avro implementation I write null bytes instead of this marker
(yes, this possibly means that my implementation is non-conforming).

* [Opinion] [Line 717]. There is no marker for the end of a container file. Thus there is no way to determine whether
all data was written.

* [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable to string". Wat? How are these values promoted?

* [Bug] [Line 1153]. What should an implementation do (when it does schema resolution)? Should it first check that the
schemas match (and report any errors) and then read the data? Or should it proceed straight to reading the data? This
is an important distinction. For example, what happens when we attempt to read a file container without data elements
using the schema resolution algorithm? (Are such containers allowed, by the way?) In the first case a schema check
should be performed. In the second case such a read should always succeed.

If you think the first case is correct, then the section should describe the algorithm for determining whether schemas
match separately from the algorithm for actually reading the data.

* [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int" instead of {"type":"int"}>>, right?

* [Bug] [Line 1331]. "replace any escaped characters". Any? What about "a\"b"? It is impossible to replace :)

----

Some notes about my task. I want to implement this:
https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ . I.e. I want to have shell utils on Linux
which exchange structured data. Here is how I chose a format for representing that data.

* I want the format to be binary, not textual; this rules out JSON, XML, etc.
* I want the format to be typed; this rules out CBOR etc.
* I want the format to support proper sum types (similar to Haskell's); this rules out Microsoft Bond. As far as I
understand, this also rules out using GVariants, proposed in the above-mentioned article. And it rules out Protobuf:
Protobuf has support for sum types (they are called OneOf), but these OneOfs are always optional (speaking in Avro
language: you always get ["null", "int", "bool"] instead of ["int", "bool"])
* I want the format to support recursive types; this rules out BARE ( baremessages.org )

So, not many formats are left: Avro and possibly a few more. And I chose Avro. And I really like it, because:

* It is very compact
* It has a very elegant way to support schema evolution (as opposed to Protobuf, where fields are tagged, i.e. you
trade space efficiency for schema evolution)
* It has a container format with the schema attached
* You don't need to write an item count in the container header (good for streaming)

So, Avro is simply the *best* for my task. But then I discovered its problems (listed above). How did it happen that
such a good format has such a bad spec? How did it happen that the *best* format for this task turned out to be so bad?
What does this say about our industry?

==
Askar Safin
http://safinaskar.com
https://sr.ht/~safinaskar
https://github.com/safinaskar

Re: Gigantic list of Avro spec issues

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!

Thanks, Dan, for your calm and measured response -- you've given some
excellent advice on how someone can make a positive contribution to
the project and the spec.

Askar, your approach in presenting your specification review should
have been more constructive: it isn't useful to cross-post across
lists and recipients indiscriminately, and please remain civil when
pointing out (potential) errors [1].  I understand that the Avro
specification isn't perfect, but I think the community has always been
open to making changes and providing clarifications.

I'm glad that you've found some things to appreciate in the Avro
project, and I hope you choose to contribute in making the
improvements that you'd like to see.

Ryan

[1] https://www.apache.org/foundation/policies/conduct


Re: Gigantic list of Avro spec issues

Posted by Dan Schmitt <da...@gmail.com>.
I will admit the spec is likely weak around Unicode/UTF encoding
(the serialization of strings to JSON isn't consistent across all
language bindings), but I would file tickets against the implementations
and write test cases rather than guess at the spec
wording and make demands without knowing it's a real issue.

On Sat, Feb 12, 2022 at 6:55 PM Dan Schmitt <da...@gmail.com> wrote:
>
> Generally, the "I'm going to lump all my complaints into
> one big bug" approach is a good way to get them ignored.
>
> I'll skip "the design is wrong and it should change because
> I don't like it" and cite "it's used everywhere with lots of
> implementations so you can't change it in an incompatible way".
>
> I'll skip obvious typos and suggest you catch more flies with
> honey than with vinegar; you can work in a git repository and
> make a pull request for those and they'd be fixed fast.  (Or
> read 2 or 3 of the 8 implementations if the phrasing is confusing
> to you.)
>
> I'll make some suggestions on the technical details.
> So, specifically:
>
> * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker
> for this file". I don't like this point. It
> implies that container files are usually not equal. Thus it is not
> possible to compare them bitwise to determine
> equality. So, in my Avro implementation I write null bytes instead of
> this marker (yes, this possibly means that my
> implementation is non-conforming)
> * [Opinion] [Line 717]. There is no any marker for end of container
> file. Thus there is no way to determine whether all
> data was written
>
> If you use the sync marker, you don't need an end-of-container marker.
> (Flush the sync marker and container block map with the new data after a new
> block is written; if you have the metadata and block list, you know that
> much is complete and written for you to read; if you read the metadata
> and your sync marker is wrong, go re-read/continue reading.)
>
> * [Very theoretical bug, possible even security-related] [Line 435].
> Since you have a test case that proves it doesn't crash systems, it's
> sort of not a bug, right?  You could add the test case to the test suite.
>
> * [Bug] [Line 572]. "Currently for C/C++ implementations, the
> positions are practically an int, but theoretically a
> long". Wat? So, other implementations use int (as per spec), but C++
> uses long, right? So, go fix C++ implementation to
> match spec and other implementations
>
> This is not a bug, but an acknowledgement that the C/C++ offsets
> are internally implemented via pointer math to be efficient, but that if
> you try to read in enough data that a long offset makes sense,
> you will be sad/run out of memory.  The internal implementation
> for C/C++ supports the minimum required by the specification.
>
> * [Bug] [Line 385]. "For example, int and long are always serialized
> the same way". What this means? You probably mean
> that *same* int and long (i. e. int and long, which are numerically
> identical) serialized the same way.
>
> That rewrite is wrong.  Your wording would allow serialization to be
> altered by value (e.g. it would be allowable to use big-endian storage
> for odd numbers and little-endian for even ones, as each identical int
> and long would still be serialized the same way).
>
> * [Opinion] [Line 596]. "if its type is null, then it is encoded as a
> JSON null". There is no reasons to special-case
> nulls. This is additional requirement, which adds complexity to
> implementations without any reasons
>
> You are making assumptions about the implementation's encoding of null.
> A C developer would say that writing 0x00 to the file that you will read back
> later is fine for null or false or 0.
>
> * [Bug] [Line 417]. Can't take action; the request breaks compatibility.
>
> * [Bug] [Line 292]. "The null namespace may not be used in a
> dot-separated sequence of names". You defined previously
> null namespace as a empty string instead of *whole* namespace. I. e.
> null namespace is lack of namespace (i. e. lack of
> whole dot-separated sequence). So there is no sense in speaking on
> using null namespace in dot-separated sequence. You
> probably mean that one should not use empty string in dot-separated sequence
>
> That doesn't read as a bug at all to me; you have introduced a "whole
> namespace" and only then does it not make sense.
>
> a.b.c - a is the top level namespace, b is a namespace in a, c is a
> namespace in b.
>
> It's probably easier to read up on C++ namespaces, nesting and full
> specification to see how your introduction of a "whole namespace" where
> it doesn't exist is what is causing your confusion.
>
> I look forward to more bug reports and pull requests to iron out typos
> and standardize existing practice (you might want to read through
> other implementations to make sure your suggestions are useful to
> the community at large.)
>
> On Sat, Feb 12, 2022 at 5:22 PM Askar Safin <sa...@mail.ru> wrote:
> >
> > Hi. I'm writing my own Avro implementation in Rust for personal use. During this work I found a lot of issues in Avro
> > spec (the list follows).
> >
> > I send this mail not only to user and dev mailing lists of Avro, but also to Apache community list, Kafka list and 3
> > semi-randomly chosen Materialize employees. Because I want to draw attention to this problems. I hope this wider
> > community helps Avro fix their issues and possible will give necessary resources.
> >
> > As well as I understand Avro is used in Kafka. And Kafka, according to their site, is used in "80% of all Fortune 100
> > companies". So Avro is critical piece of infrastructure of humanity. It should be absolutely perfect (and so I list
> > even very small bugs). But it is not perfect.
> >
> > Some of items in this list are (small and big) bugs, some are typos, some are my objections to the design. Some are
> > fixable while keeping compatibility, some are not. I don't want to spend my time to report them as separate bugs, but
> > you can try to convince me to do so.
> >
> > I created this list simply by reading the spec from end to end (I skipped sections on RPC and logical types). And I
> > even didn't look at implementations!
> >
> > I write this is hope to help Avro.
> >
> > I think big audit of spec and its implementations should be done.
> >
> > All line numbers apply to spec.xml from tag release-1.11.0 (i. e.
> > https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml ). As well as I understand this tag
> > corresponds to currently published version at https://avro.apache.org/docs/current/spec.html .
> >
> > So, here we go.
> >
> > * [Opinion] [No line]. In Avro one have to define named records inside each other like so:
> >
> > { "type": "record", "name": "a", "fields": [{"name":"b","type":{"type":"record","name":"c",...}}] }
> >
> > This is very unnatural. In popular programming languages one usually define named record next to each other, not one
> > inside other. Such representation is not handy to deal programmatically. In my implementation I have to convert this
> > representation to usual form "root type + list of named types" right after reading JSON and convert back just before
> > writing.
> >
> > * [Opinion] [No line]. In this list you will see a lot of questions on Avro schema (encoded as JSON). Good JSON schema
> > ( https://json-schema.org/ ) would resolve many of them
> >
> > * [Seems to be bug] [Line 49]. "derived type name" is vague term. In fact, in whole spec phrase "type name" is used
> > very vaguely. Sometimes it means strings like "record" and sometimes it means names of named types. I propose to define
> > in very beginning of the spec terms for primitive types, terms for strings like "record" and terms for names of defined
> > types. Here is one possible way to do this: name strings like "record" and "fixed" "type kinds" and never name them
> > type names, thus reserving term "type name" to named types only (and possibly to primitive types).
> >
> > This issue already caused problems: look, for example, to this problems with {"type":"record","name":"record",...}:
> > https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .
> >
> > * [Opinion] [Line 58]. There is no primitive type for unsigned 64 bit integers. Such type is present in languages such
> > as C and Rust
> >
> > * [Very theoretical bug, possible even security-related] [Line 435]. "The float is converted into a 32-bit integer
> > using a method equivalent to Java's floatToIntBits and then encoded in little-endian format". If we click at provided
> > link, we will see that this Java function does NaN normalization. I think NaN normalization is good thing. But I think
> > this quite possible spec implementers overlooked this NaN normalization requirement. So I propose: write explicitly
> > directly in Avro spec that NaN are normalized. Audit all Avro implementations: whether they actually implemented this
> > requirement. Create tests, which will actually test this requirement.
> >
> > Also I don't know whether bit pattern provided in that Java doc (0x7fc00000) is quiet NaN or signaling. If it is
> > signaling, this is very bad.
> >
> > As well as I understand if you will configure your FPU particular way than merely copying signaling NaN from one place
> > to another will abort your program. So, if your FPU is configured certain way then feeding particular binary Avro data
> > to a program can crash it. I. e. this is security problem. So a reader should be careful to check whether input data is
> > signaling NaN *before* storing it in floating point registers.
> >
> > I checked whether manipulating signaling NaN can actually crash a program in default settings in Windows and Linux. And
> > it turned out that a program will not crash. Still I think signaling NaN should be handled carefully.
> >
> > Write to spec that writers should normalize NaNs, that readers should reject non-normalized NaNs and that readers
> > should be careful not to store incoming floating number to floating-point variable before its sanitizing. Write that
> > this is security issue.
> >
> > * [Opinion] [Line 68]. "unicode character sequence". As well as I understand Unicode character sequence means sequence
> > of Unicode scalar values. Note that scalar value is not same thing as code point. Unfortunately, some people don't know
> > this, so please write explicitly: "this is sequence of scalar values, not code points", to make sure implementations
> > will be correct
> >
> > * [Bug] [Line 71]. "Primitive types have no specified attributes". This is lie. At line 1527 you specify logical type
> > based on primitive type int. Thus you specify particular meaning of attribute "logicalType" for primitive type "int".
> > Be careful at your wording. The spec should be rock-solid
> >
> > * [Opinion] [Line 96]. "aliases: a JSON array of strings, providing alternate names for this record (optional)". Is
> > empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may say this is nitpicking, but I don't think so.
> > Avro has important place in our infrastructure, so everything is important. Think carefully whether empty list (and
> > duplicates) is allowed everywhere in the spec where you see some kind of list. I think empty arrays (and duplicates)
> > should be disallowed in this particular case. Because the more things we allow, the bigger is attack surface
> >
> > * [Bug] [Line 98]. "fields: a JSON array, listing fields". How many fields allowed? Already reported by me at
> > https://issues.apache.org/jira/browse/AVRO-3279
> >
> > * [Bug] [Line 235]. "Unions". How many variants in union allowed? Already reported by me at
> > https://issues.apache.org/jira/browse/AVRO-3280
> >
> > * [Bug] [Line 101]. "name: a JSON string providing the name of the field (required), and". Word "and" usually placed
> > immediately before last item in sequence. The text here looks like item "doc" was last, but then the text was edited
> > not carefully. This is very stupid typographic issue which shows that authors are not careful about spec quality. Also,
> > the spec is not consistent on placing dots after items (this applies to whole spec). Sometimes I see nothing in the end
> > of item, sometimes "." and sometimes ";"
> >
> > * [Opinion] [Line 106]. "default: A default value for..." What follows is essentially description of JSON
> > representation of Avro datum (except for unions). So, you managed to put very important part of your spec directly into
> > one paragraph into second level bullet point?!
> >
> > * [Opinion] [Line 112]. "Default values for union fields correspond to the first schema in the union". This phrase is
> > difference between JSON encoding for Avro data and JSON encoding for default field. And, of course, presence of this
> > difference is design bug
> >
> > * [Opinion] [Line 113]. "Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255
> > are mapped to unsigned 8-bit byte values 0-255". Wat? This is very unnatural encoding. You misuse JSON string. They are
> > for strings, they are not for binary data. You should use array of numbers instead. I. e. encode bytes 0x0f 0x02 as
> > [15, 2]. Moreover, how you will encode null bytes? "\u0000", right? C programs have difficulties with such strings
> >
> > * [Bug] [Line 123]. Okey, so longs are encoded as JSON integers. But what if given long is not a JSON-safe integer? As
> > we know from https://datatracker.ietf.org/doc/html/rfc7159 integers outside of range [-(2**53)+1, (2**53)-1] are not
> > JSON-safe
> >
> > * [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON, despite they seems to be allowed by Avro spec.
> > So, JSON representation of Avro data is incomplete
> >
> > * [Bug] [Line 128]. "enum". What is an enum? :) This term is not yet defined by the spec at this point. For unknown reasons,
> > essentially the whole description of the JSON representation of Avro data is inserted into one small paragraph before the
> > type system is fully described. Please use terms only after their definition
> >
> > * [Stupid bug] [Line 168]. "namespace". The namespace attribute is not marked as either optional or required. Do you ever read your own spec?
> >
> > * [Even more stupid bug] [Line 168]. "<em>namespace</em>". The word "namespace" is marked up with <em>, not <code>, and we
> > can see this in the rendered version. This is a very silly typographic bug which is immediately obvious to anybody reading
> > this document, even to non-technical people
> >
> > * [Bug] [Line 200]. "a single attribute". As we can see in the provided example, "default" is allowed, too. What is the
> > meaning of this "default" attribute? And how does its meaning differ from the meaning of the "default" key in a field
> > description? (Same for maps)
> >
> > * [Bug] [Line 238]. "declares a schema which may be either a null or string". This is wrong. The schema is ["null", "string"];
> > the *value* may be a null or a string. Please check that you don't confuse types (schemas) with values throughout the spec
> >
> > * [Very opinionated opinion :)] [Line 235]. I don't like your unions at all. I come from languages like Haskell and
> > Rust, where true sum types are supported. They are similar to your unions, but their alternatives are named.
> > Alternatives are identified by their names, so there is no restriction on duplicate types, and hence no need for the very
> > unnatural restriction "Unions may not contain more than one schema with the same type, except for the named types
> > record, fixed and enum"
> >
> > * [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings, providing alternate names for this enum". You
> > mean "fixed", right? So the section on enums was copy-pasted? Has anyone ever read the spec from end to end even once?
> >
> > * [Bug] [Line 265]. "size: an integer, specifying the number of bytes per value". Is zero allowed?
> >
> > * [Bug] [Line 292]. "The null namespace may not be used in a dot-separated sequence of names". You previously defined the
> > null namespace as an empty string for the *whole* namespace, i.e. the null namespace is the lack of a namespace (the lack of
> > the whole dot-separated sequence). So it makes no sense to speak of using the null namespace in a dot-separated sequence. You
> > probably mean that one should not use an empty string as a component of a dot-separated sequence
> >
> > * [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished by specifying an additional schema, the
> > results of which are described in Schema Resolution". The term "additional schema" is vague here. I would phrase it as:
> > "Deserializing data into a newer schema is accomplished by using an algorithm described in Schema Resolution"
> >
> > * [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to read Avro data with a schema that does not..."
> > The whole paragraph is very vague. On first reading I thought it was about schema resolution. After several
> > attempts to understand it I finally realised that the paragraph is about attempts to read data without the original writer's
> > schema available at all. I propose removing the whole paragraph or rewriting it completely
> >
> > * [Bug] [Line 385]. "For example, int and long are always serialized the same way". What does this mean? You probably mean
> > that the *same* int and long (i.e. an int and a long that are numerically identical) are serialized the same way.
> >
> > * [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague. Do you mean no bytes at all? Or null bytes,
> > i.e. some unspecified number of null bytes? (Of course, I understand that you mean the first variant, but I still don't
> > like the phrase)
> >
> > * [Bug] [Line 417]. "int and long values are written..." Is the canonical (i.e. smallest) encoding of numbers required?
> > Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307
> >
> > I still think canonical representations should be required. The more forms of encoding you allow, the bigger the attack
> > surface (a sketch of the problem follows below).
> >
> > Also, it would be a desirable property for binary representations to be equal whenever the data is equal. It would be good
> > if you guaranteed this property at least for some subset of schemas (and, of course, stated explicitly for which
> > schemas the property is guaranteed). Non-canonical representations break this property
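> >
> > To make the concern concrete, here is a sketch of mine (not taken from any implementation) of the variable-length zig-zag
> > decoding described in the spec; note that a padded, non-minimal encoding decodes to exactly the same value as the minimal one:
> >
> > fn zigzag_varint_decode(bytes: &[u8]) -> Option<i64> {
> >     let mut value: u64 = 0;
> >     let mut shift = 0;
> >     for &b in bytes {
> >         if shift >= 64 {
> >             return None; // too many bytes
> >         }
> >         value |= u64::from(b & 0x7f) << shift;
> >         if b & 0x80 == 0 {
> >             // undo the zig-zag interleaving: unsigned back to signed
> >             return Some((value >> 1) as i64 ^ -((value & 1) as i64));
> >         }
> >         shift += 7;
> >     }
> >     None // input ended before the final byte
> > }
> >
> > fn main() {
> >     assert_eq!(zigzag_varint_decode(&[0x02]), Some(1));       // minimal encoding of 1
> >     assert_eq!(zigzag_varint_decode(&[0x82, 0x00]), Some(1)); // padded encoding of the same 1
> > }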
> >
> > * [Bug] [Line 446]. "bytes of UTF-8". As far as I understand, UTF-8 is a sequence of encoded scalar values (don't confuse
> > them with code points). Unfortunately, not everybody knows this, and thus we see WTF-8 (i.e. an encoding similar to UTF-8,
> > but with standalone surrogates) in places where proper UTF-8 should reside. So everywhere the spec says "UTF-8" I propose to
> > state explicitly that standalone surrogates are not allowed and that readers should fail if they find them (I would prefer to
> > place this sentence in the introduction of the spec; a sketch of such a check follows below)
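> >
> > For what it's worth, a strict UTF-8 validator already rejects encoded surrogates, so in Rust the check is essentially free
> > (a sketch of mine, not from any Avro library):
> >
> > fn decode_avro_string(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
> >     // std::str::from_utf8 accepts only well-formed UTF-8, so WTF-8/CESU-8 style
> >     // encoded surrogates (0xED 0xA0..0xBF ...) are rejected here.
> >     std::str::from_utf8(bytes)
> > }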
> >
> > * [Bug] [Line 572]. "Currently for C/C++ implementations, the positions are practically an int, but theoretically a
> > long". So other implementations use int (as per the spec), but C++ uses long, right? Then fix the C++ implementation to
> > match the spec and the other implementations
> >
> > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a JSON null". There is no reason to special-case
> > nulls. This is an additional requirement which adds complexity to implementations for no reason
> >
> > * [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." It seems I have found a real bug. :) Consider this
> > schema:
> >
> > [{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]
> >
> > As far as I understand, such a schema is fully allowed. Now consider this encoded value: {"map":{"b":0}}. Is it a map or
> > the record named "map"?
> >
> > * [Bug] [Line 677]. "data is ordered by ascending numeric value". What about NaNs?
> >
> > * [Bug] [Line 682]. "compared lexicographically by Unicode code point". Replace "code point" with "scalar value": UTF-8
> > consists of encoded scalar values
> >
> > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for this file". I don't like this point. It
> > implies that two container files holding the same data are usually not byte-for-byte equal, so it is not possible to compare
> > them bitwise to determine equality. So, in my Avro implementation I write null bytes instead of this marker (yes, this
> > possibly means that my implementation is non-conforming)
> >
> > * [Opinion] [Line 717]. There is no marker for the end of a container file, so there is no way to determine whether all
> > the data was written
> >
> > * [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable to string". How exactly are these values promoted?
> >
> > * [Bug] [Line 1153]. What should an implementation do when it performs schema resolution? Should it first check that the
> > schemas match (and report any errors) and then read the data, or proceed straight to reading the data? This is an important
> > distinction. For example, what happens when we attempt to read a file container with no data elements using the schema
> > resolution algorithm? (Are such containers allowed, by the way?) In the first case the schema check should still be
> > performed; in the second case such a read should always succeed.
> >
> > If you think the first case is correct, then the section should describe the algorithm for determining whether schemas match
> > separately from the algorithm for actually reading the data
> >
> > * [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int" instead of {"type":"int"}>>, right?
> >
> > * [Bug] [Line 1331]. "replace any escaped characters". Any? What about "a\"b"? It is impossible to replace :)
> >
> > ----
> >
> > Some notes about my task. I want to implement this:
> > https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ . I.e. I want to have shell utilities on Linux
> > which exchange structured data. Here is how I chose the format for representing that data.
> >
> > * I want the format to be binary, not textual; this rules out JSON, XML, etc.
> > * I want the format to be typed; this rules out CBOR etc.
> > * I want the format to support proper sum types (similar to Haskell's); this rules out Microsoft Bond. As far as
> > I understand, this also rules out GVariants, proposed in the above-mentioned article. And it rules out Protobuf:
> > Protobuf has support for sum types (they are named OneOf), but these OneOfs are always optional (speaking in Avro
> > language: you always get ["null", "int", "bool"] instead of ["int", "bool"])
> > * I want the format to support recursive types; this rules out Bare ( baremessages.org )
> >
> > So not many formats are left: Avro and possibly a few more. And I chose Avro. And I really like it. Because:
> >
> > * It is very compact
> > * It has a very elegant way to support schema evolution (as opposed to Protobuf, where fields are tagged, i.e. you trade
> > space efficiency for schema evolution)
> > * It has a container format with the schema attached
> > * You don't need to write an item count to the container header (good for streaming)
> >
> > So Avro is simply the *best* format for my task. But then I discovered its problems (listed above). How did it happen that
> > such a good format has such a bad spec? How did it happen that the *best* format for this task turned out to be specified so
> > badly? What does this say about our industry?
> >
> > ==
> > Askar Safin
> > http://safinaskar.com
> > https://sr.ht/~safinaskar
> > https://github.com/safinaskar

Re: Gigantic list of Avro spec issues

Posted by Dan Schmitt <da...@gmail.com>.
Generally the "I'm going to lump all my complaints into
one big bug" is a good way to get them ignored.

I'll skip "the design is wrong and it should change because
I don't like it" and cite "it's used everywhere with lots of
implementations so you can't change it in an incompatible way".

I'll skip the obvious typos and suggest you catch more flies with
honey than vinegar; you can work in the git repository and
make a pull request for those, and they'd be fixed fast.  (Or
read 2 or 3 of the 8 implementations if the phrasing is confusing
to you.)

I'll make some suggestions on the technical details.
So, specifically:

* [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker
for this file". I don't like this point. It
implies that container files are usually not equal. Thus it is not
possible to compare them bitwise to determine
equality. So, in my Avro implementation I write null bytes instead of
this marker (yes, this possibly means that my
implementation is non-conforming)
* [Opinion] [Line 717]. There is no any marker for end of container
file. Thus there is no way to determine whether all
data was written

If you use the sync marker, you don't need an end-of-container marker.
(Flush the sync marker and the container block along with the new data after
the new block is written. If you have the metadata and the block list, you know
that much is complete and written for you to read; if you read the metadata and
the sync marker is wrong, go re-read / continue reading.)
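
As a rough sketch (mine, not from any Avro library) of one way a reader can use
the sync marker to decide how much of a file is known to be completely written,
assuming body is the file content after the header and sync is the 16-byte
marker taken from the header:

/// Offset just past the last sync marker found; everything before it is
/// made up of fully written blocks.  A real reader would walk the
/// (count, size) block headers and use the marker only to validate/resync,
/// since block data can contain the marker bytes by coincidence.
fn complete_prefix(body: &[u8], sync: &[u8; 16]) -> usize {
    let mut last_end = 0;
    let mut pos = 0;
    while pos + 16 <= body.len() {
        if &body[pos..pos + 16] == &sync[..] {
            last_end = pos + 16;
            pos = last_end;
        } else {
            pos += 1;
        }
    }
    last_end
}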

* [Very theoretical bug, possible even security-related] [Line 435].
Since you have a test case that proves it doesn't crash systems, it's
sort of not a bug, right?  You could add the test case to the test suite.

* [Bug] [Line 572]. "Currently for C/C++ implementations, the
positions are practically an int, but theoretically a
long". Wat? So, other implementations use int (as per spec), but C++
uses long, right? So, go fix C++ implementation to
match spec and other implementations

This is not a bug, but an acknowledgement that the C/C++ offsets
are internally implemented via pointer math to be efficient, but if
you try to read in enough data that a long offset makes sense,
you will be sad/run out of memory.  The point is that the internal
C/C++ implementation supports the minimum required by the specification.

* [Bug] [Line 385]. "For example, int and long are always serialized
the same way". What this means? You probably mean
that *same* int and long (i. e. int and long, which are numerically
identical) serialized the same way.

That rewrite is wrong.  Your wording would allow serialization to be
altered by value (e.g. it would be allowable to use big-endian storage
for odd numbers and little-endian for even numbers, as each same int and
long would still be serialized the same way.)

* [Opinion] [Line 596]. "if its type is null, then it is encoded as a
JSON null". There is no reasons to special-case
nulls. This is additional requirement, which adds complexity to
implementations without any reasons

You are making assumptions about the implementation's encoding of null.
A C developer would say that writing 0x00 to the file that you will read back
later is fine for null or false or 0.

* [Bug] [Line 417]. - no action can be taken; the request breaks compatibility.

* [Bug] [Line 292]. "The null namespace may not be used in a
dot-separated sequence of names". You defined previously
null namespace as a empty string instead of *whole* namespace. I. e.
null namespace is lack of namespace (i. e. lack of
whole dot-separated sequence). So there is no sense in speaking on
using null namespace in dot-separated sequence. You
probably mean that one should not use empty string in dot-separated sequence

That doesn't read as a bug at all to me; you have introduced a "whole
namespace" concept, and only with that concept does it not make sense.

a.b.c - a is the top level namespace, b is a namespace in a, c is a
namespace in b.

It is probably easier to read up on C++ namespaces, nesting, and fully
qualified names to see how your introduction of a "whole namespace" where
it doesn't exist is what is causing your confusion.

I look forward to more bug reports and pull requests to iron out typos
and standardize existing practice (you might want to read through
other implementations to make sure your suggestions are useful to
the community at large.)

Re: [External] Gigantic list of Avro spec issues

Posted by Shannon Carey <sc...@expediagroup.com.INVALID>.
I am not an Avro expert, but here are some thoughts:

In general, I think you should submit a PR or PRs to suggest improvements, unless you are unsure what change should be made. That seems more productive than a list of things that you think are "stupid".

Another possibility would be putting together some of these suggestions into the contributor guidance https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute or maybe some kind of spec code review checklist.

You might be interested in using AVDL as a more natural way of defining types than AVSC. Also, you can combine multiple AVSCs together into the same context in order to normalize your type declarations. If you haven't already noticed, you can refer to an existing record type by its fully qualified name.

I agree with your critique of "derived type". It should probably say "complex type" instead.

"Type name" does not seem particularly confusing to me. It goes in the "typeName" part of "type": "typeName", and it's either one of the primitive type names, "record", "enum", "array", "map", or "fixed".

Regarding unsigned 64 bit integers, that's true, but languages like Java and Python don't have it, so you'd have to figure out what those other languages should do. 

Regarding signaling NaN, the Java spec states, "The Java Virtual Machine has no signaling NaN value." But it makes sense that this distinction is important for other languages.

> "Default values for union fields correspond to the first schema in the union". This phrase is 
    difference between JSON encoding for Avro data and JSON encoding for default field. And, of course, presence of this 
    difference is design bug

I don't follow you, maybe clarify in case other people have the same problem?

> "The null namespace may not be used in a dot-separated sequence of names". You defined previously 
    null namespace as a empty string instead of *whole* namespace. I. e. null namespace is lack of namespace (i. e. lack of 
    whole dot-separated sequence). So there is no sense in speaking on using null namespace in dot-separated sequence. You 
    probably mean that one should not use empty string in dot-separated sequence

I'm not sure you have that exactly right. The null namespace isn't defined as an empty string; rather, the empty string is used to indicate the null namespace. The null namespace is not necessarily just the lack of a namespace. If you omit "namespace" and omit the namespace in "name", then the type uses the namespace of the enclosing definition, and only uses the null namespace if there is no enclosing namespace. However, I agree that it could be reworded to say that each <name> component of a namespace must be a non-empty string.

>     * [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished by specifying an additional schema, the 
    results of which are described in Schema Resolution". Term "additional schema" is vague here. I would say so: 
    "Deserializing data into a newer schema is accomplished by using an algorithm described in Schema Resolution"

I wrote that in https://github.com/apache/avro/pull/91/files in an effort to improve the spec after suffering frustration similar to what you may be experiencing :) Though it did go through some code review comments where my questions were not fully answered, and resulted in the wording you see now in an effort to explain without misleading or making inaccurate claims. That paragraph begins by explaining the importance and use of the "writer's schema". The "additional schema" is referring to what I usually call the "reader's schema". I believe that a better rewording than what you suggest would be something like, "Deserializing data into a newer schema is accomplished by providing that newer schema as the <i>reader's</i> schema, which is utilized as described in Schema Resolution."

> "Therefore, it is possible, though not advisable, to read Avro data with a schema that does not..." 
    The whole paragraph is very vague. At first reading I thought that it is about schema resolution. After several 
    attempts to understand it I finally understood that the paragraph is about reading attempts without original writer 
    schema available at all. I propose removing whole paragraph or rewriting it completely

I mostly agree. That paragraph was re-worded in reaction to this comment https://github.com/apache/avro/pull/91/commits/1024216a68cdb7ec54ad3d126953755314966588#r77443864 and this comment https://github.com/apache/avro/pull/91/commits/69895cdf468eb75890a3f9dd9a0d89140905417f#r78588984 It doesn't make any sense to me that anyone should try to read Avro using a deserializing writer's schema that has the same type structure as the original writer's schema but not the same canonical form. It really doesn't make any sense to me that anyone would want to use anything but the original writer's schema. Also, the section about "Parsing Canonical Form for Schemas" is very specific that it is the definition of what it means for the reader to have "the same" schema as a writer. So I don't know why we have to confuse the issue by suggesting strange ways of using Avro, or that there are other ways for schemas to be "the same". But, I was trying to satisfy the code review comments because I thought the documentation really needed clarifications about the role of the reader & writer schema, and I didn't want it to be rejected.

I believe the paragraph does provide useful information about the serialization & deserialization approach, though, which helps clarify the physical format of the data in relation to the schema, and helps users understand the implications of schema changes.

> int and long are always serialized the same way, so an int could be deserialized as a long

The spec mentions, "int and long values are written using variable-length zig-zag coding." In other words, an int, once serialized, is indistinguishable from a long. Therefore, even if the writer schema had it as an "int", if your writer schema has it as a "long" it will still work: you will get back the correct value. Since they are both serialized the same way, an int can be deserialized as a long. If int and long each had their own serialization approach, that wouldn't be the case.

> "null is written as zero bytes". The phrase is vague.
Doesn't seem very vague to me. Zero bytes. How would you reword it?

That's all I have time for,

Shannon


Re: Gigantic list of Avro spec issues

Posted by Adam Bellemare <ad...@gmail.com>.
Hi Askar

This is certainly an extensive list. I wanted to email you just to let you
know that *someone* has seen it. Though I can't speak to all of it, I do
have a few of my own impressions:

> How did it happen that such a good format has such a bad spec? How did it
happen that the *best* format for this task turned out to be so bad? What does
this say about our industry?

I think you're hitting on some very good points here. I contributed to Avro
a ways back, and after not having much in the way of traction or responses,
Doug Cutting himself stepped in to help me out and get my PR pushed
through. I think it took another year until the official version with the
fix I needed was released. I suspect, though I don't want to accuse in any
way, that there are not that many contributors to Avro. Part of this may be
that it has predominantly been Java-centric, and part of this may be that
many folks may have considered it "solved". I am not sure. These are only my
thoughts.

I am equally concerned about the seeming lack of involvement in this space.
I don't have an answer. I do know that many open source software projects
suffer a similar sort of "high-use, low-contribution" problem. I think it's
important that you voiced this as a concern. But I don't have a solution
for you at the moment. Perhaps someone with more knowledge or experience
working on or with Avro can chime in - again, I hope it is clear that I am
not finger pointing - I know maintaining OSS is *hard*, and I suspect that
we may simply have run out of people willing to put the effort in to make
Avro better than it currently is...

Adam


