Posted to user@avro.apache.org by roger peppe <ro...@gmail.com> on 2020/01/06 17:36:26 UTC

More idiomatic JSON encoding for unions

Hi,

The JSON encoding in the specification
<https://avro.apache.org/docs/current/spec.html#json_encoding> wraps every
union value other than null in an object carrying an explicit type name. This
means that a JSON-encoded Avro value containing a union is very rarely directly
compatible with normal JSON formats.

For example, it's very common for a JSON-encoded value to allow a value
that's either null or string. In Avro, that's trivially expressed as the
union type ["null", "string"]. With conventional JSON, a string value "foo"
would be encoded just as "foo", which is easily distinguished from null
when decoding. However, when using the Avro JSON format it must be encoded
as {"string": "foo"}.

This means that Avro JSON-encoded values don't interchange easily with
other JSON-encoded values.
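
For example, with a hypothetical record whose "name" field has the type
["null", "string"] (the record and field names here are just for illustration),
the two encodings of the same datum look like:

    Avro JSON encoding:   {"name": {"string": "foo"}}   or   {"name": null}
    typical "plain" JSON: {"name": "foo"}               or   {"name": null}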

AFAICS the main reason that the type name is always required in
JSON-encoded unions is to avoid ambiguity. This particularly applies to
record and map types, where it's not possible in general to tell which
member of the union has been specified by looking at the data itself.

However, that reasoning doesn't apply if all the members of the union can
be distinguished by their JSON token types alone.

I am considering using a JSON encoding that omits the type name when all
the members of the union encode to distinct JSON token types (the JSON
token types being: null, boolean, string, number, object and array).

For example, JSON-encoded values using the Avro schema ["null", "string",
"int"] would encode as the literal values themselves (e.g. null, "foo", 999),
but JSON-encoded values using the Avro schema ["int", "double"] would
require the type name because the JSON lexeme doesn't distinguish between
different kinds of number.
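
To make that concrete (the values are arbitrary): under ["null", "string", "int"]
the encoded values

    null
    "foo"
    999

are each identifiable from their JSON token type alone, whereas under
["int", "double"] a bare 999 could belong to either member, so the wrapped
forms {"int": 999} and {"double": 999} are still needed.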

This would mean that it would be possible to represent a significant subset
of "normal" JSON schemas with Avro. It seems to me that would potentially
be very useful.
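
For concreteness, here's a minimal sketch of how the current wrapped encoding
shows up when using the stock Avro Java library (the record schema is made up
for illustration):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;

    public class UnionJsonDemo {
        public static void main(String[] args) throws Exception {
            // A record with one optional (nullable) string field.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
              + "{\"name\":\"name\",\"type\":[\"null\",\"string\"]}]}");

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("name", "foo");

            // Encode the record with the standard JSON encoder.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Encoder enc = EncoderFactory.get().jsonEncoder(schema, out);
            new GenericDatumWriter<GenericRecord>(schema).write(rec, enc);
            enc.flush();

            // Prints {"name":{"string":"foo"}} - the union value is wrapped
            // in an object naming its branch, unlike typical JSON APIs.
            System.out.println(out.toString("UTF-8"));
        }
    }

Under the encoding sketched above, the same datum would simply print
{"name":"foo"}.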

Thoughts? Is this a really bad idea to be contemplating? :)

  cheers,
    rog.

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
I have hacked logical types in my fork to add this capability; if you want to take a look, see:
https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/LogicalType.java#L78 <https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/LogicalType.java#L78> 

My goal was to make decimal be encoded as a number in JSON.
It is a hack: it works, but it won’t win any beauty contests :-) and right now I don’t see how to make it clean enough to be accepted mainstream.

It would be a lot cleaner to elevate these logical types to first class types, and standardize the encoding appropriately.
Decimal clearly needs to be a first-class type; I'm not sure about timestamp-micros...
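
(For a rough sense of the difference, assuming a hypothetical bytes-backed decimal with scale 2: the value 5.00 has unscaled value 500, whose two raw bytes render in the standard JSON encoding as a string along the lines of "\u0001\u00f4", whereas a first-class decimal could render it simply as the number 5.00.)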

—Z


> On Jan 16, 2020, at 2:20 PM, roger peppe <ro...@gmail.com> wrote:
> 
> On Thu, 16 Jan 2020, 18:59 Zoltan Farkas, <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> answers inline
> 
>> On Jan 16, 2020, at 5:51 AM, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>> 
>> On Wed, 15 Jan 2020 at 18:51, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
>> What I mean with timestamp-micros, is that it is currently restricted to being bound to long,
>> I see no reason why it should not be allowed to be bound to string as well. (the change should be simple to implement)
>> 
>> Wouldn't have the implication of changing the binary representation too, which is not necessarily desirable (it's bulkier, slower to decode and has more potential error cases) ?
> 
> yes, it would, but this is how logical types work, and I see no good way to change this.  (this is what i meant by paying the readability cost in place where it is irrelevant)
> 
> So you think that the JSON representation should always match the underlying type and ignore the logical type? I can understand the reasoning behind that, but it doesn't feel very user friendly in some cases (thinking of decimal and duration in particular).
> 
> Given their privileged place in the specification, I was thinking that some logical types could gain privilege here.
> 
> Aside: I'm a bit concerned about the potential for data corruption from interchange between timestamp-micros and timestamp-millis, which, as far as understand the spec, look like they'll be treated as compatible with each other.
> 
> 
>> 
>> 
>> regarding the media type, something like: application/avro.2+json would be fine.
>> 
>> Attaching the ".2" to "avro" rather than "json" seems to be implying a new Avro version, rather than a new JSON-encoding version? Or is the idea that the version number here is implying both the JSON-encoding version and the underlying Avro version?  The MIME standard seems to be silent on this AFAICS.
>> 
> 
> the reason why I would use +json at the end is because it would be a subtype sufix: https://en.wikipedia.org/wiki/Media_type#Suffix <https://en.wikipedia.org/wiki/Media_type#Suffix> and most browsers will recognize it as json, and potentially format it...
> 
> Ah, nice, I wasn't aware of RFC 6838.
> 
>> 
>> Other then that the proposal looks good. can you start a PR with the spec update?
>> 
>> I can do, but I don't hold out much hope of it getting merged. I started a PR with a much more minor change <https://github.com/apache/avro/pull/738> almost 2 months ago and haven't seen any response yet.
> 
> Send out a email on the dev mailing list, the committers seem more responsive lately...
> 
> I'll give it a go :)
> 
>   cheers,
>     rog.
> 
>> 
>>   cheers,
>>     rog.
>> 
>> —Z
>> 
>>> On Jan 15, 2020, at 12:30 PM, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
>>> See comments in-line below:
>>> 
>>>> On Jan 15, 2020, at 3:42 AM, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Oops, I left arrays out! Two other thoughts: 
>>>> 
>>>> I wonder if it might be worth hedging bets about logical types. It would be nice if (for example) a `timestamp-micros` value could be encoded as an RFC3339 string, so perhaps that should be allowed for, but maybe that's a step too far.
>>> I think logical types should stay above the encoding/decoding…  
>>> With timestamp-micros we could extend it to make it applicable to string and implement the converters, and then in json you would have something readable, but you would then have the same in binary and pay the readability cost there as well.
>>> 
>>> I'm not sure what you mean there. I wouldn't expect the Avro binary format to be readable at all.
>>> 
>>> I implemented special handling for decimal logical type in my encoder/decoder, but the best implementation I could do still feels like a hack...
>>> 
>>>> I wonder if there should be some indication of version so that you know which JSON encoding version you're reading. Perhaps the Avro schema could include a version field (maybe as part of a definition) so you know which version of the spec to use when encoding/decoding. Then bet-hedging wouldn't be quite as important.
>>> I think Schema needs to stay decoupled from the encoding. The same schema can be encoded in various ways (I have a csv encoder/decoder for example, https://demo.spf4j.org/example/records?_Accept=text/csv <https://demo.spf4j.org/example/records?_Accept=text/csv> ).
>>> I think the right abstraction for what you are looking for is the Media Type(https://en.wikipedia.org/wiki/Media_type <https://en.wikipedia.org/wiki/Media_type> ), 
>>> It would be helpful to “standardize” the media types for the avro encodings:
>>> 
>>> Yes, on reflection, I agree, even though not every possible medium has a media type. For example, what if we're storing JSON data in a file? I guess it would be up to us to store the type along with the data, as the registry message wire format <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format> does, for example by wrapping the entire value in another JSON object.
>>>  
>>> Here is what I mean, (with some examples where the same schema is served with different encodings):
>>> 
>>> 1) Binary: “application/avro” https://demo.spf4j.org/example/records?_Accept=application/avro <https://demo.spf4j.org/example/records?_Accept=application/avro>
>>> 2) Current Json: “application/avro+json" https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>>> 3) New Json: “application/avro-x+json” ?  https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>>> 
>>> ISTM that "x" isn't a hugely descriptive qualifier there. How about "application/avro+json.v2" ? Then it's clear what to do if we want to make another version.
>>> 
>>>  
>>> The media type including the avro schema (like you can see in the response ContentType in the headers above) can provide complete type  information to be able to read a avro object from a byte stream.
>>> 
>>> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
>>> 
>>> In HTTP context this fits well with content negotiation, and a client can ask for a previous version like:
>>> 
>>> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22 <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22> 
>>> 
>>> Note on $ref,  it is an extension to avsc I use to reference schemas from maven repos. (see https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences <https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences> if interested in more detail)
>>> 
>>> Interesting stuff. I like the idea of being able to get the server to check the desired client encoding, although I'm somewhat wary of the potential security implications of $ref with arbitrary URLs.
>>> 
>>> Apart from the issues you raised, does my description of the proposed semantics seem reasonable? It could be slightly cleverer and avoid type-name wrapping in more situations, but this seemed like a nice balance between easy-to-explain and idiomatic-in-most-situations.
>>> 
>>>    cheers,
>>>      rog.
>>> 
>> 
> 


Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
On Thu, 16 Jan 2020, 18:59 Zoltan Farkas, <zo...@yahoo.com> wrote:

> answers inline
>
> On Jan 16, 2020, at 5:51 AM, roger peppe <ro...@gmail.com> wrote:
>
> On Wed, 15 Jan 2020 at 18:51, Zoltan Farkas <zo...@yahoo.com> wrote:
>
>> What I mean with timestamp-micros, is that it is currently restricted to
>> being bound to long,
>> I see no reason why it should not be allowed to be bound to string as
>> well. (the change should be simple to implement)
>>
>
> Wouldn't have the implication of changing the binary representation too,
> which is not necessarily desirable (it's bulkier, slower to decode and has
> more potential error cases) ?
>
>
> yes, it would, but this is how logical types work, and I see no good way
> to change this.  (this is what i meant by paying the readability cost in
> place where it is irrelevant)
>

So you think that the JSON representation should always match the
underlying type and ignore the logical type? I can understand the reasoning
behind that, but it doesn't feel very user friendly in some cases (thinking
of decimal and duration in particular).

Given their privileged place in the specification, I was thinking that some
logical types could gain privilege here.

Aside: I'm a bit concerned about the potential for data corruption from
interchange between timestamp-micros and timestamp-millis, which, as far as
I understand the spec, look like they'll be treated as compatible with each
other.
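
(As a rough illustration: a timestamp-micros value of about 1,578,000,000,000,000,
i.e. early January 2020, read back as timestamp-millis without conversion would
land tens of thousands of years in the future, while a millis value read as
micros would collapse recent dates back to early 1970.)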


>
>
>> regarding the media type, something like: application/avro.2+json would
>> be fine.
>>
>
> Attaching the ".2" to "avro" rather than "json" seems to be implying a new
> Avro version, rather than a new JSON-encoding version? Or is the idea that
> the version number here is implying both the JSON-encoding version *and* the
> underlying Avro version?  The MIME standard seems to be silent on this
> AFAICS.
>
>
> the reason why I would use +json at the end is because it would be a
> subtype sufix: https://en.wikipedia.org/wiki/Media_type#Suffix and most
> browsers will recognize it as json, and potentially format it...
>

Ah, nice, I wasn't aware of RFC 6838.

>
>
>> Other then that the proposal looks good. can you start a PR with the spec
>> update?
>>
>
> I can do, but I don't hold out much hope of it getting merged. I started a
> PR with a much more minor change <https://github.com/apache/avro/pull/738>
> almost 2 months ago and haven't seen any response yet.
>
>
> Send out a email on the dev mailing list, the committers seem more
> responsive lately...
>

I'll give it a go :)

  cheers,
    rog.

>
>
>   cheers,
>     rog.
>
>>
>> —Z
>>
>> On Jan 15, 2020, at 12:30 PM, roger peppe <ro...@gmail.com> wrote:
>>
>> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas <zo...@yahoo.com> wrote:
>>
>>> See comments in-line below:
>>>
>>> On Jan 15, 2020, at 3:42 AM, roger peppe <ro...@gmail.com> wrote:
>>>
>>> Oops, I left arrays out! Two other thoughts:
>>>
>>>
>>>    - I wonder if it might be worth hedging bets about logical types. It
>>>    would be nice if (for example) a `timestamp-micros` value could be encoded
>>>    as an RFC3339 string, so perhaps that should be allowed for, but maybe
>>>    that's a step too far.
>>>
>>> I think logical types should stay above the encoding/decoding…
>>> With timestamp-micros we could extend it to make it applicable to string
>>> and implement the converters, and then in json you would have something
>>> readable, but you would then have the same in binary and pay the
>>> readability cost there as well.
>>>
>>
>> I'm not sure what you mean there. I wouldn't expect the Avro binary
>> format to be readable at all.
>>
>> I implemented special handling for decimal logical type in my
>>> encoder/decoder, but the best implementation I could do still feels like a
>>> hack...
>>>
>>>
>>>    - I wonder if there should be some indication of version so that you
>>>    know which JSON encoding version you're reading. Perhaps the Avro schema
>>>    could include a version field (maybe as part of a definition) so you know
>>>    which version of the spec to use when encoding/decoding. Then bet-hedging
>>>    wouldn't be quite as important.
>>>
>>> I think Schema needs to stay decoupled from the encoding. The same
>>> schema can be encoded in various ways (I have a csv encoder/decoder for
>>> example, https://demo.spf4j.org/example/records?_Accept=text/csv ).
>>> I think the right abstraction for what you are looking for is the Media
>>> Type(https://en.wikipedia.org/wiki/Media_type ),
>>> It would be helpful to “standardize” the media types for the avro
>>> encodings:
>>>
>>
>> Yes, on reflection, I agree, even though not every possible medium has a
>> media type. For example, what if we're storing JSON data in a file? I guess
>> it would be up to us to store the type along with the data, as the registry
>> message wire format
>> <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format>
>> does, for example by wrapping the entire value in another JSON object.
>>
>>
>>> Here is what I mean, (with some examples where the same schema is served
>>> with different encodings):
>>>
>>> 1) Binary: “application/avro”
>>> https://demo.spf4j.org/example/records?_Accept=application/avro
>>> 2) Current Json: “application/avro+json"
>>> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
>>> <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>>> 3) New Json: “application/avro-x+json” ?
>>> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
>>> <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>>>
>>
>> ISTM that "x" isn't a hugely descriptive qualifier there. How about
>> "application/avro+json.v2" ? Then it's clear what to do if we want to make
>> another version.
>>
>>
>>
>>> The media type including the avro schema (like you can see in the
>>> response ContentType in the headers above) can provide complete type
>>>  information to be able to read a avro object from a byte stream.
>>>
>>>
>>> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
>>>
>>> In HTTP context this fits well with content negotiation, and a client
>>> can ask for a previous version like:
>>>
>>>
>>> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22
>>> <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22>
>>>
>>>
>>
>>> Note on $ref,  it is an extension to avsc I use to reference schemas
>>> from maven repos. (see
>>> https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences if
>>> interested in more detail)
>>>
>>
>> Interesting stuff. I like the idea of being able to get the server to
>> check the desired client encoding, although I'm somewhat wary of the
>> potential security implications of $ref with arbitrary URLs.
>>
>> Apart from the issues you raised, does my description of the proposed
>> semantics seem reasonable? It could be slightly cleverer and avoid
>> type-name wrapping in more situations, but this seemed like a nice balance
>> between easy-to-explain and idiomatic-in-most-situations.
>>
>>    cheers,
>>      rog.
>>
>>
>>
>

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
answers inline

> On Jan 16, 2020, at 5:51 AM, roger peppe <ro...@gmail.com> wrote:
> 
> On Wed, 15 Jan 2020 at 18:51, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> What I mean with timestamp-micros, is that it is currently restricted to being bound to long,
> I see no reason why it should not be allowed to be bound to string as well. (the change should be simple to implement)
> 
> Wouldn't have the implication of changing the binary representation too, which is not necessarily desirable (it's bulkier, slower to decode and has more potential error cases) ?

yes, it would, but this is how logical types work, and I see no good way to change this.  (this is what I meant by paying the readability cost in places where it is irrelevant)

> 
> 
> regarding the media type, something like: application/avro.2+json would be fine.
> 
> Attaching the ".2" to "avro" rather than "json" seems to be implying a new Avro version, rather than a new JSON-encoding version? Or is the idea that the version number here is implying both the JSON-encoding version and the underlying Avro version?  The MIME standard seems to be silent on this AFAICS.
> 

the reason why I would use +json at the end is because it would be a subtype suffix: https://en.wikipedia.org/wiki/Media_type#Suffix <https://en.wikipedia.org/wiki/Media_type#Suffix> and most browsers will recognize it as json, and potentially format it...

> 
> Other then that the proposal looks good. can you start a PR with the spec update?
> 
> I can do, but I don't hold out much hope of it getting merged. I started a PR with a much more minor change <https://github.com/apache/avro/pull/738> almost 2 months ago and haven't seen any response yet.

Send out an email on the dev mailing list, the committers seem more responsive lately...

> 
>   cheers,
>     rog.
> 
> —Z
> 
>> On Jan 15, 2020, at 12:30 PM, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>> 
>> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
>> See comments in-line below:
>> 
>>> On Jan 15, 2020, at 3:42 AM, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Oops, I left arrays out! Two other thoughts: 
>>> 
>>> I wonder if it might be worth hedging bets about logical types. It would be nice if (for example) a `timestamp-micros` value could be encoded as an RFC3339 string, so perhaps that should be allowed for, but maybe that's a step too far.
>> I think logical types should stay above the encoding/decoding…  
>> With timestamp-micros we could extend it to make it applicable to string and implement the converters, and then in json you would have something readable, but you would then have the same in binary and pay the readability cost there as well.
>> 
>> I'm not sure what you mean there. I wouldn't expect the Avro binary format to be readable at all.
>> 
>> I implemented special handling for decimal logical type in my encoder/decoder, but the best implementation I could do still feels like a hack...
>> 
>>> I wonder if there should be some indication of version so that you know which JSON encoding version you're reading. Perhaps the Avro schema could include a version field (maybe as part of a definition) so you know which version of the spec to use when encoding/decoding. Then bet-hedging wouldn't be quite as important.
>> I think Schema needs to stay decoupled from the encoding. The same schema can be encoded in various ways (I have a csv encoder/decoder for example, https://demo.spf4j.org/example/records?_Accept=text/csv <https://demo.spf4j.org/example/records?_Accept=text/csv> ).
>> I think the right abstraction for what you are looking for is the Media Type(https://en.wikipedia.org/wiki/Media_type <https://en.wikipedia.org/wiki/Media_type> ), 
>> It would be helpful to “standardize” the media types for the avro encodings:
>> 
>> Yes, on reflection, I agree, even though not every possible medium has a media type. For example, what if we're storing JSON data in a file? I guess it would be up to us to store the type along with the data, as the registry message wire format <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format> does, for example by wrapping the entire value in another JSON object.
>>  
>> Here is what I mean, (with some examples where the same schema is served with different encodings):
>> 
>> 1) Binary: “application/avro” https://demo.spf4j.org/example/records?_Accept=application/avro <https://demo.spf4j.org/example/records?_Accept=application/avro>
>> 2) Current Json: “application/avro+json" https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>> 3) New Json: “application/avro-x+json” ?  https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>> 
>> ISTM that "x" isn't a hugely descriptive qualifier there. How about "application/avro+json.v2" ? Then it's clear what to do if we want to make another version.
>> 
>>  
>> The media type including the avro schema (like you can see in the response ContentType in the headers above) can provide complete type  information to be able to read a avro object from a byte stream.
>> 
>> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
>> 
>> In HTTP context this fits well with content negotiation, and a client can ask for a previous version like:
>> 
>> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22 <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22> 
>> 
>> Note on $ref,  it is an extension to avsc I use to reference schemas from maven repos. (see https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences <https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences> if interested in more detail)
>> 
>> Interesting stuff. I like the idea of being able to get the server to check the desired client encoding, although I'm somewhat wary of the potential security implications of $ref with arbitrary URLs.
>> 
>> Apart from the issues you raised, does my description of the proposed semantics seem reasonable? It could be slightly cleverer and avoid type-name wrapping in more situations, but this seemed like a nice balance between easy-to-explain and idiomatic-in-most-situations.
>> 
>>    cheers,
>>      rog.
>> 
> 


Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
On Wed, 15 Jan 2020 at 18:51, Zoltan Farkas <zo...@yahoo.com> wrote:

> What I mean with timestamp-micros, is that it is currently restricted to
> being bound to long,
> I see no reason why it should not be allowed to be bound to string as
> well. (the change should be simple to implement)
>

Wouldn't that have the implication of changing the binary representation too,
which is not necessarily desirable (it's bulkier, slower to decode and has
more potential error cases)?


> regarding the media type, something like: application/avro.2+json would be
> fine.
>

Attaching the ".2" to "avro" rather than "json" seems to be implying a new
Avro version, rather than a new JSON-encoding version? Or is the idea that
the version number here is implying both the JSON-encoding version *and* the
underlying Avro version?  The MIME standard seems to be silent on this
AFAICS.


> Other then that the proposal looks good. can you start a PR with the spec
> update?
>

I can do, but I don't hold out much hope of it getting merged. I started a
PR with a much more minor change <https://github.com/apache/avro/pull/738>
almost 2 months ago and haven't seen any response yet.

  cheers,
    rog.

>
> —Z
>
> On Jan 15, 2020, at 12:30 PM, roger peppe <ro...@gmail.com> wrote:
>
> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas <zo...@yahoo.com> wrote:
>
>> See comments in-line below:
>>
>> On Jan 15, 2020, at 3:42 AM, roger peppe <ro...@gmail.com> wrote:
>>
>> Oops, I left arrays out! Two other thoughts:
>>
>>
>>    - I wonder if it might be worth hedging bets about logical types. It
>>    would be nice if (for example) a `timestamp-micros` value could be encoded
>>    as an RFC3339 string, so perhaps that should be allowed for, but maybe
>>    that's a step too far.
>>
>> I think logical types should stay above the encoding/decoding…
>> With timestamp-micros we could extend it to make it applicable to string
>> and implement the converters, and then in json you would have something
>> readable, but you would then have the same in binary and pay the
>> readability cost there as well.
>>
>
> I'm not sure what you mean there. I wouldn't expect the Avro binary format
> to be readable at all.
>
> I implemented special handling for decimal logical type in my
>> encoder/decoder, but the best implementation I could do still feels like a
>> hack...
>>
>>
>>    - I wonder if there should be some indication of version so that you
>>    know which JSON encoding version you're reading. Perhaps the Avro schema
>>    could include a version field (maybe as part of a definition) so you know
>>    which version of the spec to use when encoding/decoding. Then bet-hedging
>>    wouldn't be quite as important.
>>
>> I think Schema needs to stay decoupled from the encoding. The same schema
>> can be encoded in various ways (I have a csv encoder/decoder for example,
>> https://demo.spf4j.org/example/records?_Accept=text/csv ).
>> I think the right abstraction for what you are looking for is the Media
>> Type(https://en.wikipedia.org/wiki/Media_type ),
>> It would be helpful to “standardize” the media types for the avro
>> encodings:
>>
>
> Yes, on reflection, I agree, even though not every possible medium has a
> media type. For example, what if we're storing JSON data in a file? I guess
> it would be up to us to store the type along with the data, as the registry
> message wire format
> <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format>
> does, for example by wrapping the entire value in another JSON object.
>
>
>> Here is what I mean, (with some examples where the same schema is served
>> with different encodings):
>>
>> 1) Binary: “application/avro”
>> https://demo.spf4j.org/example/records?_Accept=application/avro
>> 2) Current Json: “application/avro+json"
>> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
>> <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>> 3) New Json: “application/avro-x+json” ?
>> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
>> <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>>
>
> ISTM that "x" isn't a hugely descriptive qualifier there. How about
> "application/avro+json.v2" ? Then it's clear what to do if we want to make
> another version.
>
>
>
>> The media type including the avro schema (like you can see in the
>> response ContentType in the headers above) can provide complete type
>>  information to be able to read a avro object from a byte stream.
>>
>>
>> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
>>
>> In HTTP context this fits well with content negotiation, and a client can
>> ask for a previous version like:
>>
>>
>> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22
>> <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22>
>>
>>
>
>> Note on $ref,  it is an extension to avsc I use to reference schemas from
>> maven repos. (see
>> https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences if
>> interested in more detail)
>>
>
> Interesting stuff. I like the idea of being able to get the server to
> check the desired client encoding, although I'm somewhat wary of the
> potential security implications of $ref with arbitrary URLs.
>
> Apart from the issues you raised, does my description of the proposed
> semantics seem reasonable? It could be slightly cleverer and avoid
> type-name wrapping in more situations, but this seemed like a nice balance
> between easy-to-explain and idiomatic-in-most-situations.
>
>    cheers,
>      rog.
>
>
>

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
What I mean by timestamp-micros is that it is currently restricted to being bound to long;
I see no reason why it should not be allowed to be bound to string as well. (The change should be simple to implement.)

regarding the media type, something like: application/avro.2+json would be fine.

Other than that the proposal looks good. Can you start a PR with the spec update?

—Z

> On Jan 15, 2020, at 12:30 PM, roger peppe <ro...@gmail.com> wrote:
> 
> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> See comments in-line below:
> 
>> On Jan 15, 2020, at 3:42 AM, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Oops, I left arrays out! Two other thoughts: 
>> 
>> I wonder if it might be worth hedging bets about logical types. It would be nice if (for example) a `timestamp-micros` value could be encoded as an RFC3339 string, so perhaps that should be allowed for, but maybe that's a step too far.
> I think logical types should stay above the encoding/decoding…  
> With timestamp-micros we could extend it to make it applicable to string and implement the converters, and then in json you would have something readable, but you would then have the same in binary and pay the readability cost there as well.
> 
> I'm not sure what you mean there. I wouldn't expect the Avro binary format to be readable at all.
> 
> I implemented special handling for decimal logical type in my encoder/decoder, but the best implementation I could do still feels like a hack...
> 
>> I wonder if there should be some indication of version so that you know which JSON encoding version you're reading. Perhaps the Avro schema could include a version field (maybe as part of a definition) so you know which version of the spec to use when encoding/decoding. Then bet-hedging wouldn't be quite as important.
> I think Schema needs to stay decoupled from the encoding. The same schema can be encoded in various ways (I have a csv encoder/decoder for example, https://demo.spf4j.org/example/records?_Accept=text/csv <https://demo.spf4j.org/example/records?_Accept=text/csv> ).
> I think the right abstraction for what you are looking for is the Media Type(https://en.wikipedia.org/wiki/Media_type <https://en.wikipedia.org/wiki/Media_type> ), 
> It would be helpful to “standardize” the media types for the avro encodings:
> 
> Yes, on reflection, I agree, even though not every possible medium has a media type. For example, what if we're storing JSON data in a file? I guess it would be up to us to store the type along with the data, as the registry message wire format <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format> does, for example by wrapping the entire value in another JSON object.
>  
> Here is what I mean, (with some examples where the same schema is served with different encodings):
> 
> 1) Binary: “application/avro” https://demo.spf4j.org/example/records?_Accept=application/avro <https://demo.spf4j.org/example/records?_Accept=application/avro>
> 2) Current Json: “application/avro+json" https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
> 3) New Json: “application/avro-x+json” ?  https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
> 
> ISTM that "x" isn't a hugely descriptive qualifier there. How about "application/avro+json.v2" ? Then it's clear what to do if we want to make another version.
> 
>  
> The media type including the avro schema (like you can see in the response ContentType in the headers above) can provide complete type  information to be able to read a avro object from a byte stream.
> 
> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
> 
> In HTTP context this fits well with content negotiation, and a client can ask for a previous version like:
> 
> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22 <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22> 
> 
> Note on $ref,  it is an extension to avsc I use to reference schemas from maven repos. (see https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences <https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences> if interested in more detail)
> 
> Interesting stuff. I like the idea of being able to get the server to check the desired client encoding, although I'm somewhat wary of the potential security implications of $ref with arbitrary URLs.
> 
> Apart from the issues you raised, does my description of the proposed semantics seem reasonable? It could be slightly cleverer and avoid type-name wrapping in more situations, but this seemed like a nice balance between easy-to-explain and idiomatic-in-most-situations.
> 
>    cheers,
>      rog.
> 


Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas <zo...@yahoo.com> wrote:

> See comments in-line below:
>
> On Jan 15, 2020, at 3:42 AM, roger peppe <ro...@gmail.com> wrote:
>
> Oops, I left arrays out! Two other thoughts:
>
>
>    - I wonder if it might be worth hedging bets about logical types. It
>    would be nice if (for example) a `timestamp-micros` value could be encoded
>    as an RFC3339 string, so perhaps that should be allowed for, but maybe
>    that's a step too far.
>
> I think logical types should stay above the encoding/decoding…
> With timestamp-micros we could extend it to make it applicable to string
> and implement the converters, and then in json you would have something
> readable, but you would then have the same in binary and pay the
> readability cost there as well.
>

I'm not sure what you mean there. I wouldn't expect the Avro binary format
to be readable at all.

I implemented special handling for decimal logical type in my
> encoder/decoder, but the best implementation I could do still feels like a
> hack...
>
>
>    - I wonder if there should be some indication of version so that you
>    know which JSON encoding version you're reading. Perhaps the Avro schema
>    could include a version field (maybe as part of a definition) so you know
>    which version of the spec to use when encoding/decoding. Then bet-hedging
>    wouldn't be quite as important.
>
> I think Schema needs to stay decoupled from the encoding. The same schema
> can be encoded in various ways (I have a csv encoder/decoder for example,
> https://demo.spf4j.org/example/records?_Accept=text/csv ).
> I think the right abstraction for what you are looking for is the Media
> Type(https://en.wikipedia.org/wiki/Media_type ),
> It would be helpful to “standardize” the media types for the avro
> encodings:
>

Yes, on reflection, I agree, even though not every possible medium has a
media type. For example, what if we're storing JSON data in a file? I guess
it would be up to us to store the type along with the data, as the registry
message wire format
<https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format>
does, for example by wrapping the entire value in another JSON object.


> Here is what I mean, (with some examples where the same schema is served
> with different encodings):
>
> 1) Binary: “application/avro”
> https://demo.spf4j.org/example/records?_Accept=application/avro
> 2) Current Json: “application/avro+json"
> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
> <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
> 3) New Json: “application/avro-x+json” ?
> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
> <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
>

ISTM that "x" isn't a hugely descriptive qualifier there. How about
"application/avro+json.v2" ? Then it's clear what to do if we want to make
another version.



> The media type including the avro schema (like you can see in the response
> ContentType in the headers above) can provide complete type  information to
> be able to read a avro object from a byte stream.
>
>
> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
>
> In HTTP context this fits well with content negotiation, and a client can
> ask for a previous version like:
>
>
> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22
> <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22>
>
>

> Note on $ref,  it is an extension to avsc I use to reference schemas from
> maven repos. (see
> https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences if
> interested in more detail)
>

Interesting stuff. I like the idea of being able to get the server to check
the desired client encoding, although I'm somewhat wary of the potential
security implications of $ref with arbitrary URLs.

Apart from the issues you raised, does my description of the proposed
semantics seem reasonable? It could be slightly cleverer and avoid
type-name wrapping in more situations, but this seemed like a nice balance
between easy-to-explain and idiomatic-in-most-situations.

   cheers,
     rog.

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
See comments in-line below:

> On Jan 15, 2020, at 3:42 AM, roger peppe <ro...@gmail.com> wrote:
> 
> Oops, I left arrays out! Two other thoughts: 
> 
> I wonder if it might be worth hedging bets about logical types. It would be nice if (for example) a `timestamp-micros` value could be encoded as an RFC3339 string, so perhaps that should be allowed for, but maybe that's a step too far.
I think logical types should stay above the encoding/decoding…  
With timestamp-micros we could extend it to make it applicable to string and implement the converters, and then in json you would have something readable, but you would then have the same in binary and pay the readability cost there as well.
I implemented special handling for decimal logical type in my encoder/decoder, but the best implementation I could do still feels like a hack...

> I wonder if there should be some indication of version so that you know which JSON encoding version you're reading. Perhaps the Avro schema could include a version field (maybe as part of a definition) so you know which version of the spec to use when encoding/decoding. Then bet-hedging wouldn't be quite as important.
I think Schema needs to stay decoupled from the encoding. The same schema can be encoded in various ways (I have a csv encoder/decoder for example, https://demo.spf4j.org/example/records?_Accept=text/csv <https://demo.spf4j.org/example/records?_Accept=text/csv> ).
I think the right abstraction for what you are looking for is the Media Type(https://en.wikipedia.org/wiki/Media_type <https://en.wikipedia.org/wiki/Media_type> ), 
It would be helpful to “standardize” the media types for the avro encodings:

Here is what I mean, (with some examples where the same schema is served with different encodings):

1) Binary: “application/avro” https://demo.spf4j.org/example/records?_Accept=application/avro <https://demo.spf4j.org/example/records?_Accept=application/avro>
2) Current Json: “application/avro+json" https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>
3) New Json: “application/avro-x+json” ?  https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson <https://demo.spf4j.org/example/records?_Accept=application/avro+json>

The media type, including the Avro schema (as you can see in the response Content-Type headers above), can provide complete type information to be able to read an Avro object from a byte stream.

application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”

In HTTP context this fits well with content negotiation, and a client can ask for a previous version like:

https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22 <https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22%7B%5C%22$ref%5C%22:%5C%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b%5C%22%7D%22>

Note on $ref: it is an extension to avsc I use to reference schemas from Maven repos. (see https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences <https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences> if interested in more detail)
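
For instance, a content-negotiation exchange along those lines (purely illustrative; the path and schema reference mirror the demo links above) might look like:

    GET /example/records/1 HTTP/1.1
    Accept: application/avro-x+json;avsc="{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\"}"

    HTTP/1.1 200 OK
    Content-Type: application/avro-x+json;avsc="{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\"}"

    [ ...records encoded as JSON using the requested schema version... ]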

The Google protobuf world does not seem to be in better shape on this front: https://stackoverflow.com/questions/30505408/what-is-the-correct-protobuf-content-type <https://stackoverflow.com/questions/30505408/what-is-the-correct-protobuf-content-type> 

let me know if you have any questions... 


> 
> JSON Encoding 
>  
> Except for unions, the JSON encoding is the same as is used to encode field default values.
> The value of a union is encoded in JSON as follows:
> if all values of the union can be distinguished unambiguously (see below), the JSON encoding is the same as is used to encode field default values for the type
> otherwise it is encoded as a JSON object with one name/value pair whose name is the type's name and whose value is the recursively encoded value. For Avro's named types (record, fixed or enum) the user-specified name is used, for other types the type name is used.
> Unambiguity is defined as follows: 
>  
> An Avro value can be encoded as one of a set of JSON types:
> null encodes as {null}
> boolean encodes as {boolean}
> int encodes as {number}
> long encodes as {number}
> float encodes as {number, string}
> double encodes as {number, string}
> bytes encodes as {string}
> string encodes as {string}
> any enum type encodes as {string}
> any array type encodes as {array}
> any map type encodes as {object}
> any record type encodes as {object}
> A union is considered unambiguous if the JSON type sets for all the members of the union form mutually disjoint sets. 
>  
> Note that float and double are considered ambiguous with respect to string because in the future, Avro might support encoding NaN and infinity values as strings.

LGTM, let's put this in a PR that covers the spec only.

> 
> On Tue, 14 Jan 2020 at 21:57, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
> On Tue, 14 Jan 2020 at 19:26, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> Makes sense, 
> 
> We have to agree on he scope of this implementation.
> 
> Right now the implementation I have in java, handles only the:
> 
> union {null, [some type]} situation.
> 
> Are we ok with this for a start?
> 
> I'm not sure that it's worth publishing a half-way solution, as if people start using it and a fuller solution is implemented, there will be three incompatible standards, which isn't ideal.
> 
> What I see more, is to handle:
> 
> 1) union {string, double}, (although we have to specify behavior for NAN, Positive and negative infinity);  union {string, boolean}; ….
> 
> My thought, as mentioned at the beginning of this thread, is to omit the wrapping when all the members of the union encode to distinct JSON token types (the JSON token types being: null, boolean, string, number, object and array).
> 
> I think that we could probably leave out explicit mention of NaN and infinity, as that's an issue with schemas too, and there's no obviously good solution. That said, if we did want to solve the issue of NaN and infinity in the future, things might get awkward with respect to this thread's proposal, because it's likely that the only reasonable way to solve that issue is to encode NaN and infinity as "NaN" and "±Infinity", which means that the union ["string", "float"] becomes ambiguous if we leave out the type name for that case.
> 
> It seems that it's not unheard-of to a string representation for these float values (see https://issues.apache.org/jira/browse/AVRO-1290 <https://issues.apache.org/jira/browse/AVRO-1290>).
> 
> So perhaps we could define the format something like this:
>  
> JSON Encoding 
>  
> Except for unions, the JSON encoding is the same as is used to encode field default values.
> The value of a union is encoded in JSON as follows:
> if all values of the union can be distinguished unambiguously (see below), the JSON encoding is the same as is used to encode field default values for the type
> otherwise it is encoded as a JSON object with one name/value pair whose name is the type's name and whose value is the recursively encoded value. For Avro's named types (record, fixed or enum) the user-specified name is used, for other types the type name is used.
> Unambiguity is defined as follows: 
>  
> An Avro value can be encoded as one of a set of JSON types:
> null encodes as {null}
> boolean encodes as {boolean}
> int encodes as {number}
> long encodes as {number}
> float encodes as {number, string}
> double encodes as {number, string}
> bytes encodes as {string}
> string encodes as {string}
> any enum encodes as {string}
> any map encodes as {object}
> any record encodes as {object}
> A union is considered unambiguous if the JSON type sets for all the members of the union form mutually disjoint sets. 
>  
> Note that float and double are considered ambiguous with respect to string because in the future, Avro might support encoding NaN and infinity values as strings.
> 
> WDYT?
> 
> 2) Make decimal an avro first class type. Current logical type approach is not natural in JSON. (see https://issues.apache.org/jira/browse/AVRO-2164 <https://issues.apache.org/jira/browse/AVRO-2164>). 
> 
> For 1.9.x    2) is probably a non-starter
> 
> Yes, this sounds a bit out of scope to me. It would be nice if decimal values were represented as a human-readable decimal number (possibly a JSON string to survive round-trips), but that should perhaps be part of a larger change to improve decimal support in general. Interestingly, if we were to be able to represent decimal values as JSON numbers (for example when they're unambiguously representable as such), that would fit fine with the above description, because bytes would be considered ambiguous with respect to float.
> 
>   cheers,
>     rog.


Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
Oops, I left arrays out! Two other thoughts:


   - I wonder if it might be worth hedging bets about logical types. It
   would be nice if (for example) a `timestamp-micros` value could be encoded
   as an RFC3339 string, so perhaps that should be allowed for, but maybe
   that's a step too far.
   - I wonder if there should be some indication of version so that you
   know which JSON encoding version you're reading. Perhaps the Avro schema
   could include a version field (maybe as part of a definition) so you know
   which version of the spec to use when encoding/decoding. Then bet-hedging
   wouldn't be quite as important.



*JSON Encoding*

Except for unions, the JSON encoding is the same as is used to encode field
default values.

The value of a union is encoded in JSON as follows:

   - if all values of the union can be distinguished *unambiguously* (see
   below), the JSON encoding is the same as is used to encode field default
   values for the type
   - otherwise it is encoded as a JSON object with one name/value pair
   whose name is the type's name and whose value is the recursively encoded
   value. For Avro's named types (record, fixed or enum) the user-specified
   name is used, for other types the type name is used.

Unambiguity is defined as follows:

An Avro value can be encoded as one of a set of JSON types:

   - null encodes as {null}
   - boolean encodes as {boolean}
   - int encodes as {number}
   - long encodes as {number}
   - float encodes as {number, string}
   - double encodes as {number, string}
   - bytes encodes as {string}
   - string encodes as {string}
   - any enum type encodes as {string}
   - any array type encodes as {array}
   - any map type encodes as {object}
   - any record type encodes as {object}

A union is considered *unambiguous* if the JSON type sets for all the
members of the union form mutually disjoint sets.

Note that float and double are considered ambiguous with respect to string
because in the future, Avro might support encoding NaN and infinity values
as strings.
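
To make the rule concrete, a few example unions (chosen purely for
illustration):

   - ["null", "string", {"type": "array", "items": "int"}]: the members map to
   the disjoint sets {null}, {string} and {array}, so values encode directly
   as e.g. null, "foo" or [1, 2, 3].
   - ["bytes", "string"]: both members map to {string}, so values keep the
   wrapped form, e.g. {"string": "foo"}.
   - ["string", "float"]: {string} intersects {number, string}, so wrapping is
   also required, e.g. {"float": 1.5}.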

On Tue, 14 Jan 2020 at 21:57, roger peppe <ro...@gmail.com> wrote:

> On Tue, 14 Jan 2020 at 19:26, Zoltan Farkas <zo...@yahoo.com> wrote:
>
>> Makes sense,
>>
>> We have to agree on he scope of this implementation.
>>
>> Right now the implementation I have in java, handles only the:
>>
>> union {null, [some type]} situation.
>>
>> Are we ok with this for a start?
>>
>
> I'm not sure that it's worth publishing a half-way solution, as if people
> start using it and a fuller solution is implemented, there will be three
> incompatible standards, which isn't ideal.
>
>>
>> What I see more, is to handle:
>>
>> 1) union {string, double}, (although we have to specify behavior for NAN,
>> Positive and negative infinity);  union {string, boolean}; ….
>>
>
> My thought, as mentioned at the beginning of this thread, is to omit the
> wrapping when all the members of the union encode to distinct JSON token
> types (the JSON token types being: null, boolean, string, number, object
> and array).
>
> I think that we could probably leave out explicit mention of NaN and
> infinity, as that's an issue with schemas too, and there's no obviously
> good solution. That said, if we *did* want to solve the issue of NaN and
> infinity in the future, things might get awkward with respect to this
> thread's proposal, because it's likely that the only reasonable way to
> solve that issue is to encode NaN and infinity as "NaN" and "±Infinity",
> which means that the union ["string", "float"] becomes ambiguous if we
> leave out the type name for that case.
>
> It seems that it's not unheard-of to a string representation for these
> float values (see https://issues.apache.org/jira/browse/AVRO-1290).
>
> So perhaps we could define the format something like this:
>
>
> *JSON Encoding *
>>
>> Except for unions, the JSON encoding is the same as is used to encode
> field default values.
>
>> The value of a union is encoded in JSON as follows:
>
>>
>    - if all values of the union can be distinguished *unambiguously* (see
>    below), the JSON encoding is the same as is used to encode field default
>    values for the type
>    - otherwise it is encoded as a JSON object with one name/value pair
>    whose name is the type's name and whose value is the recursively encoded
>    value. For Avro's named types (record, fixed or enum) the user-specified
>    name is used, for other types the type name is used.
>
> Unambiguity is defined as follows:
>
>>
>> An Avro value can be encoded as one of a set of JSON types:
>
>>
>    - null encodes as {null}
>    - boolean encodes as {boolean}
>    - int encodes as {number}
>    - long encodes as {number}
>    - float encodes as {number, string}
>    - double encodes as {number, string}
>    - bytes encodes as {string}
>    - string encodes as {string}
>    - any enum encodes as {string}
>    - any map encodes as {object}
>    - any record encodes as {object}
>
> A union is considered *unambiguous* if the JSON type sets for all the
> members of the union form mutually disjoint sets.
>
> Note that float and double are considered ambiguous with respect to string
> because in the future, Avro might support encoding NaN and infinity values
> as strings.
>
> WDYT?
>
> 2) Make decimal an avro first class type. Current logical type approach is
>> not natural in JSON. (see https://issues.apache.org/jira/browse/AVRO-2164
>> ).
>>
>
>> For 1.9.x    2) is probably a non-starter
>>
>
> Yes, this sounds a bit out of scope to me. It would be nice if decimal
> values were represented as a human-readable decimal number (possibly a JSON
> string to survive round-trips), but that should perhaps be part of a larger
> change to improve decimal support in general. Interestingly, if we were to
> be able to represent decimal values as JSON numbers (for example when
> they're unambiguously representable as such), that would fit fine with the
> above description, because bytes would be considered ambiguous with respect
> to float.
>
>   cheers,
>     rog.
>

Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
On Tue, 14 Jan 2020 at 19:26, Zoltan Farkas <zo...@yahoo.com> wrote:

> Makes sense,
>
> We have to agree on he scope of this implementation.
>
> Right now the implementation I have in java, handles only the:
>
> union {null, [some type]} situation.
>
> Are we ok with this for a start?
>

I'm not sure that it's worth publishing a half-way solution, as if people
start using it and a fuller solution is implemented, there will be three
incompatible standards, which isn't ideal.

>
> What I see more, is to handle:
>
> 1) union {string, double}, (although we have to specify behavior for NAN,
> Positive and negative infinity);  union {string, boolean}; ….
>

My thought, as mentioned at the beginning of this thread, is to omit the
wrapping when all the members of the union encode to distinct JSON token
types (the JSON token types being: null, boolean, string, number, object
and array).

I think that we could probably leave out explicit mention of NaN and
infinity, as that's an issue with schemas too, and there's no obviously
good solution. That said, if we *did* want to solve the issue of NaN and
infinity in the future, things might get awkward with respect to this
thread's proposal, because it's likely that the only reasonable way to
solve that issue is to encode NaN and infinity as "NaN" and "±Infinity",
which means that the union ["string", "float"] becomes ambiguous if we
leave out the type name for that case.

It seems that it's not unheard-of to use a string representation for these
float values (see https://issues.apache.org/jira/browse/AVRO-1290).

So perhaps we could define the format something like this:


*JSON Encoding *
>
> Except for unions, the JSON encoding is the same as is used to encode
field default values.

> The value of a union is encoded in JSON as follows:

>
   - if all values of the union can be distinguished *unambiguously* (see
   below), the JSON encoding is the same as is used to encode field default
   values for the type
   - otherwise it is encoded as a JSON object with one name/value pair
   whose name is the type's name and whose value is the recursively encoded
   value. For Avro's named types (record, fixed or enum) the user-specified
   name is used, for other types the type name is used.

Unambiguity is defined as follows:

>
> An Avro value can be encoded as one of a set of JSON types:

>
   - null encodes as {null}
   - boolean encodes as {boolean}
   - int encodes as {number}
   - long encodes as {number}
   - float encodes as {number, string}
   - double encodes as {number, string}
   - bytes encodes as {string}
   - string encodes as {string}
   - any enum encodes as {string}
   - any map encodes as {object}
   - any record encodes as {object}

A union is considered *unambiguous* if the JSON type sets for all the
members of the union form mutually disjoint sets.

Note that float and double are considered ambiguous with respect to string
because in the future, Avro might support encoding NaN and infinity values
as strings.
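
To sanity-check the rule, here is a rough Java sketch of the disjointness
test. The enums below are illustrative stand-ins rather than Avro library
types, and I've assumed arrays map to the array token type even though the
list above doesn't mention them:

    import java.util.EnumSet;
    import java.util.List;
    import java.util.Set;

    // JSON token types and Avro type kinds, as used by the rule above.
    enum JsonType { NULL, BOOLEAN, NUMBER, STRING, OBJECT, ARRAY }
    enum AvroKind { NULL, BOOLEAN, INT, LONG, FLOAT, DOUBLE, BYTES, STRING, ENUM, MAP, RECORD, ARRAY }

    class UnionAmbiguity {
        // The set of JSON token types a value of the given Avro kind may encode to.
        static Set<JsonType> jsonTypes(AvroKind kind) {
            switch (kind) {
                case NULL:    return EnumSet.of(JsonType.NULL);
                case BOOLEAN: return EnumSet.of(JsonType.BOOLEAN);
                case INT:
                case LONG:    return EnumSet.of(JsonType.NUMBER);
                // float/double might also appear as strings ("NaN", "Infinity"),
                // hence the ambiguity with string noted above.
                case FLOAT:
                case DOUBLE:  return EnumSet.of(JsonType.NUMBER, JsonType.STRING);
                case BYTES:
                case STRING:
                case ENUM:    return EnumSet.of(JsonType.STRING);
                case MAP:
                case RECORD:  return EnumSet.of(JsonType.OBJECT);
                case ARRAY:   return EnumSet.of(JsonType.ARRAY);
                default:      throw new IllegalArgumentException(kind.toString());
            }
        }

        // Unambiguous means no JSON token type is shared between two union
        // members, in which case the type-name wrapper can be omitted.
        static boolean isUnambiguous(List<AvroKind> members) {
            EnumSet<JsonType> seen = EnumSet.noneOf(JsonType.class);
            for (AvroKind member : members) {
                for (JsonType t : jsonTypes(member)) {
                    if (!seen.add(t)) {
                        return false;
                    }
                }
            }
            return true;
        }
    }

With this, a union like ["null", "string", "int"] comes out unambiguous, while
["int", "double"] and ["string", "float"] both come out ambiguous and would
keep the wrapped form.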

WDYT?

2) Make decimal an avro first class type. Current logical type approach is
> not natural in JSON. (see https://issues.apache.org/jira/browse/AVRO-2164
> ).
>

> For 1.9.x    2) is probably a non-starter
>

Yes, this sounds a bit out of scope to me. It would be nice if decimal
values were represented as a human-readable decimal number (possibly a JSON
string to survive round-trips), but that should perhaps be part of a larger
change to improve decimal support in general. Interestingly, if we were to
be able to represent decimal values as JSON numbers (for example when
they're unambiguously representable as such), that would fit fine with the
above description, because bytes would be considered ambiguous with respect
to float.

  cheers,
    rog.

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
Makes sense, 

We have to agree on the scope of this implementation.

Right now the implementation I have in Java handles only the:

union {null, [some type]} situation.

Are we ok with this for a start?

What I see more, is to handle:

1) union {string, double} (although we have to specify behavior for NaN, positive and negative infinity); union {string, boolean}; ….

2) Make decimal an avro first class type. Current logical type approach is not natural in JSON. (see https://issues.apache.org/jira/browse/AVRO-2164).

For 1.9.x    2) is probably a non-starter

let me know.

—Z


> On Jan 14, 2020, at 12:09 PM, roger peppe <ro...@gmail.com> wrote:
> 
> 
> On Tue, 14 Jan 2020 at 15:00, Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> I can go ahead and create a PR to add the Encoder/Decoder implementations.
> let me know if anyone else plans to do that. (to avoid wasting time)
> 
> Hi,
> 
> Before you do that, would it be possible to write a specification for exactly what the conventions are and publish it somewhere? There are a bunch of edge cases that could be done in different ways, I think.
> 
> That way people like me that don't use Java can implement the same spec. (and also it's useful to know exactly what one is implementing before diving in and writing the code :])
> 
>   cheers,
>     rog.
> 
> 
> thanks
> 
> —Z
> 
>> On Jan 9, 2020, at 3:51 AM, Driesprong, Fokko <fokko@driesprong.frl <ma...@driesprong.frl>> wrote:
>> 
>> Thanks for chipping in Zoltan and Sean. I did not plan to change the current JSON encoder. My initial suggestion would make this an option that the user can set. The default will be the current situation, so nothing should change when upgrading to a newer version of Avro.
>> 
>> Cheers, Fokko
>> 
>> On Wed, 8 Jan 2020 at 21:39, Sean Busbey <busbey@apache.org <ma...@apache.org>> wrote:
>> I agree with Zoltan here. We have a really long history of maintaining compatibility for encoders.
>> 
>> On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
>> Fokko, 
>> 
>> I am not sure we should be changing the existing json encoder,
>> I think we should just add another encoder, and devs can use either one of them based on their use case… and stay backward compatible.
>> 
>> we should maybe standardize the content types for them… I have seen application/avro being used for binary, we could have for json:
>> application/avro+json for the current format, application/avro.2+json for the new format…. 
>> 
>> At some point in the future we could deprecate the old one…
>> 
>> —Z
>> 
>> 
>>> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko <fokko@driesprong.frl <ma...@driesprong.frl>> wrote:
>>> 
>>> I would be a great fan of this as well. This also bothered me. The tricky part here is to see when to release this because it will break the existing JSON structure. We could make this configurable as well.
>>> 
>>> Cheers, Fokko
>>> 
>>> On Mon, 6 Jan 2020 at 22:36, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>>> That's great, thanks! I thought this would probably have come up before.
>>> 
>>> Have you written down your changes in a somewhat more formal specification document, by any chance?
>>> 
>>>   cheers,
>>>     rog.
>>> 
>>> 
>>> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
>>> I think there is consensus that this should be implemented, see [AVRO-1582] Json serialization of nullable fileds and fields with default values improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
>>> 
>>> [AVRO-1582] Json serialization of nullable fileds and fields with defaul...
>>>  <https://issues.apache.org/jira/browse/AVRO-1582>
>>> 
>>> 
>>> Here is a live example to get some sample data in avro json: https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson <https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson>
>>> and the "Natural" https://demo.spf4j.org/example/records/1?_Accept=application/json <https://demo.spf4j.org/example/records/1?_Accept=application/json> using the encoder suggested as implementation in the jira.
>>> 
>>> Somebody needs to find the time to do the work to integrate this...
>>> 
>>> --Z
>>> 
>>> 
>>> 
>>> 
>>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> The JSON encoding in the specification <https://avro.apache.org/docs/current/spec.html#json_encoding> includes an explicit type name for all kinds of object other than null. This means that a JSON-encoded Avro value with a union is very rarely directly compatible with normal JSON formats.
>>> 
>>> For example, it's very common for a JSON-encoded value to allow a value that's either null or string. In Avro, that's trivially expressed as the union type ["null", "string"]. With conventional JSON, a string value "foo" would be encoded just as "foo", which is easily distinguished from null when decoding. However when using the Avro JSON format it must be encoded as {"string": "foo"}.
>>> 
>>> This means that Avro JSON-encoded values don't interchange easily with other JSON-encoded values.
>>> 
>>> AFAICS the main reason that the type name is always required in JSON-encoded unions is to avoid ambiguity. This particularly applies to record and map types, where it's not possible in general to tell which member of the union has been specified by looking at the data itself.
>>> 
>>> However, that reasoning doesn't apply if all the members of the union can be distinguished from their JSON token type.
>>> 
>>> I am considering using a JSON encoding that omits the type name when all the members of the union encode to distinct JSON token types (the JSON token types being: null, boolean, string, number, object and array).
>>> 
>>> For example, JSON-encoded values using the Avro schema ["null", "string", "int"] would encode as the literal values themselves (e.g. null, "foo", 999), but JSON-encoded values using the Avro schema ["int", "double"] would require the type name because the JSON lexeme doesn't distinguish between different kinds of number.
>>> 
>>> This would mean that it would be possible to represent a significant subset of "normal" JSON schemas with Avro. It seems to me that would potentially be very useful.
>>> 
>>> Thoughts? Is this a really bad idea to be contemplating? :)
>>> 
>>>   cheers,
>>>     rog.
>>> 
>>> 
>> 
> 


Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
On Tue, 14 Jan 2020 at 15:00, Zoltan Farkas <zo...@yahoo.com> wrote:

> I can go ahead and create a PR to add the Encoder/Decoder implementations.
> let me know if anyone else plans to do that. (to avoid wasting time)
>

Hi,

Before you do that, would it be possible to write a specification for
exactly what the conventions are and publish it somewhere? There are a
bunch of edge cases that could be done in different ways, I think.

That way people like me that don't use Java can implement the same spec.
(and also it's useful to know exactly what one is implementing before
diving in and writing the code :])

  cheers,
    rog.


> thanks
>
> —Z
>
> On Jan 9, 2020, at 3:51 AM, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
> Thanks for chipping in Zoltan and Sean. I did not plan to change the
> current JSON encoder. My initial suggestion would make this an option that
> the user can set. The default will be the current situation, so nothing
> should change when upgrading to a newer version of Avro.
>
> Cheers, Fokko
>
> On Wed, 8 Jan 2020 at 21:39, Sean Busbey <bu...@apache.org> wrote:
>
>> I agree with Zoltan here. We have a really long history of maintaining
>> compatibility for encoders.
>>
>> On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas <zo...@yahoo.com>
>> wrote:
>>
>>> Fokko,
>>>
>>> I am not sure we should be changing the existing json encoder,
>>> I think we should just add another encoder, and devs can use either one
>>> of them based on their use case… and stay backward compatible.
>>>
>>> we should maybe standardize the content types for them… I have seen
>>> application/avro being used for binary, we could have for json:
>>> application/avro+json for the current format, application/avro.2+json
>>> for the new format….
>>>
>>> At some point in the future we could deprecate the old one…
>>>
>>> —Z
>>>
>>>
>>> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko <fo...@driesprong.frl>
>>> wrote:
>>>
>>> I would be a great fan of this as well. This also bothered me. The
>>> tricky part here is to see when to release this because it will break the
>>> existing JSON structure. We could make this configurable as well.
>>>
>>> Cheers, Fokko
>>>
>>> On Mon, 6 Jan 2020 at 22:36, roger peppe <ro...@gmail.com> wrote:
>>>
>>>> That's great, thanks! I thought this would probably have come up before.
>>>>
>>>> Have you written down your changes in a somewhat more formal
>>>> specification document, by any chance?
>>>>
>>>>   cheers,
>>>>     rog.
>>>>
>>>>
>>>> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zo...@yahoo.com> wrote:
>>>>
>>>>> I think there is consensus that this should be implemented, see [AVRO-1582]
>>>>> Json serialization of nullable fileds and fields with default values
>>>>> improvement. - ASF JIRA
>>>>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>>>>
>>>>> [AVRO-1582] Json serialization of nullable fileds and fields with
>>>>> defaul...
>>>>>
>>>>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>>>>
>>>>>
>>>>> Here is a live example to get some sample data in avro json:
>>>>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
>>>>> and the "Natural"
>>>>> https://demo.spf4j.org/example/records/1?_Accept=application/json using
>>>>> the encoder suggested as implementation in the jira.
>>>>>
>>>>> Somebody needs to find the time to do the work to integrate this...
>>>>>
>>>>> --Z
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <
>>>>> rogpeppe@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> The JSON encoding in the specification
>>>>> <https://avro.apache.org/docs/current/spec.html#json_encoding> includes
>>>>> an explicit type name for all kinds of object other than null. This means
>>>>> that a JSON-encoded Avro value with a union is very rarely directly
>>>>> compatible with normal JSON formats.
>>>>>
>>>>> For example, it's very common for a JSON-encoded value to allow a
>>>>> value that's either null or string. In Avro, that's trivially expressed as
>>>>> the union type ["null", "string"]. With conventional JSON, a string
>>>>> value "foo" would be encoded just as "foo", which is easily
>>>>> distinguished from null when decoding. However when using the Avro
>>>>> JSON format it must be encoded as {"string": "foo"}.
>>>>>
>>>>> This means that Avro JSON-encoded values don't interchange easily with
>>>>> other JSON-encoded values.
>>>>>
>>>>> AFAICS the main reason that the type name is always required in
>>>>> JSON-encoded unions is to avoid ambiguity. This particularly applies to
>>>>> record and map types, where it's not possible in general to tell which
>>>>> member of the union has been specified by looking at the data itself.
>>>>>
>>>>> However, that reasoning doesn't apply if all the members of the union
>>>>> can be distinguished from their JSON token type.
>>>>>
>>>>> I am considering using a JSON encoding that omits the type name when
>>>>> all the members of the union encode to distinct JSON token types (the JSON
>>>>> token types being: null, boolean, string, number, object and array).
>>>>>
>>>>> For example, JSON-encoded values using the Avro schema ["null",
>>>>> "string", "int"] would encode as the literal values themselves (e.g.
>>>>> null, "foo", 999), but JSON-encoded values using the Avro schema ["int",
>>>>> "double"] would require the type name because the JSON lexeme doesn't
>>>>> distinguish between different kinds of number.
>>>>>
>>>>> This would mean that it would be possible to represent a significant
>>>>> subset of "normal" JSON schemas with Avro. It seems to me that would
>>>>> potentially be very useful.
>>>>>
>>>>> Thoughts? Is this a really bad idea to be contemplating? :)
>>>>>
>>>>>   cheers,
>>>>>     rog.
>>>>>
>>>>>
>>>>>
>>>
>

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
I can go ahead and create a PR to add the Encoder/Decoder implementations.
let me know if anyone else plans to do that. (to avoid wasting time)

thanks

—Z

> On Jan 9, 2020, at 3:51 AM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> 
> Thanks for chipping in Zoltan and Sean. I did not plan to change the current JSON encoder. My initial suggestion would make this an option that the user can set. The default will be the current situation, so nothing should change when upgrading to a newer version of Avro.
> 
> Cheers, Fokko
> 
> On Wed, 8 Jan 2020 at 21:39, Sean Busbey <busbey@apache.org <ma...@apache.org>> wrote:
> I agree with Zoltan here. We have a really long history of maintaining compatibility for encoders.
> 
> On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> Fokko, 
> 
> I am not sure we should be changing the existing json encoder,
> I think we should just add another encoder, and devs can use either one of them based on their use case… and stay backward compatible.
> 
> we should maybe standardize the content types for them… I have seen application/avro being used for binary, we could have for json:
> application/avro+json for the current format, application/avro.2+json for the new format…. 
> 
> At some point in the future we could deprecate the old one…
> 
> —Z
> 
> 
>> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko <fokko@driesprong.frl <ma...@driesprong.frl>> wrote:
>> 
>> I would be a great fan of this as well. This also bothered me. The tricky part here is to see when to release this because it will break the existing JSON structure. We could make this configurable as well.
>> 
>> Cheers, Fokko
>> 
>> On Mon, 6 Jan 2020 at 22:36, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>> That's great, thanks! I thought this would probably have come up before.
>> 
>> Have you written down your changes in a somewhat more formal specification document, by any chance?
>> 
>>   cheers,
>>     rog.
>> 
>> 
>> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
>> I think there is consensus that this should be implemented, see [AVRO-1582] Json serialization of nullable fileds and fields with default values improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
>> 
>> [AVRO-1582] Json serialization of nullable fileds and fields with defaul...
>>  <https://issues.apache.org/jira/browse/AVRO-1582>
>> 
>> 
>> Here is a live example to get some sample data in avro json: https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson <https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson>
>> and the "Natural" https://demo.spf4j.org/example/records/1?_Accept=application/json <https://demo.spf4j.org/example/records/1?_Accept=application/json> using the encoder suggested as implementation in the jira.
>> 
>> Somebody needs to find the time to do the work to integrate this...
>> 
>> --Z
>> 
>> 
>> 
>> 
>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
>> 
>> 
>> Hi,
>> 
>> The JSON encoding in the specification <https://avro.apache.org/docs/current/spec.html#json_encoding> includes an explicit type name for all kinds of object other than null. This means that a JSON-encoded Avro value with a union is very rarely directly compatible with normal JSON formats.
>> 
>> For example, it's very common for a JSON-encoded value to allow a value that's either null or string. In Avro, that's trivially expressed as the union type ["null", "string"]. With conventional JSON, a string value "foo" would be encoded just as "foo", which is easily distinguished from null when decoding. However when using the Avro JSON format it must be encoded as {"string": "foo"}.
>> 
>> This means that Avro JSON-encoded values don't interchange easily with other JSON-encoded values.
>> 
>> AFAICS the main reason that the type name is always required in JSON-encoded unions is to avoid ambiguity. This particularly applies to record and map types, where it's not possible in general to tell which member of the union has been specified by looking at the data itself.
>> 
>> However, that reasoning doesn't apply if all the members of the union can be distinguished from their JSON token type.
>> 
>> I am considering using a JSON encoding that omits the type name when all the members of the union encode to distinct JSON token types (the JSON token types being: null, boolean, string, number, object and array).
>> 
>> For example, JSON-encoded values using the Avro schema ["null", "string", "int"] would encode as the literal values themselves (e.g. null, "foo", 999), but JSON-encoded values using the Avro schema ["int", "double"] would require the type name because the JSON lexeme doesn't distinguish between different kinds of number.
>> 
>> This would mean that it would be possible to represent a significant subset of "normal" JSON schemas with Avro. It seems to me that would potentially be very useful.
>> 
>> Thoughts? Is this a really bad idea to be contemplating? :)
>> 
>>   cheers,
>>     rog.
>> 
>> 
> 


Re: More idiomatic JSON encoding for unions

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thanks for chipping in Zoltan and Sean. I did not plan to change the
current JSON encoder. My initial suggestion would make this an option that
the user can set. The default will be the current situation, so nothing
should change when upgrading to a newer version of Avro.

Cheers, Fokko

On Wed, 8 Jan 2020 at 21:39, Sean Busbey <bu...@apache.org> wrote:

> I agree with Zoltan here. We have a really long history of maintaining
> compatibility for encoders.
>
> On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas <zo...@yahoo.com>
> wrote:
>
>> Fokko,
>>
>> I am not sure we should be changing the existing json encoder,
>> I think we should just add another encoder, and devs can use either one
>> of them based on their use case… and stay backward compatible.
>>
>> we should maybe standardize the content types for them… I have seen
>> application/avro being used for binary, we could have for json:
>> application/avro+json for the current format, application/avro.2+json for
>> the new format….
>>
>> At some point in the future we could deprecate the old one…
>>
>> —Z
>>
>>
>> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>>
>> I would be a great fan of this as well. This also bothered me. The tricky
>> part here is to see when to release this because it will break the existing
>> JSON structure. We could make this configurable as well.
>>
>> Cheers, Fokko
>>
>> On Mon, 6 Jan 2020 at 22:36, roger peppe <ro...@gmail.com> wrote:
>>
>>> That's great, thanks! I thought this would probably have come up before.
>>>
>>> Have you written down your changes in a somewhat more formal
>>> specification document, by any chance?
>>>
>>>   cheers,
>>>     rog.
>>>
>>>
>>> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zo...@yahoo.com> wrote:
>>>
>>>> I think there is consensus that this should be implemented, see [AVRO-1582]
>>>> Json serialization of nullable fileds and fields with default values
>>>> improvement. - ASF JIRA
>>>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>>>
>>>> [AVRO-1582] Json serialization of nullable fileds and fields with
>>>> defaul...
>>>>
>>>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>>>
>>>>
>>>> Here is a live example to get some sample data in avro json:
>>>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
>>>> and the "Natural"
>>>> https://demo.spf4j.org/example/records/1?_Accept=application/json using
>>>> the encoder suggested as implementation in the jira.
>>>>
>>>> Somebody needs to find the time to do the work to integrate this...
>>>>
>>>> --Z
>>>>
>>>>
>>>>
>>>>
>>>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <
>>>> rogpeppe@gmail.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> The JSON encoding in the specification
>>>> <https://avro.apache.org/docs/current/spec.html#json_encoding> includes
>>>> an explicit type name for all kinds of object other than null. This means
>>>> that a JSON-encoded Avro value with a union is very rarely directly
>>>> compatible with normal JSON formats.
>>>>
>>>> For example, it's very common for a JSON-encoded value to allow a value
>>>> that's either null or string. In Avro, that's trivially expressed as the
>>>> union type ["null", "string"]. With conventional JSON, a string value
>>>> "foo" would be encoded just as "foo", which is easily distinguished
>>>> from null when decoding. However when using the Avro JSON format it
>>>> must be encoded as {"string": "foo"}.
>>>>
>>>> This means that Avro JSON-encoded values don't interchange easily with
>>>> other JSON-encoded values.
>>>>
>>>> AFAICS the main reason that the type name is always required in
>>>> JSON-encoded unions is to avoid ambiguity. This particularly applies to
>>>> record and map types, where it's not possible in general to tell which
>>>> member of the union has been specified by looking at the data itself.
>>>>
>>>> However, that reasoning doesn't apply if all the members of the union
>>>> can be distinguished from their JSON token type.
>>>>
>>>> I am considering using a JSON encoding that omits the type name when
>>>> all the members of the union encode to distinct JSON token types (the JSON
>>>> token types being: null, boolean, string, number, object and array).
>>>>
>>>> For example, JSON-encoded values using the Avro schema ["null",
>>>> "string", "int"] would encode as the literal values themselves (e.g.
>>>> null, "foo", 999), but JSON-encoded values using the Avro schema ["int",
>>>> "double"] would require the type name because the JSON lexeme doesn't
>>>> distinguish between different kinds of number.
>>>>
>>>> This would mean that it would be possible to represent a significant
>>>> subset of "normal" JSON schemas with Avro. It seems to me that would
>>>> potentially be very useful.
>>>>
>>>> Thoughts? Is this a really bad idea to be contemplating? :)
>>>>
>>>>   cheers,
>>>>     rog.
>>>>
>>>>
>>>>
>>

Re: More idiomatic JSON encoding for unions

Posted by Sean Busbey <bu...@apache.org>.
I agree with Zoltan here. We have a really long history of maintaining
compatibility for encoders.

On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas <zo...@yahoo.com> wrote:

> Fokko,
>
> I am not sure we should be changing the existing json encoder,
> I think we should just add another encoder, and devs can use either one of
> them based on their use case… and stay backward compatible.
>
> we should maybe standardize the content types for them… I have seen
> application/avro being used for binary, we could have for json:
> application/avro+json for the current format, application/avro.2+json for
> the new format….
>
> At some point in the future we could deprecate the old one…
>
> —Z
>
>
> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
> I would be a great fan of this as well. This also bothered me. The tricky
> part here is to see when to release this because it will break the existing
> JSON structure. We could make this configurable as well.
>
> Cheers, Fokko
>
> On Mon, 6 Jan 2020 at 22:36, roger peppe <ro...@gmail.com> wrote:
>
>> That's great, thanks! I thought this would probably have come up before.
>>
>> Have you written down your changes in a somewhat more formal
>> specification document, by any chance?
>>
>>   cheers,
>>     rog.
>>
>>
>> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zo...@yahoo.com> wrote:
>>
>>> I think there is consensus that this should be implemented, see [AVRO-1582]
>>> Json serialization of nullable fileds and fields with default values
>>> improvement. - ASF JIRA
>>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>>
>>> [AVRO-1582] Json serialization of nullable fileds and fields with
>>> defaul...
>>>
>>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>>
>>>
>>> Here is a live example to get some sample data in avro json:
>>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
>>> and the "Natural"
>>> https://demo.spf4j.org/example/records/1?_Accept=application/json using
>>> the encoder suggested as implementation in the jira.
>>>
>>> Somebody needs to find the time to do the work to integrate this...
>>>
>>> --Z
>>>
>>>
>>>
>>>
>>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <
>>> rogpeppe@gmail.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> The JSON encoding in the specification
>>> <https://avro.apache.org/docs/current/spec.html#json_encoding> includes
>>> an explicit type name for all kinds of object other than null. This means
>>> that a JSON-encoded Avro value with a union is very rarely directly
>>> compatible with normal JSON formats.
>>>
>>> For example, it's very common for a JSON-encoded value to allow a value
>>> that's either null or string. In Avro, that's trivially expressed as the
>>> union type ["null", "string"]. With conventional JSON, a string value
>>> "foo" would be encoded just as "foo", which is easily distinguished
>>> from null when decoding. However when using the Avro JSON format it
>>> must be encoded as {"string": "foo"}.
>>>
>>> This means that Avro JSON-encoded values don't interchange easily with
>>> other JSON-encoded values.
>>>
>>> AFAICS the main reason that the type name is always required in
>>> JSON-encoded unions is to avoid ambiguity. This particularly applies to
>>> record and map types, where it's not possible in general to tell which
>>> member of the union has been specified by looking at the data itself.
>>>
>>> However, that reasoning doesn't apply if all the members of the union
>>> can be distinguished from their JSON token type.
>>>
>>> I am considering using a JSON encoding that omits the type name when all
>>> the members of the union encode to distinct JSON token types (the JSON
>>> token types being: null, boolean, string, number, object and array).
>>>
>>> For example, JSON-encoded values using the Avro schema ["null",
>>> "string", "int"] would encode as the literal values themselves (e.g.
>>> null, "foo", 999), but JSON-encoded values using the Avro schema ["int",
>>> "double"] would require the type name because the JSON lexeme doesn't
>>> distinguish between different kinds of number.
>>>
>>> This would mean that it would be possible to represent a significant
>>> subset of "normal" JSON schemas with Avro. It seems to me that would
>>> potentially be very useful.
>>>
>>> Thoughts? Is this a really bad idea to be contemplating? :)
>>>
>>>   cheers,
>>>     rog.
>>>
>>>
>>>
>

Re: More idiomatic JSON encoding for unions

Posted by Zoltan Farkas <zo...@yahoo.com>.
Fokko, 

I am not sure we should be changing the existing json encoder,
I think we should just add another encoder, and devs can use either one of them based on their use case… and stay backward compatible.

we should maybe standardize the content types for them… I have seen application/avro being used for binary, we could have for json:
application/avro+json for the current format, application/avro.2+json for the new format…. 
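
For example, a client could then pick the encoding by ordinary content
negotiation; here is a rough sketch (the endpoint is hypothetical and the
media type names are just the suggestion above, nothing standardized):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    class NaturalJsonClient {
        public static void main(String[] args) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.org/records/1"))   // hypothetical service
                    .header("Accept", "application/avro.2+json")        // ask for the new encoding
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }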

At some point in the future we could deprecate the old one…

—Z


> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> 
> I would be a great fan of this as well. This also bothered me. The tricky part here is to see when to release this because it will break the existing JSON structure. We could make this configurable as well.
> 
> Cheers, Fokko
> 
> On Mon, 6 Jan 2020 at 22:36, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
> That's great, thanks! I thought this would probably have come up before.
> 
> Have you written down your changes in a somewhat more formal specification document, by any chance?
> 
>   cheers,
>     rog.
> 
> 
> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zolyfarkas@yahoo.com <ma...@yahoo.com>> wrote:
> I think there is consensus that this should be implemented, see [AVRO-1582] Json serialization of nullable fileds and fields with default values improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
> 
> [AVRO-1582] Json serialization of nullable fileds and fields with defaul...
>  <https://issues.apache.org/jira/browse/AVRO-1582>
> 
> 
> Here is a live example to get some sample data in avro json: https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson <https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson>
> and the "Natural" https://demo.spf4j.org/example/records/1?_Accept=application/json <https://demo.spf4j.org/example/records/1?_Accept=application/json> using the encoder suggested as implementation in the jira.
> 
> Somebody needs to find the time to do the work to integrate this...
> 
> --Z
> 
> 
> 
> 
> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <rogpeppe@gmail.com <ma...@gmail.com>> wrote:
> 
> 
> Hi,
> 
> The JSON encoding in the specification <https://avro.apache.org/docs/current/spec.html#json_encoding> includes an explicit type name for all kinds of object other than null. This means that a JSON-encoded Avro value with a union is very rarely directly compatible with normal JSON formats.
> 
> For example, it's very common for a JSON-encoded value to allow a value that's either null or string. In Avro, that's trivially expressed as the union type ["null", "string"]. With conventional JSON, a string value "foo" would be encoded just as "foo", which is easily distinguished from null when decoding. However when using the Avro JSON format it must be encoded as {"string": "foo"}.
> 
> This means that Avro JSON-encoded values don't interchange easily with other JSON-encoded values.
> 
> AFAICS the main reason that the type name is always required in JSON-encoded unions is to avoid ambiguity. This particularly applies to record and map types, where it's not possible in general to tell which member of the union has been specified by looking at the data itself.
> 
> However, that reasoning doesn't apply if all the members of the union can be distinguished from their JSON token type.
> 
> I am considering using a JSON encoding that omits the type name when all the members of the union encode to distinct JSON token types (the JSON token types being: null, boolean, string, number, object and array).
> 
> For example, JSON-encoded values using the Avro schema ["null", "string", "int"] would encode as the literal values themselves (e.g. null, "foo", 999), but JSON-encoded values using the Avro schema ["int", "double"] would require the type name because the JSON lexeme doesn't distinguish between different kinds of number.
> 
> This would mean that it would be possible to represent a significant subset of "normal" JSON schemas with Avro. It seems to me that would potentially be very useful.
> 
> Thoughts? Is this a really bad idea to be contemplating? :)
> 
>   cheers,
>     rog.
> 
> 


Re: More idiomatic JSON encoding for unions

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
I would be a great fan of this as well. This also bothered me. The tricky
part here is to see when to release this because it will break the existing
JSON structure. We could make this configurable as well.

Cheers, Fokko

On Mon, 6 Jan 2020 at 22:36, roger peppe <ro...@gmail.com> wrote:

> That's great, thanks! I thought this would probably have come up before.
>
> Have you written down your changes in a somewhat more formal specification
> document, by any chance?
>
>   cheers,
>     rog.
>
>
> On Mon, 6 Jan 2020, 18:50 zoly farkas, <zo...@yahoo.com> wrote:
>
>> I think there is consensus that this should be implemented, see [AVRO-1582]
>> Json serialization of nullable fileds and fields with default values
>> improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
>>
>> [AVRO-1582] Json serialization of nullable fileds and fields with
>> defaul...
>>
>> <https://issues.apache.org/jira/browse/AVRO-1582>
>>
>>
>> Here is a live example to get some sample data in avro json:
>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
>> and the "Natural"
>> https://demo.spf4j.org/example/records/1?_Accept=application/json using
>> the encoder suggested as implementation in the jira.
>>
>> Somebody needs to find the time to do the work to integrate this...
>>
>> --Z
>>
>>
>>
>>
>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <
>> rogpeppe@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> The JSON encoding in the specification
>> <https://avro.apache.org/docs/current/spec.html#json_encoding> includes
>> an explicit type name for all kinds of object other than null. This means
>> that a JSON-encoded Avro value with a union is very rarely directly
>> compatible with normal JSON formats.
>>
>> For example, it's very common for a JSON-encoded value to allow a value
>> that's either null or string. In Avro, that's trivially expressed as the
>> union type ["null", "string"]. With conventional JSON, a string value
>> "foo" would be encoded just as "foo", which is easily distinguished from
>> null when decoding. However when using the Avro JSON format it must be
>> encoded as {"string": "foo"}.
>>
>> This means that Avro JSON-encoded values don't interchange easily with
>> other JSON-encoded values.
>>
>> AFAICS the main reason that the type name is always required in
>> JSON-encoded unions is to avoid ambiguity. This particularly applies to
>> record and map types, where it's not possible in general to tell which
>> member of the union has been specified by looking at the data itself.
>>
>> However, that reasoning doesn't apply if all the members of the union can
>> be distinguished from their JSON token type.
>>
>> I am considering using a JSON encoding that omits the type name when all
>> the members of the union encode to distinct JSON token types (the JSON
>> token types being: null, boolean, string, number, object and array).
>>
>> For example, JSON-encoded values using the Avro schema ["null",
>> "string", "int"] would encode as the literal values themselves (e.g. null,
>> "foo", 999), but JSON-encoded values using the Avro schema ["int",
>> "double"] would require the type name because the JSON lexeme doesn't
>> distinguish between different kinds of number.
>>
>> This would mean that it would be possible to represent a significant
>> subset of "normal" JSON schemas with Avro. It seems to me that would
>> potentially be very useful.
>>
>> Thoughts? Is this a really bad idea to be contemplating? :)
>>
>>   cheers,
>>     rog.
>>
>>
>>

Re: More idiomatic JSON encoding for unions

Posted by roger peppe <ro...@gmail.com>.
That's great, thanks! I thought this would probably have come up before.

Have you written down your changes in a somewhat more formal specification
document, by any chance?

  cheers,
    rog.


On Mon, 6 Jan 2020, 18:50 zoly farkas, <zo...@yahoo.com> wrote:

> I think there is consensus that this should be implemented, see [AVRO-1582]
> Json serialization of nullable fileds and fields with default values
> improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
>
> [AVRO-1582] Json serialization of nullable fileds and fields with defaul...
>
> <https://issues.apache.org/jira/browse/AVRO-1582>
>
>
> Here is a live example to get some sample data in avro json:
> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
> and the "Natural"
> https://demo.spf4j.org/example/records/1?_Accept=application/json using
> the encoder suggested as implementation in the jira.
>
> Somebody needs to find the time to do the work to integrate this...
>
> --Z
>
>
>
>
> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <
> rogpeppe@gmail.com> wrote:
>
>
> Hi,
>
> The JSON encoding in the specification
> <https://avro.apache.org/docs/current/spec.html#json_encoding> includes
> an explicit type name for all kinds of object other than null. This means
> that a JSON-encoded Avro value with a union is very rarely directly
> compatible with normal JSON formats.
>
> For example, it's very common for a JSON-encoded value to allow a value
> that's either null or string. In Avro, that's trivially expressed as the
> union type ["null", "string"]. With conventional JSON, a string value
> "foo" would be encoded just as "foo", which is easily distinguished from
> null when decoding. However when using the Avro JSON format it must be
> encoded as {"string": "foo"}.
>
> This means that Avro JSON-encoded values don't interchange easily with
> other JSON-encoded values.
>
> AFAICS the main reason that the type name is always required in
> JSON-encoded unions is to avoid ambiguity. This particularly applies to
> record and map types, where it's not possible in general to tell which
> member of the union has been specified by looking at the data itself.
>
> However, that reasoning doesn't apply if all the members of the union can
> be distinguished from their JSON token type.
>
> I am considering using a JSON encoding that omits the type name when all
> the members of the union encode to distinct JSON token types (the JSON
> token types being: null, boolean, string, number, object and array).
>
> For example, JSON-encoded values using the Avro schema ["null", "string",
> "int"] would encode as the literal values themselves (e.g. null, "foo",
> 999), but JSON-encoded values using the Avro schema ["int", "double"]
> would require the type name because the JSON lexeme doesn't distinguish
> between different kinds of number.
>
> This would mean that it would be possible to represent a significant
> subset of "normal" JSON schemas with Avro. It seems to me that would
> potentially be very useful.
>
> Thoughts? Is this a really bad idea to be contemplating? :)
>
>   cheers,
>     rog.
>
>
>

Re: More idiomatic JSON encoding for unions

Posted by zoly farkas <zo...@yahoo.com>.
 I think there is consensus that this should be implemented, see [AVRO-1582] Json serialization of nullable fileds and fields with default values improvement. - ASF JIRA



Here is a live example to get some sample data in avro json: https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson
and the "Natural" https://demo.spf4j.org/example/records/1?_Accept=application/json using the encoder suggested as implementation in the jira.
Somebody needs to find the time to do the work to integrate this...
--Z



On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe <ro...@gmail.com> wrote:

Hi,

The JSON encoding in the specification includes an explicit type name for all kinds of object other than null. This means that a JSON-encoded Avro value with a union is very rarely directly compatible with normal JSON formats.
For example, it's very common for a JSON-encoded value to allow a value that's either null or string. In Avro, that's trivially expressed as the union type ["null", "string"]. With conventional JSON, a string value "foo" would be encoded just as "foo", which is easily distinguished from null when decoding. However when using the Avro JSON format it must be encoded as {"string": "foo"}.
This means that Avro JSON-encoded values don't interchange easily with other JSON-encoded values.
AFAICS the main reason that the type name is always required in JSON-encoded unions is to avoid ambiguity. This particularly applies to record and map types, where it's not possible in general to tell which member of the union has been specified by looking at the data itself.
However, that reasoning doesn't apply if all the members of the union can be distinguished from their JSON token type.
I am considering using a JSON encoding that omits the type name when all the members of the union encode to distinct JSON token types (the JSON token types being: null, boolean, string, number, object and array).
For example, JSON-encoded values using the Avro schema ["null", "string", "int"] would encode as the literal values themselves (e.g. null, "foo", 999), but JSON-encoded values using the Avro schema ["int", "double"] would require the type name because the JSON lexeme doesn't distinguish between different kinds of number.
This would mean that it would be possible to represent a significant subset of "normal" JSON schemas with Avro. It seems to me that would potentially be very useful.
Thoughts? Is this a really bad idea to be contemplating? :)
  cheers,
    rog.