Posted to dev@avro.apache.org by Zoltan Ivanfi <zi...@cloudera.com> on 2017/10/17 15:16:30 UTC

(Default) values for logical types in human-readable form

Hi,

I would like to start a discussion about making default values and values
in general human-readable for logical types.

Currently default values for logical types have to be specified in a JSON
string as the binary representation of the backing primary type (e.g.,
"\u0000"). Some users intuitively try to specify a human-readable logical
value in this string instead (e.g., "0.00"). This is of course a valid byte
sequence and as such is accepted, but it results in unexpected behaviour (a
different default value than intended). Apart from being error prone,
specifying default values this way is also tedious. To keep this e-mail
brief, I won't list specific examples here; please see AVRO-2087
<https://issues.apache.org/jira/browse/AVRO-2087> for details.
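
A minimal sketch of the pitfall described above, assuming a decimal with scale 2 whose default string is mapped to bytes one code point per byte (the mapping the Avro specification uses for bytes and fixed defaults). The helper name decode_decimal_default is hypothetical and not part of any Avro library.

```python
from decimal import Decimal


def decode_decimal_default(default: str, scale: int) -> Decimal:
    """Interpret a schema default string as a decimal logical value.

    Hypothetical helper: each code point (0-255) in the JSON string is taken
    as one byte, and the bytes are read as a big-endian two's-complement
    unscaled integer.
    """
    raw = default.encode("latin-1")
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)


intended = decode_decimal_default("\u0000", scale=2)
accidental = decode_decimal_default("0.00", scale=2)
print(intended)    # the single zero byte decodes to zero
print(accidental)  # the four ASCII bytes of "0.00" decode to a large number
```

Both strings are accepted because both are valid byte sequences; only the first one means zero.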

The problem of non-human-readable values applies to JSON encoding of actual
data as well. One reason for using JSON is that it is human readable and
therefore easy to debug. Seeing "\u00018" in a JSON file is not too
intuitive and this specific example is actually quite misleading as well
(it can be easily misread as "\u0018").
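
To make the misreading concrete, here is a small sketch using only Python's standard json module. It shows that the escape takes exactly four hex digits, so the trailing "8" is a separate character.

```python
import json

# The JSON text as it would appear in a file: backslash, "u0001", then "8".
encoded = r'"\u00018"'
value = json.loads(encoded)

# The escape consumes exactly four hex digits, so the string holds two
# characters: U+0001 followed by the digit "8" (U+0038).
code_points = [ord(ch) for ch in value]
print(code_points)

# What a hurried reader might see instead: a single character U+0018.
misread = json.loads(r'"\u0018"')
print([ord(ch) for ch in misread])
```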

Introducing a new default value field (called human-readable-default or
logical-default for example) would allow easier specification of default
values. (It doesn't solve the problem of accidentally misusing the existing
field though.) It is, however, not backwards compatible. An older Avro
library would ignore the new field and use a different default value.

Introducing human-readable values in the JSON encoding is even more clearly
a breaking change. (Although for JSON we could add the human-readable value
as a separate extra field that gets ignored when reading. Problem is, users
may be tempted to change the value and be surprised. It's a pity that JSON
does not allow comments.)

In your opinions, what would be the best way to deal with this problem?

Thanks,

Zoltan

Re: (Default) values for logical types in human-readable form

Posted by Bridger Howell <bh...@sofi.org>.

>
> Sorry, I can see how that was confusing. I would use the logical type to
> determine the transform, but wouldn't require the user to have configured a
> conversion for the type, which is optional. Basically, I'm saying that this
> feature would support decimal, date, time, and timestamp from string.

I think I confused things a bit by suggesting that we shouldn't base this
default on logical type.

My initial motivation for this was primarily based on the idea of
representing bytes in base64. I wouldn't really think of "base64" as a
logical type, since it isn't really a "higher-level" encoding of bytes.
It's more convenient for humans, but I don't think I'd really want to be
manipulating base64 strings in my code very often, and definitely not just
because I wanted to write the default value in base64.

My other additional reasoning is that there might be more than one distinct
way of representing a default as a string for a given type. If we go back
to the bytes example, I think both hex and base64 are reasonable
human-readable encodings for bytes.
I don't have any good examples of this for logical types, though.
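
As an illustration of the hex versus base64 point, here is a rough sketch in which a parser name selects how the string is decoded. The PARSERS registry, the parse_bytes_default function and the {"value": ..., "parser": ...} shape are assumptions made for this example, not part of the Avro specification.

```python
import base64


def _from_base64(text: str) -> bytes:
    # Tolerate missing "=" padding, which hand-written values often omit.
    padded = text + "=" * (-len(text) % 4)
    return base64.b64decode(padded)


# Hypothetical registry mapping a "parser" name to a decoding function.
PARSERS = {
    "base64": _from_base64,
    "hex": bytes.fromhex,
}


def parse_bytes_default(default: dict) -> bytes:
    """Decode a human-readable default into the raw bytes it stands for."""
    parser = PARSERS[default["parser"]]
    return parser(default["value"])


from_base64 = parse_bytes_default({"value": "aGVsbG8gd29ybGQ", "parser": "base64"})
from_hex = parse_bytes_default({"value": "68656c6c6f20776f726c64", "parser": "hex"})
print(from_base64 == from_hex)
```

Both spellings decode to the same bytes, which is why a single fixed string form per type may not be enough.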

> We can also extend IDL to use an annotation or something that bakes
> down to "string-default" in the schema. I'm not very familiar with the
> IDL, though, so I can't say exactly what we would need to do here.

We could use the existing lightly-documented feature of IDL that it maps
annotations to fields.

So a field declared like:
timestamp_ms @strdefault("2000-01-01T00:00:00.000Z") createdDt;

(Just tested this) Is converted to a schema field like:
{"name":"createdDt","type":{"type":"long","logicalType":"timestamp-millis"},"strdefault":"2000-01-01T00:00:00.000Z"}
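
A sketch of what a schema parser might do with such a field, assuming the string is always UTC with millisecond precision and a trailing "Z". The function name timestamp_millis_from_string is made up for this example; "strdefault" is just the annotation name used in the test above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_millis_from_string(text: str) -> int:
    """Convert an ISO-8601 UTC timestamp such as 2000-01-01T00:00:00.000Z
    to milliseconds since the Unix epoch (the timestamp-millis encoding)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


field = {
    "name": "createdDt",
    "type": {"type": "long", "logicalType": "timestamp-millis"},
    "strdefault": "2000-01-01T00:00:00.000Z",
}
default = timestamp_millis_from_string(field["strdefault"])
print(default)
```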

There would still be some work needed on the IDL side if we went with a
field named "string-default" or "default-as-string", since IDL treats
"string" as a keyword regardless of the context.

Alternatively, having some special syntax for human-readable defaults might
be okay, but I don't have any great suggestions.

On Fri, Oct 20, 2017 at 11:08 AM, Doug Cutting <cu...@gmail.com> wrote:

> On Fri, Oct 20, 2017 at 12:13 AM, Bridger Howell <bh...@sofi.org> wrote:
>
> > But then I would end up with a few bits of ugliness:
> > - the schemas have to be wrapped in a made-up protocol
> > - the generated code includes a generated protocol class I wouldn't care
> > about
> >
>
> Wouldn't these be eliminated if we simply permitted 'record', 'enum',
> 'union' and 'fixed' as top-level items in the IDL syntax?  That would not
> be a difficult change.
>
>
Correct, that would be a relatively easy change and that should also
address my third concern (since all the weird protocol stuff is hidden
behind protocol blocks).

I assume then we'd want to change the output of the IDL parser to become a
mixture of schemas and protocols instead of just one protocol (since that
would be equivalent to unwrapping a protocol)?
I'm not sure if with the new SchemaResolver logic we can deal with IDL
schemas living in different files.

- Bridger Howell

-- 

The information contained in this email message is PRIVATE and intended 
only for the personal and confidential use of the recipient named above. If 
the reader of this message is not the intended recipient or an agent 
responsible for delivering it to the intended recipient, you are hereby 
notified that you have received this message in error and that any review, 
dissemination, distribution or copying of this message is strictly 
prohibited.  If you have received this communication in error, please 
notify us immediately by email, and delete the original message.

Re: (Default) values for logical types in human-readable form

Posted by Doug Cutting <cu...@gmail.com>.

On Fri, Oct 20, 2017 at 12:13 AM, Bridger Howell <bh...@sofi.org> wrote:

> But then I would end up with a few bits of ugliness:
> - the schemas have to be wrapped in a made-up protocol
> - the generated code includes a generated protocol class I wouldn't care
> about
>

Wouldn't these be eliminated if we simply permitted 'record', 'enum',
'union' and 'fixed' as top-level items in the IDL syntax?  That would not
be a difficult change.

Doug

Re: (Default) values for logical types in human-readable form

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

> So if I understand correctly, you support the idea of human-readable
> defaults but not logical-type-dependent interpretation. I don't see how we
> could achieve the first without the second, since different logical types
> have different human-readable representations. So it seems that the
> optional nature of logical types actually makes this feature impossible.

Sorry, I can see how that was confusing. I would use the logical type to
determine the transform, but wouldn't require the user to have configured a
conversion for the type, which is optional. Basically, I'm saying that this
feature would support decimal, date, time, and timestamp from string.

The conversion from string would happen when parsing a schema and would set
the "default" field from another string-based field, if "default" doesn't
already exist. That way, we always have the "default" that all readers use.

For example, instead of
defining {"name":"d","type":"fixed","logicalType":"decimal",...,"default":"\u000C\u006C"},
you would use {"name":"d","type":"fixed",...,"string-default":"31.80"} that
gets translated when parsed to the one with a "default" field. I think that
would work in all cases.
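
A minimal sketch of that parse-time translation, under a few stated assumptions: the field is a plain dict, the backing type is bytes rather than fixed (so no size is needed), and only decimal is handled. The names encode_decimal_default and fill_default are hypothetical, and "string-default" is only the key name proposed in this thread.

```python
from decimal import Decimal


def encode_decimal_default(text: str, scale: int) -> str:
    """Turn "31.80" into the string form of its two's-complement bytes."""
    scaled = Decimal(text).scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{text!r} has more than {scale} fractional digits")
    unscaled = int(scaled)
    # One extra bit for the sign, rounded up to whole bytes.
    length = (unscaled.bit_length() + 8) // 8
    raw = unscaled.to_bytes(length, byteorder="big", signed=True)
    # Each byte becomes one code point in the range 0-255.
    return raw.decode("latin-1")


def fill_default(field: dict) -> dict:
    """Set "default" from "string-default" unless "default" already exists."""
    if "default" in field or "string-default" not in field:
        return field
    field_type = field["type"]
    if not isinstance(field_type, dict) or field_type.get("logicalType") != "decimal":
        return field  # no known conversion; leave the field untouched
    filled = dict(field)
    filled["default"] = encode_decimal_default(
        field["string-default"], field_type["scale"]
    )
    return filled


field = {
    "name": "d",
    "type": {"type": "bytes", "logicalType": "decimal", "precision": 4, "scale": 2},
    "string-default": "31.80",
}
parsed = fill_default(field)
print(repr(parsed["default"]))
```

Because the conversion runs once in the parser, everything downstream still sees an ordinary "default".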

On the subject of AVDL, I think we clearly have a case where people are
editing schemas directly so it makes sense to support this. We can also
extend IDL to use an annotation or something that bakes down to
"string-default" in the schema. I'm not very familiar with the IDL, though,
so I can't say exactly what we would need to do here.

rb

On Fri, Oct 20, 2017 at 6:08 AM, Bridger Howell <bh...@sofi.org> wrote:

> On Fri, Oct 20, 2017 at 2:04 AM, Frédéric SOUCHU <
> Frederic.SOUCHU@ingenico.com> wrote:
>
> > In line with Philip Zeyliger on IDL being a good tool for a human to
> > produce schema.
> > Key features (IMHO):
> > - support for includes (killer feature)
> > - simpler syntax (a *lot* less '{' and '['...)
> > - simpler comments syntax
> >
>
> I'm afraid you're missing my point. I'm not arguing that IDL isn't a "good"
> tool for producing schemas. I'm arguing that I don't think, as it is, it
> should be the preferred tool for writing schemas.
>
> Implicitly, I'm also extending that to mean we shouldn't currently prefer
> to give new features like this only to IDL, unless we want to make the
> process for using IDL for schemas cleaner and simpler. If we're willing to
> do that, then I have no issues.
>
> > I have a toolchain going from IDL to Java + C# classes that wouldn't work
> > using JSON schema (the many holes in the AVRO C# side not helping
> > either..).
> >
>
> Cool. I helped build something similar and I work with others who use it
> regularly.
>
>
> >  (btw, how did we end up with different json/IDL logical names?!?!)
> >
>
> I assume you're referring to the keywords like  "timestamp_ms" and "date"
> added to IDL to refer to the "timestamp-millis" and "date" logical types?
>
> These are special keywords, not a general mechanism that produces schemas
> with the given logical type so there's no particular reason that they have
> to match the logical type that they implement (although it does seem
> inconsistent). I looked around AVRO-1684 (
> https://issues.apache.org/jira/browse/AVRO-1684) where this was
> implemented
> for some justification, but I didn't find anything.
>
> > Suggest to have an 'encoding' attribute to indicate how the default value
> > is defined.
> > {
> >   "type": "bytes",
> >   "logicalType": "decimal",
> >   "precision": 4,
> >   "scale": 2,
> >   "default":"3.151351351",
> >   "default-encoding":"string" // default encoding being 'AVRO' default (e.g. binary)
> > }
> >
>
> This has the same problem as one of my earlier suggestions; if you change
> the meaning of the default field, then older readers will read the schema
> with an incorrect default value.
>
> - Bridger Howell
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: (Default) values for logical types in human-readable form

Posted by Bridger Howell <bh...@sofi.org>.

On Fri, Oct 20, 2017 at 2:04 AM, Frédéric SOUCHU <
Frederic.SOUCHU@ingenico.com> wrote:

> In line with Philip Zeyliger on IDL being a good tool for a human to
> produce schema.
> Key features (IMHO):
> - support for includes (killer feature)
> - simpler syntax (a *lot* less '{' and '['...)
> - simpler comments syntax
>

I'm afraid you're missing my point. I'm not arguing that IDL isn't a "good"
tool for producing schemas. I'm arguing that I don't think, as it is, it
should be the preferred tool for writing schemas.

Implicitly, I'm also extending that to mean we shouldn't currently prefer
to give new features like this only to IDL, unless we want to make the
process for using IDL for schemas cleaner and simpler. If we're willing to
do that, then I have no issues.

> I have a toolchain going from IDL to Java + C# classes that wouldn't work
> using JSON schema (the many holes in the AVRO C# side not helping either..).
>

Cool. I helped build something similar and I work with others who use it
regularly.

>  (btw, how did we end up with different json/IDL logical names?!?!)
>

I assume you're referring to the keywords like  "timestamp_ms" and "date"
added to IDL to refer to the "timestamp-millis" and "date" logical types?

These are special keywords, not a general mechanism that produces schemas
with the given logical type so there's no particular reason that they have
to match the logical type that they implement (although it does seem
inconsistent). I looked around AVRO-1684 (
https://issues.apache.org/jira/browse/AVRO-1684) where this was implemented
for some justification, but I didn't find anything.

> Suggest to have an 'encoding' attribute to indicate how the default value
> is defined.
> {
>   "type": "bytes",
>   "logicalType": "decimal",
>   "precision": 4,
>   "scale": 2,
>   "default":"3.151351351",
>   "default-encoding":"string" // default encoding being 'AVRO' default (e.g. binary)
> }
>

This has the same problem as one of my earlier suggestions; if you change
the meaning of the default field, then older readers will read the schema
with an incorrect default value.

- Bridger Howell


RE: (Default) values for logical types in human-readable form

Posted by Frédéric SOUCHU <Fr...@ingenico.com>.

In line with Philip Zeyliger on IDL being a good tool for a human to produce schema.
Key features (IMHO):
- support for includes (killer feature)
- simpler syntax (a *lot* less '{' and '['...)
- simpler comments syntax
I have a toolchain going from IDL to Java + C# classes that wouldn't work using JSON schema (the many holes in the AVRO C# side not helping either..).

For the current discussion, all logical types have an equivalent, non-ambiguous representation as far as I can tell:
- decimal: string representation ("3.14159265358979323846264338327950288419"). The decimal definition provides the necessary information to decode it.
- date: "2017-10-05" (using the C-locale yyyy-MM-dd format)
- time_xx: either long value or text format as defined in the specs (hh:mm:ss)
- date: number or text format
 (btw, how did we end up with different json/IDL logical names?!?!)
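
A short sketch of two of the text-to-value conversions listed above, assuming zero-padded ISO-style input. The function names are invented for this example; the target encodings (days since the epoch for date, milliseconds after midnight for time-millis) follow the logical type definitions.

```python
from datetime import date, time


def date_days_from_string(text: str) -> int:
    """yyyy-MM-dd to days since 1970-01-01 (the "date" logical type)."""
    return (date.fromisoformat(text) - date(1970, 1, 1)).days


def time_millis_from_string(text: str) -> int:
    """hh:mm:ss to milliseconds after midnight (the "time-millis" logical type)."""
    parsed = time.fromisoformat(text)
    seconds = parsed.hour * 3600 + parsed.minute * 60 + parsed.second
    return seconds * 1000 + parsed.microsecond // 1000


print(date_days_from_string("2017-10-05"))
print(time_millis_from_string("12:30:00"))
```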

Looks like all logical types have a 'better' human representation as a text value. Suggest to have an 'encoding' attribute to indicate how the default value is defined.
{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2,
  "default":"3.151351351",
  "default-encoding":"string" // default encoding being 'AVRO' default (e.g. binary)
}

Frederic Souchu

-----Original Message-----
From: Bridger Howell [mailto:bhowell@sofi.org]
Sent: vendredi 20 octobre 2017 09:13
To: dev@avro.apache.org
Subject: Re: (Default) values for logical types in human-readable form

On Thu, Oct 19, 2017 at 9:17 PM, Philip Zeyliger <ph...@cloudera.com>
wrote:

> I'm shaky on the details here, but shouldn't humans be using the
> *.avdl form of specifying schemas?

Maybe.

As it is, I've seen a good number of open projects that rely on JSON schemas (.avsc files).

IDL is really more tailored towards being a convenience for writing protocols. Schemas come as a nice bonus, but the tools don't really make schemas a first-class use case.

For example, suppose that I want to use IDL to write schemas and then generate specific classes for each schema, I could bring in the maven plugin (or any number of community plugins for other tools) that reads IDL and then writes out the generated code.

But then I would end up with a few bits of ugliness:
- the schemas have to be wrapped in a made-up protocol
- the generated code includes a generated protocol class I wouldn't care about
- there are language features that are completely unrelated to my use of schemas - I wouldn't care about errors or messages at all

This process _is_ sufficient for writing schemas, but I think the unneeded inputs and outputs and unrelated functionality really contribute to a sense that IDL isn't for writing schemas. If there was a more focused subset of IDL that mapped directly onto schemas without some of the extra baggage, I think it would be easier to recommend that as a high-level schema language.

- Bridger Howell

This email and its content belong to Ingenico Group. The enclosed information is confidential and may not be disclosed to any unauthorized person. If you have received it by mistake do not forward it and delete it from your system. Cet email et son contenu sont la propriété du Groupe Ingenico. L’information qu’il contient est confidentielle et ne peut être communiquée à des personnes non autorisées. Si vous l’avez reçu par erreur ne le transférez pas et supprimez-le.

Re: (Default) values for logical types in human-readable form

Posted by Bridger Howell <bh...@sofi.org>.

On Thu, Oct 19, 2017 at 9:17 PM, Philip Zeyliger <ph...@cloudera.com>
wrote:

> I'm shaky on the details here, but shouldn't humans be using the *.avdl
> form of specifying schemas?

Maybe.

As it is, I've seen a good number of open projects that rely on JSON
schemas (.avsc files).

IDL is really more tailored towards being a convenience for writing
protocols. Schemas come as a nice bonus, but the tools don't really make
schemas a first-class use case.

For example, suppose that I want to use IDL to write schemas and then
generate specific classes for each schema, I could bring in the maven
plugin (or any number of community plugins for other tools) that reads IDL
and then writes out the generated code.

But then I would end up with a few bits of ugliness:
- the schemas have to be wrapped in a made-up protocol
- the generated code includes a generated protocol class I wouldn't care
about
- there are language features that are completely unrelated to my use of
schemas - I wouldn't care about errors or messages at all

This process _is_ sufficient for writing schemas, but I think the unneeded
inputs and outputs and unrelated functionality really contribute to a sense
that IDL isn't for writing schemas. If there was a more focused subset of
IDL that mapped directly onto schemas without some of the extra baggage, I
think it would be easier to recommend that as a high-level schema language.

- Bridger Howell


Re: (Default) values for logical types in human-readable form

Posted by Philip Zeyliger <ph...@cloudera.com>.

I'm shaky on the details here, but shouldn't humans be using the *.avdl
form of specifying schemas?

On Thu, Oct 19, 2017 at 9:18 AM, Doug Cutting <cu...@gmail.com> wrote:

> On Thu, Oct 19, 2017 at 8:49 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>
> > > So then if an older reader reads a schema field with "default-as-string"
> > > used instead of "default", it will decide that field has no default? I
> > > don't really like that, but it's better than using the wrong value (e.g.
> > > "default" + "default-parser")
> >
> >
> > I think ignoring the user-specified default value is just as bad as using a
> > wrong value. I consider both equally breaking changes.
> >
>
> Since defaults are only used for fields not present in the written data,
> ignoring a default value means failing to read the data.  This seems
> reasonable: if the user requires a feature that the runtime they're using
> does not yet support, then an error is signalled.
>
>
> > > > > I think that the parsing canonical form of a schema
> > > > > <https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas>
> > > > > doesn't include the default. I think that makes sense because the
> > > > > canonical form is what's needed to read encoded data.
> >
> > That's strange, since according to the specification, the default is used
> > when reading instances that lack a value for the field, so I think it is
> > needed for reading encoded data.
> >
>
> That depends on what you mean by "reading".  A record is first read using
> the schema it was written with.  Through resolution, it can be subsequently
> altered to match various other schemas.  Defaults only come into play when
> such a schema has a field not in the written schema.
>
> Parsing Canonical Form indicates whether a schema can be used for that
> first, raw, read.  There is no single canonical form for all the various
> schemas that it can be resolved to.
>
> Doug
>

Re: (Default) values for logical types in human-readable form

Posted by Doug Cutting <cu...@gmail.com>.

On Thu, Oct 19, 2017 at 8:49 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> > So then if an older reader reads a schema field with "default-as-string"
> > used instead of "default", it will decide that field has no default? I
> > don't really like that, but it's better than using the wrong value (e.g.
> > "default" + "default-parser")
>
>
> I think ignoring the user-specified default value is just as bad as using a
> wrong value. I consider both equally breaking changes.
>

Since defaults are only used for fields not present in the written data,
ignoring a default value means failing to read the data.  This seems
reasonable: if the user requires a feature that the runtime they're using
does not yet support, then an error is signalled.

> > > > I think that the parsing canonical form of a schema
> > > > <https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas>
> > > > doesn't include the default. I think that makes sense because the
> > > > canonical form is what's needed to read encoded data.
>
> That's strange, since according to the specification, the default is used
> when reading instances that lack a value for the field, so I think it is
> needed for reading encoded data.
>

That depends on what you mean by "reading".  A record is first read using
the schema it was written with.  Through resolution, it can be subsequently
altered to match various other schemas.  Defaults only come into play when
such a schema has a field not in the written schema.
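
A deliberately simplified model of that resolution step, to show where defaults enter and why a reader that ignores an unknown default key ends up with an error rather than a wrong value. The resolve function and the dict-based records are assumptions for illustration; real Avro resolution also handles type promotion, unions, aliases and more.

```python
def resolve(written: dict, reader_fields: list) -> dict:
    """Project a record decoded with the writer's schema onto reader fields.

    Hypothetical model: `written` is the record as decoded with the writer's
    schema, and each reader field is a dict with "name" and, optionally,
    "default".
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            resolved[name] = written[name]      # default is never consulted
        elif "default" in field:
            resolved[name] = field["default"]   # field missing from written data
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


written = {"id": 7}
new_reader = [{"name": "id"}, {"name": "amount", "default": "\u000c\u006c"}]
# An older library only knows "default", so this field has no default for it.
old_reader = [{"name": "id"}, {"name": "amount", "default-as-string": "31.80"}]

print(resolve(written, new_reader))
try:
    resolve(written, old_reader)
except ValueError as error:
    print("old reader fails:", error)
```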

Parsing Canonical Form indicates whether a schema can be used for that
first, raw, read.  There is no single canonical form for all the various
schemas that it can be resolved to.

Doug

Re: (Default) values for logical types in human-readable form

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

Hi,

On Thu, Oct 19, 2017 at 7:16 AM, Bridger Howell <bh...@sofi.org> wrote:

> So then if an older reader reads a schema field with "default-as-string"
> used instead of "default", it will decide that field has no default? I
> don't really like that, but it's better than using the wrong value (e.g.
> "default" + "default-parser")

I think ignoring the user-specified default value is just as bad as using a
wrong value. I consider both equally breaking changes.

> or erroring on most data reads (changing the "default" field to an object).

But if we can't make this feature non-breaking and have to put it in a new
major release, then I think that it's better to cause an explicit error in
old versions rather than silently getting unexpected behaviour.

> I don't think we can make old readers fail
> properly, since they would have to already have the future knowledge that
> there is supposed to be a default value. Someone correct me if I'm wrong on
> this.
>

What do you mean by failing properly? I think specifying a value that does
not belong to the types allowed by the older specification can reliably
cause a failure, albeit certainly not with an error message that would
describe the cause properly.

On Wed, Oct 18, 2017 at 9:56 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> I suggest that we add an optional key, like "default-as-string", that is
> used to fill in a missing "default" key if there is a reasonable
> conversion.

This would still be a breaking change though, since older versions will
ignore the "default-as-string" field.

> On write, the write schema would convert to the normal
> "default" field for backward-compatibility.

I'm sorry, I can't quite follow, could you please elaborate?

> On read, you can supply only
> the string default to use that instead of the binary one.

I don't understand this either, could you please explain this through an
example?

On Tue, Oct 17, 2017 at 11:53 PM, Bridger Howell <bh...@sofi.org> wrote:

> > I really like the idea of having support for human-readable default
> > values.
> >
> > I think I prefer to keep the way defaults are interpreted separate from
> > logical types, since logical types are basically optional.

So if I understand correctly, you support the idea of human-readable
defaults but not logical-type-dependent interpretation. I don't see how we
could achieve the first without the second, since different logical types
have different human-readable representations. So it seems that the
optional nature of logical types actually makes this feature impossible.

> >         "doc": "'hello world' as a base64-encoded string"

I like the idea of having a doc field. This matches the much-desired JSON
commenting ability the closest. I don't see how this would help with
default values in schemas, since schemas are written directly by users. (Or
is there a tool for doing so?) However, we could do this with the actual
values written to JSON as well. As I wrote earlier, I was afraid to suggest
an additional field like this:

"num": "\u000C\u006C",
"num-human-readable": 31.80

Because users may be tempted to modify the "num-human-readable" field,
thinking that the change will have some effect. However, if we use a doc
string instead:

"num": "\u000C\u006C",
"num-doc": "binary representation of the decimal value 31.80"

then I think most users will realize that they can't modify the value of
"num" by modifying "num-doc".

> > I think this type of approach keeps us neatly separated from logical types,
> > so that having a parser for a default value doesn't require a logical type,

Wouldn't the separate parser approach lead to the same problem in the end?
It is more general and thus allows more use-cases, but if you would like to
specify a decimal value as a number, you still have to have a parser
implemented for it.

> > > I think that the parsing canonical form of a schema
> > > <https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas>
> > > doesn't include the default. I think that makes sense because the
> > > canonical form is what's needed to read encoded data.

That's strange, since according to the specification, the default is used
when reading instances that lack a value for the field, so I think it is
needed for reading encoded data.

So far the discussion focused on default values in schemas, I would
encourage everyone to also share their opinions about actual data written
using JSON encoding.

Br,

Zoltan

Re: (Default) values for logical types in human-readable form

Posted by Bridger Howell <bh...@sofi.org>.

> I don't think we can change the behavior of the "default" key. Otherwise,
> older readers would use the wrong value.

This is true, but the "human-readable default" feature is inherently
incompatible with older readers. My hope was that giving an invalid type
for the default would cause an error when older readers try to parse it,
but that's not the case and you're right. There would still always be an
issue with specially crafted record types.

> I suggest that we add an optional key, like "default-as-string", that is
> used to fill in a missing "default" key if there is a reasonable conversion.

So then if an older reader reads a schema field with "default-as-string"
used instead of "default", it will decide that field has no default? I
don't really like that, but it's better than using the wrong value (e.g.
"default" + "default-parser") or erroring on most data reads (changing the
"default" field to an object). I don't think we can make old readers fail
properly, since they would have to already have the future knowledge that
there is supposed to be a default value. Someone correct me if I'm wrong on
this. (Generically it should be possible if we included schema spec
versions in schemas.)

What would be your criteria for there being a reasonable conversion? Field
type and logical type?

> On write, the write schema would convert to the normal "default" field
> for backward-compatibility.

Good idea - this should be generically possible no matter how
human-readable defaults are implemented in the spec.

> On read, you can supply only the string default to use that instead of
> the binary one. I think we could take care of this entirely in the schema
> parser.

On the same page here.

- Bridger Howell

On Wed, Oct 18, 2017 at 9:56 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> I don't think we can change the behavior of the "default" key. Otherwise,
> older readers would use the wrong value.
>
> I suggest that we add an optional key, like "default-as-string", that is
> used to fill in a missing "default" key if there is a reasonable
> conversion. On write, the write schema would convert to the normal
> "default" field for backward-compatibility. On read, you can supply only
> the string default to use that instead of the binary one. I think we could
> take care of this entirely in the schema parser.
>
> rb
>
> On Tue, Oct 17, 2017 at 11:53 PM, Bridger Howell <bh...@sofi.org> wrote:
>
> > I really like the idea of having support for human-readable default
> > values.
> >
> > I think I prefer to keep the way defaults are interpreted separate from
> > logical types, since logical types are basically optional. I would
> > be surprised if my language of choice could understand an ISO-8601
> > formatted local-date for a field default based on logical type, but I
> > still had to interface with a numeric value in my code.
> >
> > If this doesn't conflict too much with the default value for record
> > fields (?), I would suggest having an object syntax with a "parser" or
> > "type" field in addition to the default property.
> >
> > A sample record:
> > {
> >   "type": "record",
> >   "name": "Foo",
> >   "fields": [
> >     {
> >       "name": "body",
> >       "type": "bytes",
> >       "default": {
> >         "value": "aGVsbG8gd29ybGQ",
> >         "parser": "base64",
> >         "doc": "'hello world' as a base64-encoded string"
> >       }
> >     }
> >   ]
> > }
> >
> > If changing the "default" property like that has too many issues, I
> > suppose a parallel "default-parser" property would do the trick too.
> >
> > I think this type of approach keeps us neatly separated from logical
> > types, so that having a parser for a default value doesn't require a
> > logical type, and maybe makes it clearer which procedure is being
> > performed on the JSON data to convert it to the base field type.
> >
> > -Bridger Howell
> >
> > On Tue, Oct 17, 2017 at 9:57 AM, Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > I think that the parsing canonical form of a schema
> > > <https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canoni
> > > cal+Form+for+Schemas>
> > > doesn't include the default. I think that makes sense because the
> > canonical
> > > form is what's needed to read encoded data. Anyone with more context:
> is
> > > that correct?
> > >
> > > In my opinion, that makes how we handle defaults a bit more flexible
> > > because schemas with different defaults are "the same". I'd support
> > adding
> > > a new default field that handles values more naturally. We've always
> had
> > a
> > > problem with binary as well and I'd like to see us use base64 encoded
> > > values instead of the current strategy.
> > >
> > > rb
> > >
> > > On Tue, Oct 17, 2017 at 8:16 AM, Zoltan Ivanfi <zi...@cloudera.com>
> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to start a discussion about making default values and
> > values
> > > > in general human-readable for logical types.
> > > >
> > > > Currently default values for logical types have to be specified in a
> > JSON
> > > > string as the binary representation of the backing primary type
> (e.g.,
> > > > "\u0000"). Some users intuitively try to specify a human-readable
> > logical
> > > > value in this string instead (e.g., "0.00"). This is of course a
> valid
> > > byte
> > > > sequence and as such is accepted, but it results in unexpected
> > behaviour
> > > (a
> > > > different default value than intended). Apart from being error prone,
> > > > specifying default values this way is also tedious. To keep this
> e-mail
> > > > brief, I won't list specific examples here, please see AVRO-2087
> > > > <https://issues.apache.org/jira/browse/AVRO-2087> for details
> instead.
> > > >
> > > > The problem of non-human-readable values applies to JSON encoding of
> > > actual
> > > > data as well. One reason for using JSON is that it is human readable
> > and
> > > > therefore easy to debug. Seeing "\u00018" in a JSON file is not too
> > > > intuitive and this specific example is actually quite misleading as
> > well
> > > > (it can be easily misread as "\u0018").
> > > >
> > > > Introducing a new default value field (called human-readable-default
> or
> > > > logical-default for example) would allow easier specification of
> > default
> > > > values. (It doesn't solve the problem of accidentally misusing the
> > > existing
> > > > field though.) It is, however, not backwards compatible. An older
> Avro
> > > > library would ignore the new field and use a different default value.
> > > >
> > > > Introducing human-readable values in the JSON encoding is even more
> > > clearly
> > > > a breaking change. (Although for JSON we could add the human-readable
> > > value
> > > > as a separate extra field that gets ignored when reading. Problem is,
> > > users
> > > > may be tempted to change the value and be surprised. It's a pity that
> > > JSON
> > > > does not allow comments.)
> > > >
> > > > In your opinions, what would be the best way to deal with this
> problem?
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> > --
> >
> >
> > The information contained in this email message is PRIVATE and intended
> > only for the personal and confidential use of the recipient named above.
> If
> > the reader of this message is not the intended recipient or an agent
> > responsible for delivering it to the intended recipient, you are hereby
> > notified that you have received this message in error and that any
> review,
> > dissemination, distribution or copying of this message is strictly
> > prohibited.  If you have received this communication in error, please
> > notify us immediately by email, and delete the original message.
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 

Bridger Howell

Software Engineer

-- 


The information contained in this email message is PRIVATE and intended 
only for the personal and confidential use of the recipient named above. If 
the reader of this message is not the intended recipient or an agent 
responsible for delivering it to the intended recipient, you are hereby 
notified that you have received this message in error and that any review, 
dissemination, distribution or copying of this message is strictly 
prohibited.  If you have received this communication in error, please 
notify us immediately by email, and delete the original message.

Re: (Default) values for logical types in human-readable form

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

I don't think we can change the behavior of the "default" key. Otherwise,
older readers would use the wrong value.

I suggest that we add an optional key, like "default-as-string", that is
used to fill in a missing "default" key if there is a reasonable
conversion. On write, the write schema would convert to the normal
"default" field for backward-compatibility. On read, you can supply only
the string default to use that instead of the binary one. I think we could
take care of this entirely in the schema parser.
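
A minimal sketch of how a schema parser might fill in a missing "default"
from such a key, for a bytes-backed decimal. The key name "default-as-string"
is only the suggestion above, and the helper names fill_default and
decimal_text_to_bytes_default are hypothetical; nothing here is part of the
Avro spec or of any Avro library.

```python
from decimal import Decimal


def decimal_text_to_bytes_default(text: str, scale: int) -> str:
    """Turn a readable decimal such as "0.00" into the JSON string form
    used for bytes defaults (one code point in 0-255 per byte)."""
    unscaled = Decimal(text).scaleb(scale)
    if unscaled != unscaled.to_integral_value():
        raise ValueError(f"{text!r} has more than {scale} fractional digits")
    n = int(unscaled)
    # Big-endian two's complement with room for the sign bit.
    length = max(1, (n.bit_length() + 8) // 8)
    raw = n.to_bytes(length, "big", signed=True)
    return raw.decode("latin-1")


def fill_default(field: dict) -> dict:
    """Fill in "default" from "default-as-string" when only the latter is set.

    Assumption: only bytes-backed decimals are handled in this sketch.
    """
    if "default" in field or "default-as-string" not in field:
        return field
    ftype = field.get("type")
    if (
        isinstance(ftype, dict)
        and ftype.get("type") == "bytes"
        and ftype.get("logicalType") == "decimal"
    ):
        filled = dict(field)
        filled["default"] = decimal_text_to_bytes_default(
            field["default-as-string"], ftype.get("scale", 0)
        )
        return filled
    return field
```

With a scale of 2, the readable value "0.00" becomes the one-character
string "\u0000", which is the form older readers expect under "default".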

rb

On Tue, Oct 17, 2017 at 11:53 PM, Bridger Howell <bh...@sofi.org> wrote:

> I really like the idea of having support for human-readable default values.
>
> I think I prefer to keep the way defaults are interpreted separate from
> logical types, since logical types are basically optional. I would
> be surprised if my language of choice could understand an ISO-8601
> formatted local-date for a field default based on logical type, but I still
> had to interface with a numeric value in my code.
>
> If this doesn't conflict too much with the default value for record fields
> (?), I would suggest having an object syntax with a "parser" or "type"
> field in addition to the default property.
>
> A sample record:
> {
>   "type": "record",
>   "name": "Foo",
>   "fields": [
>     {
>       "name": "body",
>       "type": "bytes",
>       "default": {
>         "value": "aGVsbG8gd29ybGQ=",
>         "parser": "base64",
>         "doc": "'hello world' as a base64-encoded string"
>       }
>     }
>   ]
> }
>
> If changing the "default" property like that has too many issues, I suppose
> a parallel "default-parser" property would do the trick too.
>
> I think this type of approach keeps us neatly separated from logical types,
> so that having a parser for a default value doesn't require a logical type,
> and maybe makes it clearer which procedure is being performed on the JSON
> data to convert it to the base field type.
>
> -Bridger Howell
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: (Default) values for logical types in human-readable form

Posted by Bridger Howell <bh...@sofi.org>.

I really like the idea of having support for human-readable default values.

I think I prefer to keep the way defaults are interpreted separate from
logical types, since logical types are basically optional. I would
be surprised if my language of choice could understand an ISO-8601
formatted local-date for a field default based on logical type, but I still
had to interface with a numeric value in my code.

If this doesn't conflict too much with the default value for record fields
(?), I would suggest having an object syntax with a "parser" or "type"
field in addition to the default property.

A sample record:
{
  "type": "record",
  "name": "Foo",
  "fields": [
    {
      "name": "body",
      "type": "bytes",
      "default": {
        "value": "aGVsbG8gd29ybGQ=",
        "parser": "base64",
        "doc": "'hello world' as a base64-encoded string"
      }
    }
  ]
}

If changing the "default" property like that has too many issues, I suppose
a parallel "default-parser" property would do the trick too.

I think this type of approach keeps us neatly separated from logical types,
so that having a parser for a default value doesn't require a logical type,
and maybe makes it clearer which procedure is being performed on the JSON
data to convert it to the base field type.
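
A rough sketch of how a reader could resolve the object syntax, assuming a
small registry of named parsers. The registry PARSERS and the function
resolve_default are hypothetical names, not an existing Avro API. The sketch
only unwraps the object for "bytes" fields, since a record-typed field's
default is also a JSON object and would otherwise be ambiguous.

```python
import base64


def _parse_base64(text: str) -> bytes:
    # Tolerate values written without trailing "=" padding.
    return base64.b64decode(text + "=" * (-len(text) % 4))


# Hypothetical registry: parser name -> function from JSON string to bytes.
PARSERS = {"base64": _parse_base64}


def resolve_default(field: dict):
    """Return the field's default, decoding it if it uses the object syntax."""
    default = field.get("default")
    is_wrapped = (
        field.get("type") == "bytes"
        and isinstance(default, dict)
        and "parser" in default
        and "value" in default
    )
    if not is_wrapped:
        # Plain defaults are passed through unchanged.
        return default
    return PARSERS[default["parser"]](default["value"])
```

A parallel "default-parser" property could reuse the same registry; only the
place where the parser name is looked up would change.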

-Bridger Howell

On Tue, Oct 17, 2017 at 9:57 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> I think that the parsing canonical form of a schema
> <https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas>
> doesn't include the default. I think that makes sense because the canonical
> form is what's needed to read encoded data. Anyone with more context: is
> that correct?
>
> In my opinion, that makes how we handle defaults a bit more flexible
> because schemas with different defaults are "the same". I'd support adding
> a new default field that handles values more naturally. We've always had a
> problem with binary as well and I'd like to see us use base64 encoded
> values instead of the current strategy.
>
> rb
>


Re: (Default) values for logical types in human-readable form

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

I think that the parsing canonical form of a schema
<https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas>
doesn't include the default. I think that makes sense because the canonical
form is what's needed to read encoded data. Anyone with more context: is
that correct?

In my opinion, that makes how we handle defaults a bit more flexible
because schemas with different defaults are "the same". I'd support adding
a new default field that handles values more naturally. We've always had a
problem with binary as well and I'd like to see us use base64 encoded
values instead of the current strategy.
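
To illustrate the point about defaults: the spec's Parsing Canonical Form
keeps only the attributes type, name, fields, symbols, items, values and
size, so "default" is dropped. The sketch below implements only that
stripping step (the full transformation also resolves full names, fixes
attribute order and normalizes strings and integers), and strip_non_parsing
is a hypothetical helper name.

```python
# Attributes the Parsing Canonical Form keeps; everything else is stripped.
KEEP = {"type", "name", "fields", "symbols", "items", "values", "size"}


def strip_non_parsing(schema):
    """Recursively drop attributes that are not needed to read encoded data."""
    if isinstance(schema, dict):
        return {k: strip_non_parsing(v) for k, v in schema.items() if k in KEEP}
    if isinstance(schema, list):
        return [strip_non_parsing(item) for item in schema]
    return schema


def record_with_default(default: str) -> dict:
    # Two schemas built this way differ only in the field default.
    return {
        "type": "record",
        "name": "Foo",
        "fields": [{"name": "body", "type": "bytes", "default": default}],
    }


same = strip_non_parsing(record_with_default("\u0000")) == strip_non_parsing(
    record_with_default("0.00")
)
```

Here same evaluates to True: after stripping, the two schemas are
indistinguishable even though their defaults differ.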

rb

On Tue, Oct 17, 2017 at 8:16 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> I would like to start a discussion about making default values and values
> in general human-readable for logical types.
>
> Currently default values for logical types have to be specified in a JSON
> string as the binary representation of the backing primary type (e.g.,
> "\u0000"). Some users intuitively try to specify a human-readable logical
> value in this string instead (e.g., "0.00"). This is of course a valid byte
> sequence and as such is accepted, but it results in unexpected behaviour (a
> different default value than intended). Apart from being error prone,
> specifying default values this way is also tedious. To keep this e-mail
> brief, I won't list specific examples here, please see AVRO-2087
> <https://issues.apache.org/jira/browse/AVRO-2087> for details instead.
>
> The problem of non-human-readable values applies to JSON encoding of actual
> data as well. One reason for using JSON is that it is human readable and
> therefore easy to debug. Seeing "\u00018" in a JSON file is not too
> intuitive and this specific example is actually quite misleading as well
> (it can be easily misread as "\u0018").
>
> Introducing a new default value field (called human-readable-default or
> logical-default for example) would allow easier specification of default
> values. (It doesn't solve the problem of accidentally misusing the existing
> field though.) It is, however, not backwards compatible. An older Avro
> library would ignore the new field and use a different default value.
>
> Introducing human-readable values in the JSON encoding is even more clearly
> a breaking change. (Although for JSON we could add the human-readable value
> as a separate extra field that gets ignored when reading. Problem is, users
> may be tempted to change the value and be surprised. It's a pity that JSON
> does not allow comments.)
>
> In your opinions, what would be the best way to deal with this problem?
>
> Thanks,
>
> Zoltan
>



-- 
Ryan Blue
Software Engineer
Netflix