Posted to dev@avro.apache.org by Kousuke Saruta <sa...@apache.org> on 2023/08/09 15:30:13 UTC

Specification of namespaces

Hi developers,

I'd like to discuss the specification of namespaces.
According to the specification, each dot-separated portion of a namespace
should match [a-zA-Z_][a-zA-Z0-9_]*.
https://avro.apache.org/docs/1.11.1/specification/#names
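
As a rough illustration of that rule, here is a minimal Python sketch that
checks every dot-separated portion of a namespace against the pattern from
the Names section of the specification. The function name
is_valid_namespace is hypothetical and is not part of any Avro binding.

```python
import re

# One dot-separated portion of a namespace, per the Names section of the spec.
_NAME_PART = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_namespace(namespace: str) -> bool:
    """Return True if every dot-separated portion matches the name pattern.

    The empty string is treated as valid because the spec uses it for the
    null namespace. (Hypothetical helper, not an Avro API.)
    """
    if namespace == "":
        return True
    return all(_NAME_PART.fullmatch(part) for part in namespace.split("."))
```

With this check, a namespace such as "org.apache.avro" passes, while one
containing "$" or starting with a digit does not.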

But the actual implementations of some language bindings don't follow the
specification and accept any characters.
In particular, the Java binding generates namespaces that contain "$" for
inner classes generated by protobuf.

So, should we review the namespace specification?

Thanks,
Kousuke

Re: Specification of namespaces

Posted by Kousuke Saruta <sa...@apache.org>.
Let me correct my examples.
The attribute is "aliases", not "alias".

> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1 bad alias","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1 bad alias"}]}

{"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","aliases":["1 bad alias"],"symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1 bad alias"}]}

> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1bad alias.foo.bar","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1bad alias.foo.bar"}]}

{"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","aliases":["1bad alias.foo.bar"],"symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1bad alias.foo.bar"}]}

On Fri, Aug 18, 2023 at 15:37 Kousuke Saruta <sa...@apache.org> wrote:

> Hi Michael,
>
>
>> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1
>> bad alias","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1
>> bad alias"}]}
>>
>
> In the current Java binding, the namespace portion in an alias is accepted
> without validation.
> So, the following schema is acceptable.
>
> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1bad
> alias.foo.bar","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"exampleEnum"}]}
>
> I'm discussing namespace in this thread, so this behavior seems O.K to me.
>
> But reference to another named types is not implemented for the Java
> binding.
> So the following schema is not accepted.
>
> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1bad
> alias.foo.bar","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1bad
> alias.foo.bar"}]}
>
> I have a plan to fix it.
>
> On Fri, Aug 18, 2023 at 11:33 Michael A. Smith <mi...@smith-li.com> wrote:
>
>> I found I'm still a little confused at how using aliases to correct
>> invalid names should work. Maybe you can define an alias that is an
>> invalid name, but having done so, can you use it? I tried this schema
>> in both the Python and Java implementations.
>>
>>
>> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1
>> bad alias","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1
>> bad alias"}]}
>>
>> I expected it to error in Python, because I know Python requires valid
>> names for aliases. But Java also errored with "schema failed: Illegal
>> initial character: 1 bad alias". I am not sure if the error is from
>> the alias definition or its use.
>>
>> If my example is flawed, can someone supply a correct one?
>>
>> On Thu, Aug 17, 2023 at 4:53 AM Oscar Westra van Holthe - Kind
>> <os...@westravanholthe.nl> wrote:
>> >
>> > On Mon, 14 Aug 2023 at 14:11, Ryan Skraba <ry...@skraba.com> wrote:
>> >
>> > > I think the right thing to do is [to] use a system
>> > > property / schema aliases to help people migrate back to the correct
>> > > behaviour.  If you are actually using Avro/Protobuf together, you
>> > > might be the best person to help us figure out the right way to do
>> > > this migration!
>> > >
>> >
>> > The idea that aliases can be used to evolve a schema with invalid names
>> to
>> > a schema with valid names is a sensible one, and currently hidden in the
>> > schema resolution rules in the specification.
>> >
>> > I've added AVRO-3833 <https://issues.apache.org/jira/browse/AVRO-3833>
>> (with
>> > PR <https://github.com/apache/avro/pull/2448>) because I wanted to
>> clarify
>> > that names must
>> > be unique (because otherwise schema resolution cannot work), and that
>> this
>> > includes aliases. The change also includes this migration/fix option.
>> >
>> > Kind regards,
>> > Oscar
>> >
>> > --
>> > ✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
>>
>

Re: Specification of namespaces

Posted by Kousuke Saruta <sa...@apache.org>.
Hi Michael,

> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1 bad alias","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1 bad alias"}]}
>

In the current Java binding, the namespace portion of an alias is accepted
without validation.
So, the following schema is accepted.

{"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1bad alias.foo.bar","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"exampleEnum"}]}

I'm discussing namespaces in this thread, so this behavior seems OK to me.

But referring to a named type by its alias is not implemented in the Java
binding.
So the following schema is not accepted.

{"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1bad alias.foo.bar","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1bad alias.foo.bar"}]}

I plan to fix it.

On Fri, Aug 18, 2023 at 11:33 Michael A. Smith <mi...@smith-li.com> wrote:

> I found I'm still a little confused at how using aliases to correct
> invalid names should work. Maybe you can define an alias that is an
> invalid name, but having done so, can you use it? I tried this schema
> in both the Python and Java implementations.
>
>
> {"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1
> bad alias","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1
> bad alias"}]}
>
> I expected it to error in Python, because I know Python requires valid
> names for aliases. But Java also errored with "schema failed: Illegal
> initial character: 1 bad alias". I am not sure if the error is from
> the alias definition or its use.
>
> If my example is flawed, can someone supply a correct one?
>
> On Thu, Aug 17, 2023 at 4:53 AM Oscar Westra van Holthe - Kind
> <os...@westravanholthe.nl> wrote:
> >
> > On Mon, 14 Aug 2023 at 14:11, Ryan Skraba <ry...@skraba.com> wrote:
> >
> > > I think the right thing to do is [to] use a system
> > > property / schema aliases to help people migrate back to the correct
> > > behaviour.  If you are actually using Avro/Protobuf together, you
> > > might be the best person to help us figure out the right way to do
> > > this migration!
> > >
> >
> > The idea that aliases can be used to evolve a schema with invalid names
> to
> > a schema with valid names is a sensible one, and currently hidden in the
> > schema resolution rules in the specification.
> >
> > I've added AVRO-3833 <https://issues.apache.org/jira/browse/AVRO-3833>
> (with
> > PR <https://github.com/apache/avro/pull/2448>) because I wanted to
> clarify
> > that names must
> > be unique (because otherwise schema resolution cannot work), and that
> this
> > includes aliases. The change also includes this migration/fix option.
> >
> > Kind regards,
> > Oscar
> >
> > --
> > ✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
>

Re: Specification of namespaces

Posted by Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>.
Hi,

A bit of a late reply, but an example of using aliases to rename fields is
already part of the tests:
org.apache.avro.TestReadingWritingDataInEvolvedSchemas#aliasesInSchema
<https://github.com/apache/avro/blob/master/lang/java/avro/src/test/java/org/apache/avro/TestReadingWritingDataInEvolvedSchemas.java#L404>


Kind regards,
Oscar

On Fri, 18 Aug 2023 at 13:18, Michael A. Smith <mi...@smith-li.com> wrote:

> Thanks for the explanation, Oscar. Would you be willing to add some small
> demo schema to your spec PR? It can serve as an example as well as a simple
> interop test case.
>
> Thanks again,
> Michael
>
> On Fri, Aug 18, 2023 at 03:17 Oscar Westra van Holthe - Kind <
> oscar@westravanholthe.nl> wrote:
>
> > On Fri, 18 Aug 2023 at 04:32, Michael A. Smith <mi...@smith-li.com>
> > wrote:
> >
> > > I found I'm still a little confused at how using aliases to correct
> > > invalid names should work. Maybe you can define an alias that is an
> > > invalid name, but having done so, can you use it? I tried this schema
> > > in both the Python and Java implementations.
> > >
> >
> > Correcting names -- and other projections -- happen during schema
> > resolution
> > when reading Avro data. Such a process requires that the write schema is
> > parsed
> > without any validation. An exception when parsing the write schema means
> > the
> > data becomes unreadable.
> >
> > When reading the data, the read schema is first resolved against the
> write
> > schema.
> > One of the things that happen during schema resolution is that the names
> in
> > the
> > write schema are matched against the names and aliases in the read
> schema.
> >
> > This means you won't be using the aliases directly.
> >
> > You can test this theory by encoding data as Avro bytes without any
> header.
> > You'll
> > find you can decode the bytes using a different schema that is identical
> > except for
> > its names. This works, as these schemata yield the same sequence of
> bytes.
> >
> > Names and aliases in the schemas allow enhancing process to do other nice
> > things:
> > - skip written fields that were removed from the read schema
> > - fill in default values for new fields added to the read schema
> > - match different type orders in unions
> > - do some conversions, like reading an int as a long
> >
> >
> > Kind regards,
> > Oscar
> >
> > --
> >
> > ✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
> >
>


-- 

✉️ Oscar Westra van Holthe - Kind <op...@apache.org>

🌐 https://github.com/opwvhk/

Re: Specification of namespaces

Posted by "Michael A. Smith" <mi...@smith-li.com>.
Thanks for the explanation, Oscar. Would you be willing to add some small
demo schema to your spec PR? It can serve as an example as well as a simple
interop test case.

Thanks again,
Michael

On Fri, Aug 18, 2023 at 03:17 Oscar Westra van Holthe - Kind <
oscar@westravanholthe.nl> wrote:

> On Fri, 18 Aug 2023 at 04:32, Michael A. Smith <mi...@smith-li.com>
> wrote:
>
> > I found I'm still a little confused at how using aliases to correct
> > invalid names should work. Maybe you can define an alias that is an
> > invalid name, but having done so, can you use it? I tried this schema
> > in both the Python and Java implementations.
> >
>
> Correcting names -- and other projections -- happen during schema
> resolution
> when reading Avro data. Such a process requires that the write schema is
> parsed
> without any validation. An exception when parsing the write schema means
> the
> data becomes unreadable.
>
> When reading the data, the read schema is first resolved against the write
> schema.
> One of the things that happen during schema resolution is that the names in
> the
> write schema are matched against the names and aliases in the read schema.
>
> This means you won't be using the aliases directly.
>
> You can test this theory by encoding data as Avro bytes without any header.
> You'll
> find you can decode the bytes using a different schema that is identical
> except for
> its names. This works, as these schemata yield the same sequence of bytes.
>
> Names and aliases in the schemas allow enhancing process to do other nice
> things:
> - skip written fields that were removed from the read schema
> - fill in default values for new fields added to the read schema
> - match different type orders in unions
> - do some conversions, like reading an int as a long
>
>
> Kind regards,
> Oscar
>
> --
>
> ✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
>

Re: Specification of namespaces

Posted by Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>.
On Fri, 18 Aug 2023 at 04:32, Michael A. Smith <mi...@smith-li.com> wrote:

> I found I'm still a little confused at how using aliases to correct
> invalid names should work. Maybe you can define an alias that is an
> invalid name, but having done so, can you use it? I tried this schema
> in both the Python and Java implementations.
>

Correcting names -- and other projections -- happens during schema resolution
when reading Avro data. Such a process requires that the write schema is
parsed without any validation. An exception when parsing the write schema
means the data becomes unreadable.

When reading the data, the read schema is first resolved against the write
schema. One of the things that happen during schema resolution is that the
names in the write schema are matched against the names and aliases in the
read schema.

This means you won't be using the aliases directly.
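
A minimal Python sketch of that matching step, under stated assumptions:
names and aliases are taken to be already fully qualified, and the helper
names_match is hypothetical rather than the API of any Avro binding.

```python
def names_match(writer_fullname, reader_schema):
    """Return True if the writer's name equals the reader's name or an alias.

    Assumes fully qualified names; namespace handling is left out.
    (Hypothetical helper, not an Avro API.)
    """
    candidates = [reader_schema["name"], *reader_schema.get("aliases", [])]
    return writer_fullname in candidates


# Reader-side enum: the valid name is used everywhere, and the old invalid
# writer name appears only as an alias that resolution can match against.
reader_enum = {
    "type": "enum",
    "name": "exampleEnum",
    "aliases": ["1 bad alias"],
    "symbols": ["A", "B", "C"],
}
```

Here names_match("1 bad alias", reader_enum) is true, so data written under
the old name would resolve to exampleEnum without the alias ever being used
as a type reference.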

You can test this theory by encoding data as Avro bytes without any header.
You'll find you can decode the bytes using a different schema that is
identical except for its names. This works, as these schemata yield the same
sequence of bytes.
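
The experiment can be sketched without any Avro library. The following
Python code is a hand-written, deliberately tiny encoder and decoder that
only handles records with int, long and string fields; the function names
and the two example schemas are made up for illustration.

```python
def encode_long(n):
    """Zigzag plus variable-length encoding, as Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Inverse of encode_long; returns (value, next position)."""
    z, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos


def encode_record(schema, datum):
    """Concatenate the field encodings in schema order; no names are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = datum[field["name"]]
        if field["type"] in ("int", "long"):
            out += encode_long(value)
        elif field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode_record(schema, buf):
    """Read the fields back in schema order, keyed by this schema's names."""
    pos, datum = 0, {}
    for field in schema["fields"]:
        if field["type"] in ("int", "long"):
            datum[field["name"]], pos = decode_long(buf, pos)
        elif field["type"] == "string":
            length, pos = decode_long(buf, pos)
            datum[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise NotImplementedError(field["type"])
    return datum


# Two schemas that differ only in their names (both are made up).
WRITER = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"}, {"name": "name", "type": "string"}]}
READER = {"type": "record", "name": "Person", "fields": [
    {"name": "identifier", "type": "long"}, {"name": "label", "type": "string"}]}

payload = encode_record(WRITER, {"id": 42, "name": "avro"})
decoded = decode_record(READER, payload)
```

The payload contains only the field values, so decoding it with READER
yields the same values under the reader's field names.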

Names and aliases in the schemas allow the resolution process to do other
nice things:
- skip written fields that were removed from the read schema
- fill in default values for new fields added to the read schema
- match different type orders in unions
- do some conversions, like reading an int as a long
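
Two of these behaviours, skipping removed fields and filling in defaults,
can be sketched in a few lines of Python. This is only an illustration on
plain dicts: resolve_record and the example schema are hypothetical, and
union ordering and numeric promotion are not modelled.

```python
def resolve_record(writer_datum, reader_schema):
    """Project a datum decoded with the write schema onto the read schema."""
    result = {}
    for field in reader_schema["fields"]:
        # A reader field matches a writer field by name or by any alias.
        candidates = [field["name"], *field.get("aliases", [])]
        match = next((n for n in candidates if n in writer_datum), None)
        if match is not None:
            result[field["name"]] = writer_datum[match]
        elif "default" in field:
            result[field["name"]] = field["default"]  # new field: use default
        else:
            raise ValueError(f"no writer field or default for {field['name']!r}")
    # Writer fields that no reader field matched are simply not copied.
    return result


reader_schema = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long", "aliases": ["user_id"]},
    {"name": "email", "type": "string", "default": ""},
]}
resolved = resolve_record({"user_id": 7, "legacy": "x"}, reader_schema)
```

In this example the written field "legacy" is dropped, "user_id" is read as
"id" through the alias, and "email" takes its default.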


Kind regards,
Oscar

-- 

✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

Re: Specification of namespaces

Posted by "Michael A. Smith" <mi...@smith-li.com>.
I found I'm still a little confused about how using aliases to correct
invalid names should work. Maybe you can define an alias that is an
invalid name, but having done so, can you use it? I tried this schema
in both the Python and Java implementations.

{"type":"record","name":"AliasReferenceExample","fields":[{"name":"anEnum","type":{"type":"enum","name":"exampleEnum","alias":"1
bad alias","symbols":["A","B","C"]}},{"name":"anotherEnum","type":"1
bad alias"}]}

I expected it to error in Python, because I know Python requires valid
names for aliases. But Java also errored with "schema failed: Illegal
initial character: 1 bad alias". I am not sure if the error is from
the alias definition or its use.

If my example is flawed, can someone supply a correct one?

On Thu, Aug 17, 2023 at 4:53 AM Oscar Westra van Holthe - Kind
<os...@westravanholthe.nl> wrote:
>
> On Mon, 14 Aug 2023 at 14:11, Ryan Skraba <ry...@skraba.com> wrote:
>
> > I think the right thing to do is [to] use a system
> > property / schema aliases to help people migrate back to the correct
> > behaviour.  If you are actually using Avro/Protobuf together, you
> > might be the best person to help us figure out the right way to do
> > this migration!
> >
>
> The idea that aliases can be used to evolve a schema with invalid names to
> a schema with valid names is a sensible one, and currently hidden in the
> schema resolution rules in the specification.
>
> I've added AVRO-3833 <https://issues.apache.org/jira/browse/AVRO-3833> (with
> PR <https://github.com/apache/avro/pull/2448>) because I wanted to clarify
> that names must
> be unique (because otherwise schema resolution cannot work), and that this
> includes aliases. The change also includes this migration/fix option.
>
> Kind regards,
> Oscar
>
> --
> ✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

Re: Specification of namespaces

Posted by Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>.
On Mon, 14 Aug 2023 at 14:11, Ryan Skraba <ry...@skraba.com> wrote:

> I think the right thing to do is [to] use a system
> property / schema aliases to help people migrate back to the correct
> behaviour.  If you are actually using Avro/Protobuf together, you
> might be the best person to help us figure out the right way to do
> this migration!
>

The idea that aliases can be used to evolve a schema with invalid names to
a schema with valid names is a sensible one, and currently hidden in the
schema resolution rules in the specification.

I've added AVRO-3833 <https://issues.apache.org/jira/browse/AVRO-3833> (with
PR <https://github.com/apache/avro/pull/2448>) because I wanted to clarify
that names must
be unique (because otherwise schema resolution cannot work), and that this
includes aliases. The change also includes this migration/fix option.
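
As a sketch of what such a migration could look like on the schema side,
the Python snippet below moves a named schema to a valid namespace and keeps
the old full name as an alias. The helper migrate_named_schema and the
namespaces are hypothetical, and whether a given binding accepts an alias
containing "$" is exactly the open question in this thread.

```python
def migrate_named_schema(schema, new_namespace):
    """Return a copy under new_namespace, with the old full name as an alias.

    (Hypothetical helper, not an Avro API; the input schema is not modified.)
    """
    old_namespace = schema.get("namespace", "")
    old_fullname = (
        f"{old_namespace}.{schema['name']}" if old_namespace else schema["name"]
    )
    migrated = dict(schema)
    migrated["namespace"] = new_namespace
    migrated["aliases"] = [*schema.get("aliases", []), old_fullname]
    return migrated


# Made-up write schema whose namespace contains "$".
old_schema = {"type": "record", "name": "Msg",
              "namespace": "com.example.Outer$Inner", "fields": []}
new_schema = migrate_named_schema(old_schema, "com.example.Outer.Inner")
```

The resulting read schema has a valid namespace, and schema resolution can
still match data written under the old name through the alias.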

Kind regards,
Oscar

-- 
✉️ Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

Re: Specification of namespaces

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!  I don't think SDKs should be generating namespaces that
contain invalid names in any case, and this is a bug in the Java SDK.

One of the unfortunate consequences of "bad" names is that names are
never _really_ necessary to deserialize/serialize data, so someone can
be using the same language SDK over several versions with buggy names
and never notice until they try and interop with another language...

I think the right thing to do is fix the Java SDK, and use a system
property / schema aliases to help people migrate back to the correct
behaviour.  If you are actually using Avro/Protobuf together, you
might be the best person to help us figure out the right way to do
this migration!

All my best, Ryan

On Fri, Aug 11, 2023 at 6:04 PM Kousuke Saruta <fr...@gmail.com> wrote:
>
> Hi Martin,
> Thank you for the comment.
>
>
> > Hi,
> >
> > On Wed, Aug 9, 2023 at 6:30 PM Kousuke Saruta <sa...@apache.org> wrote:
> >
> > > Hi developers,
> > >
> > > I'd like to discuss the specification of namespace.
> > > According to the specification, each dot separated portion of a namespace
> > > should be [a-zA-Z_][a-zA-Z0-9_]*.
> > > https://avro.apache.org/docs/1.11.1/specification/#names
> > >
> > > But the actual implementations of some language bindings don't follow the
> > > specification, and accept any characters.
> > > Especially, the Java binding generates namespaces which contain "$" for
> > > inner classes generated by protobuf.
> > >
> > > So, should we need to review the namespace specification?
> > >
> >
> > To the developers who are familiar with the Java SDK: What problems do you
> > see if the generator stops producing "$", i.e. do something like
> > generated.replace('$', '') ?
> > Would that break existing apps ?
> >
>
> If we replace "$" with any other character in the new version of Avro,
> data serialized by an old Avro cannot be converted back to protobuf format,
> right?
>
> On Thu, Aug 10, 2023 at 16:38 Martin Grigorov <mg...@apache.org> wrote:
>
> > Hi,
> >
> > On Wed, Aug 9, 2023 at 6:30 PM Kousuke Saruta <sa...@apache.org> wrote:
> >
> > > Hi developers,
> > >
> > > I'd like to discuss the specification of namespace.
> > > According to the specification, each dot separated portion of a namespace
> > > should be [a-zA-Z_][a-zA-Z0-9_]*.
> > > https://avro.apache.org/docs/1.11.1/specification/#names
> > >
> > > But the actual implementations of some language bindings don't follow the
> > > specification, and accept any characters.
> > > Especially, the Java binding generates namespaces which contain "$" for
> > > inner classes generated by protobuf.
> > >
> > > So, should we need to review the namespace specification?
> > >
> >
> > To the developers who are familiar with the Java SDK: What problems do you
> > see if the generator stops producing "$", i.e. do something like
> > generated.replace('$', '') ?
> > Would that break existing apps ?
> >
> >
> >
> > >
> > > Thanks,
> > > Kousuke
> > >
> >

Re: Specification of namespaces

Posted by Kousuke Saruta <fr...@gmail.com>.
Hi Martin,
Thank you for the comment.


> Hi,
>
> On Wed, Aug 9, 2023 at 6:30 PM Kousuke Saruta <sa...@apache.org> wrote:
>
> > Hi developers,
> >
> > I'd like to discuss the specification of namespace.
> > According to the specification, each dot separated portion of a namespace
> > should be [a-zA-Z_][a-zA-Z0-9_]*.
> > https://avro.apache.org/docs/1.11.1/specification/#names
> >
> > But the actual implementations of some language bindings don't follow the
> > specification, and accept any characters.
> > Especially, the Java binding generates namespaces which contain "$" for
> > inner classes generated by protobuf.
> >
> > So, should we need to review the namespace specification?
> >
>
> To the developers who are familiar with the Java SDK: What problems do you
> see if the generator stops producing "$", i.e. do something like
> generated.replace('$', '') ?
> Would that break existing apps ?
>

If we replace "$" with any other character in a new version of Avro,
data serialized by an older version cannot be converted back to the protobuf
format, right?

On Thu, Aug 10, 2023 at 16:38 Martin Grigorov <mg...@apache.org> wrote:

> Hi,
>
> On Wed, Aug 9, 2023 at 6:30 PM Kousuke Saruta <sa...@apache.org> wrote:
>
> > Hi developers,
> >
> > I'd like to discuss the specification of namespace.
> > According to the specification, each dot separated portion of a namespace
> > should be [a-zA-Z_][a-zA-Z0-9_]*.
> > https://avro.apache.org/docs/1.11.1/specification/#names
> >
> > But the actual implementations of some language bindings don't follow the
> > specification, and accept any characters.
> > Especially, the Java binding generates namespaces which contain "$" for
> > inner classes generated by protobuf.
> >
> > So, should we need to review the namespace specification?
> >
>
> To the developers who are familiar with the Java SDK: What problems do you
> see if the generator stops producing "$", i.e. do something like
> generated.replace('$', '') ?
> Would that break existing apps ?
>
>
>
> >
> > Thanks,
> > Kousuke
> >
>

Re: Specification of namespaces

Posted by Martin Grigorov <mg...@apache.org>.
Hi,

On Wed, Aug 9, 2023 at 6:30 PM Kousuke Saruta <sa...@apache.org> wrote:

> Hi developers,
>
> I'd like to discuss the specification of namespace.
> According to the specification, each dot separated portion of a namespace
> should be [a-zA-Z_][a-zA-Z0-9_]*.
> https://avro.apache.org/docs/1.11.1/specification/#names
>
> But the actual implementations of some language bindings don't follow the
> specification, and accept any characters.
> Especially, the Java binding generates namespaces which contain "$" for
> inner classes generated by protobuf.
>
> So, should we need to review the namespace specification?
>

To the developers who are familiar with the Java SDK: What problems do you
see if the generator stops producing "$", i.e. does something like
generated.replace('$', '')?
Would that break existing apps?



>
> Thanks,
> Kousuke
>