Posted to dev@avro.apache.org by Nilesh Yadav <ni...@google.com.INVALID> on 2022/05/03 22:06:57 UTC

Supporting unicode named fields

Hello,

As per https://avro.apache.org/docs/current/spec.html#names, field names are
restricted to ASCII letters, digits, and underscores in Avro schemas.
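
As a rough illustration of the rule referenced above, the following sketch
checks candidate names against the pattern given in the names section of the
specification (first character in [A-Za-z_], remaining characters in
[A-Za-z0-9_]). The helper name is_spec_compliant_name is hypothetical and is
not part of any Avro library; individual implementations may be more lenient
than this check.

```python
import re

# Name rule from the Avro specification: first character in [A-Za-z_],
# every following character in [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_spec_compliant_name(name: str) -> bool:
    """Return True if `name` satisfies the specification's name rule.

    Hypothetical helper for illustration; not an Avro library function.
    """
    return AVRO_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # "pr\u00e9nom" is "prénom" written with an explicit escape.
    for candidate in ["first_name", "pr\u00e9nom", "2nd_field"]:
        print(candidate, is_spec_compliant_name(candidate))
```

A name that fails this check may still be accepted by a particular
implementation, but it is outside what the specification promises.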

Do you have any plans to support Unicode field names in the future?
If yes, is there an estimated timeline for general availability?
If not, could you suggest a way to support Unicode field names in an Avro
schema? Has anyone else solved a similar problem in a different way?

Thank you,
Nilesh

Re: Supporting unicode named fields

Posted by Christophe Le Saëc <ch...@gmail.com>.
The current Java code already accepts Unicode characters; I added a unit test
to show it
<https://github.com/apache/avro/blob/8969bc15174b96ca17b8b264b596e0d3de7c9436/lang/java/avro/src/test/java/org/apache/avro/TestSchema.java#L72>.
It accepts Japanese and Chinese names, which matches our needs, but since this
is not official, our code converts them to unintelligible names, which causes
issues (even though we keep the original name in a property of the schema field).
For Rust, it would imply this change
<https://github.com/apache/avro/blob/8969bc15174b96ca17b8b264b596e0d3de7c9436/lang/rust/avro/src/schema.rs#L1283-L1293>
and for C, I made this PR <https://github.com/apache/avro/pull/1798>.
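
The kind of workaround described above (converting a Unicode name to a
spec-compliant one while keeping the original in a field property) might look
roughly like the sketch below. Everything here is an assumption for
illustration: the _uXXXX escaping scheme, the helper names, and the
"originalName" property are hypothetical and are not taken from the poster's
code or from any Avro library. The escaping is not collision-free, which is one
reason to keep the original name alongside the generated one.

```python
import json
import string

# Characters the specification allows in a name (after the first character).
ALLOWED = set(string.ascii_letters + string.digits + "_")


def to_compliant_name(original: str) -> str:
    """Escape every character outside [A-Za-z0-9_] as _uXXXX (hex code point).

    Hypothetical scheme: readable enough to debug, but not guaranteed unique.
    """
    escaped = "".join(
        ch if ch in ALLOWED else f"_u{ord(ch):04X}" for ch in original
    )
    # A name must not be empty or start with a digit.
    if not escaped or escaped[0].isdigit():
        escaped = "_" + escaped
    return escaped


def make_field(original: str, avro_type: str) -> dict:
    """Build a field dict; keep the original name in a custom property."""
    field = {"name": to_compliant_name(original), "type": avro_type}
    if field["name"] != original:
        field["originalName"] = original  # hypothetical custom property
    return field


if __name__ == "__main__":
    # "pr\u00e9nom" is "prénom" written with an explicit escape.
    print(json.dumps(make_field("pr\u00e9nom", "string"), ensure_ascii=False))
```

The generated names are hard to read, which is the drawback the message above
points out.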

For the Java implementation, changing the documentation wouldn't be a breaking
change, but adapting the code to strictly conform to the documentation would be.

This is why I proposed this JIRA
<https://issues.apache.org/jira/browse/AVRO-3532> to broaden the set of
accepted names and give Avro more possibilities.

On Thu, 26 May 2022 at 23:32, Nilesh Yadav <ni...@google.com.invalid> wrote:

> GCP storage\analytics support unicode characters in column name but Avro
> which is used for message transfer does not. I'm trying to bridge the gap.
> Which is the reason I'm looking for unicode support in Avro schema.
>
>
> On Mon, May 16, 2022 at 12:07 PM Ryan Skraba <ry...@skraba.com> wrote:
>
> > Hello,
> >
> > At this point, changing the naming rules of the specification would be
> > a pretty significant breaking change, and I don't think it's likely to
> > happen without a compelling champion that feels strongly about the
> > issue!  As far as I know, this is
> >
> > Using non-compliant names *might* work with some implementations, but
> > there's no guarantee that that schema would be interoperable between
> > Avro versions and languages.  This is true regardless of how the
> > schema was generated, unfortunately, including from Avro IDL... I
> > don't believe that the AVDL example above would interoperate with the
> > Python SDK.
> >
> > In my experience, the usual reason I've encountered for wanting to
> > have unicode identifiers is have better, language-specific names or
> > fields like "prénom".  As an alternative, in practice, this can be
> > accomplished by adding your own custom JSON property to the type or
> > field name, like "label" or "display.name" (or by reusing the "doc"
> > field) for the human-readable internationalized name.  This technique
> > has the advantage, as well, of potentially supporting multiple
> > languages with different properties, or allowing you to rename the
> > field without affecting the canonical schema.  The disadvantage is
> > that none of this is provided for you...
> >
> > Is there a specific use case that you're looking to support?
> >
> > Ryan
> >
> > On Tue, May 10, 2022 at 6:55 AM Oscar Westra van Holthe - Kind
> > <os...@westravanholthe.nl> wrote:
> > >
> > > On  Mon 9 May 2022 23:27, Zoltan Csizmadia <zo...@apache.org> wrote:
> > >
> > > > Here are some protocol definition examples used for testing. They are
> > > > not schemas, however it should work the same:
> > > >
> > > > https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/input/unicode.avdl
> > > >
> > > > https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/output/unicode.avpr
> > >
> > >
> > > As an aside, there are 2 PRs (#1588 [1] & #1589 [2]) for an ANTLR based
> > > grammar that can also support a schema file equivalent syntax.
> > >
> > > It won't help you now, but it may be something to keep an eye on.
> > >
> > > Kind regards,
> > > Oscar
> > >
> > >
> > > [1] https://github.com/apache/avro/pull/1588
> > > [2] https://github.com/apache/avro/pull/1589
> > >
> > > --
> > > Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
> >
>

Re: Supporting unicode named fields

Posted by Nilesh Yadav <ni...@google.com.INVALID>.
GCP storage/analytics products support Unicode characters in column names, but
Avro, which is used for message transfer, does not. I'm trying to bridge that
gap, which is why I'm looking for Unicode support in Avro schemas.


On Mon, May 16, 2022 at 12:07 PM Ryan Skraba <ry...@skraba.com> wrote:

> Hello,
>
> At this point, changing the naming rules of the specification would be
> a pretty significant breaking change, and I don't think it's likely to
> happen without a compelling champion that feels strongly about the
> issue!  As far as I know, this is
>
> Using non-compliant names *might* work with some implementations, but
> there's no guarantee that that schema would be interoperable between
> Avro versions and languages.  This is true regardless of how the
> schema was generated, unfortunately, including from Avro IDL... I
> don't believe that the AVDL example above would interoperate with the
> Python SDK.
>
> In my experience, the usual reason I've encountered for wanting to
> have unicode identifiers is have better, language-specific names or
> fields like "prénom".  As an alternative, in practice, this can be
> accomplished by adding your own custom JSON property to the type or
> field name, like "label" or "display.name" (or by reusing the "doc"
> field) for the human-readable internationalized name.  This technique
> has the advantage, as well, of potentially supporting multiple
> languages with different properties, or allowing you to rename the
> field without affecting the canonical schema.  The disadvantage is
> that none of this is provided for you...
>
> Is there a specific use case that you're looking to support?
>
> Ryan
>
> On Tue, May 10, 2022 at 6:55 AM Oscar Westra van Holthe - Kind
> <os...@westravanholthe.nl> wrote:
> >
> > On  Mon 9 May 2022 23:27, Zoltan Csizmadia <zo...@apache.org> wrote:
> >
> > > Here are some protocol definition examples used for testing. They are
> > > not schemas, however it should work the same:
> > >
> > > https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/input/unicode.avdl
> > >
> > > https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/output/unicode.avpr
> >
> >
> > As an aside, there are 2 PRs (#1588 [1] & #1589 [2]) for an ANTLR based
> > grammar that can also support a schema file equivalent syntax.
> >
> > It won't help you now, but it may be something to keep an eye on.
> >
> > Kind regards,
> > Oscar
> >
> >
> > [1] https://github.com/apache/avro/pull/1588
> > [2] https://github.com/apache/avro/pull/1589
> >
> > --
> > Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
>

Re: Supporting unicode named fields

Posted by Ryan Skraba <ry...@skraba.com>.
Hello,

At this point, changing the naming rules of the specification would be
a pretty significant breaking change, and I don't think it's likely to
happen without a compelling champion that feels strongly about the
issue!  As far as I know, this is

Using non-compliant names *might* work with some implementations, but
there's no guarantee that that schema would be interoperable between
Avro versions and languages.  This is true regardless of how the
schema was generated, unfortunately, including from Avro IDL... I
don't believe that the AVDL example above would interoperate with the
Python SDK.

In my experience, the usual reason for wanting Unicode identifiers is to
have better, language-specific names for fields, like "prénom".  As an
alternative, in practice, this can be accomplished by adding your own
custom JSON property to the type or field, like "label" or
"display.name" (or by reusing the "doc" field) for the human-readable
internationalized name.  This technique has the added advantage of
potentially supporting multiple languages with different properties, or
allowing you to rename the field without affecting the canonical
schema.  The disadvantage is that none of this is provided for you...
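
A minimal sketch of the custom-property approach described above, using only
the Python standard library to read the schema JSON. The per-language property
names ("label.fr", "label.en") and the display_name helper are assumptions made
for illustration; the message only suggests names such as "label" or
"display.name", and nothing here is provided by Avro itself.

```python
import json

# The field keeps a spec-compliant name; human-readable labels live in
# custom properties. The "label.<lang>" convention is hypothetical.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "first_name", "type": "string",
     "label.fr": "prénom", "label.en": "first name"},
    {"name": "age", "type": "int"}
  ]
}
"""


def display_name(field: dict, lang: str) -> str:
    """Return the label for `lang`, falling back to the field's real name."""
    return field.get(f"label.{lang}", field["name"])


schema = json.loads(SCHEMA_JSON)

if __name__ == "__main__":
    for field in schema["fields"]:
        print(field["name"], "->", display_name(field, "fr"))
```

Because the labels are ordinary extra attributes, they can be changed or
extended with more languages without touching the field names that readers and
writers rely on.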

Is there a specific use case that you're looking to support?

Ryan

On Tue, May 10, 2022 at 6:55 AM Oscar Westra van Holthe - Kind
<os...@westravanholthe.nl> wrote:
>
> On  Mon 9 May 2022 23:27, Zoltan Csizmadia <zo...@apache.org> wrote:
>
> > Here are some protocol definition examples used for testing. They are not
> > schemas, however it should work the same:
> >
> >
> > https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/input/unicode.avdl
> >
> > https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/output/unicode.avpr
>
>
> As an aside, there are 2 PRs (#1588 [1] & #1589 [2]) for an ANTLR based
> grammar that can also support a schema file equivalent syntax.
>
> It won't help you now, but it may be something to keep an eye on.
>
> Kind regards,
> Oscar
>
>
> [1] https://github.com/apache/avro/pull/1588
> [2] https://github.com/apache/avro/pull/1589
>
> --
> Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

Re: Supporting unicode named fields

Posted by Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>.
On  Mon 9 May 2022 23:27, Zoltan Csizmadia <zo...@apache.org> wrote:

> Here are some protocol definition examples used for testing. They are not
> schemas, however it should work the same:
>
>
> https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/input/unicode.avdl
>
> https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/output/unicode.avpr


As an aside, there are 2 PRs (#1588 [1] & #1589 [2]) for an ANTLR based
grammar that can also support a schema file equivalent syntax.

It won't help you now, but it may be something to keep an eye on.

Kind regards,
Oscar


[1] https://github.com/apache/avro/pull/1588
[2] https://github.com/apache/avro/pull/1589

-- 
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

Re: Supporting unicode named fields

Posted by Zoltan Csizmadia <zo...@apache.org>.
Here are some protocol definition examples used for testing. They are not schemas; however, they should work the same way:

https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/input/unicode.avdl
https://github.com/apache/avro/blob/master/lang/java/compiler/src/test/idl/output/unicode.avpr




Re: Supporting unicode named fields

Posted by Nilesh Yadav <ni...@google.com.INVALID>.
Hello,

Thank you for the information.

Could you please point me to "Java implementation of
IDL - unicode support"? And usage examples if possible?

Thank you,
Nilesh


On Thu, May 5, 2022 at 2:56 PM Zoltan Csizmadia <zo...@apache.org> wrote:

> The C# Avro implementation supports using Unicode characters in the field
> names, since C# supports unicode characters in identifiers (
> https://docs.microsoft.com/en-us/dotnet/csharp/fundamentals/coding-style/identifier-names
> ).
>
> On 2022/05/03 22:06:57 Nilesh Yadav wrote:
> > Hello,
> >
> > As per https://avro.apache.org/docs/current/spec.html#names;field names are
> > restricted to alphanumeric values in Avro schema.
> >
> > Do you have any plans to support unicode field names in future?
> > If yes, is there any estimated timeline for general availability?
> > If not, then could you suggest a way to support unicode field names in Avro
> > schema? Has anyone else solved the similar problem in a different way?
> >
> > Thank you,
> > Nilesh
> >
>

Re: Supporting unicode named fields

Posted by Zoltan Csizmadia <zo...@apache.org>.
The C# Avro implementation supports using Unicode characters in the field names, since C# supports unicode characters in identifiers (https://docs.microsoft.com/en-us/dotnet/csharp/fundamentals/coding-style/identifier-names).

On 2022/05/03 22:06:57 Nilesh Yadav wrote:
> Hello,
> 
> As per https://avro.apache.org/docs/current/spec.html#names;field names are
> restricted to alphanumeric values in Avro schema.
> 
> Do you have any plans to support unicode field names in future?
> If yes, is there any estimated timeline for general availability?
> If not, then could you suggest a way to support unicode field names in Avro
> schema? Has anyone else solved the similar problem in a different way?
> 
> Thank you,
> Nilesh
> 

Re: Supporting unicode named fields

Posted by Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>.
Hi Nilesh,

This issue has popped up in the past (
https://issues.apache.org/jira/browse/AVRO-1022), and one difficulty IMHO
is the varying quality of implementations of unicode identifiers across
programming languages.

For an implementation, please look at the current Java implementation of
IDL: it does support unicode (in the form of Java identifiers).

The downside of this, though, is that it is inherently unportable and likely
not interoperable with other languages.

Kind regards,
Oscar

-- 
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

On Wed, 4 May 2022 at 00:06, Nilesh Yadav <ni...@google.com.invalid> wrote:

> Hello,
>
> As per https://avro.apache.org/docs/current/spec.html#names;field names are
> restricted to alphanumeric values in Avro schema.
>
> Do you have any plans to support unicode field names in the future?
> If yes, is there any estimated timeline for general availability?
> If not, then could you suggest a way to support unicode field names in Avro
> schema? Has anyone else solved a similar problem in a different way?
>
> Thank you,
> Nilesh
>