Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2020/03/18 19:09:06 UTC

Special characters in Beam Schema field names

In Beam schemas we don't seem to have a well-defined policy around special
characters (like $.[]) in field names. There's never any explicit
validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
more natural . when concatenating field names in a nested select [1])

I think we should explicitly allow any special character (any valid UTF-8
character?) to be used in Beam schema field names. But in order to do this
we will need to provide solutions for some edge cases. To my knowledge
there are two problems that arise with some special characters in field
names:
1. They can't be mapped to language types (e.g. Java Classes, and
NamedTuples in python).
2. It can make field accesses ambiguous (i.e. does
`FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
with that exact name or a nested field?).

We already have some precedent for (1) - Beam SQL produces field names like
`$col1` for unaliased fields in query outputs, and this is allowed. If a
user wants to map a schema with a field like this to a POJO, they have to
first rename the incompatible field(s), or use an @SchemaFieldName
annotation to map the field name. I think these are reasonable solutions.
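
For the Python side of problem (1), a minimal sketch of the "rename the incompatible fields" approach. It uses only the standard library `collections.namedtuple` with `rename=True`; the field list and the `original_names` mapping are hypothetical illustrations, not Beam APIs.

```python
from collections import namedtuple

# Hypothetical schema field names; "$col1" mimics an unaliased Beam SQL column,
# and "class" is a Python keyword.
schema_field_names = ["$col1", "name", "class"]

# rename=True replaces names that are not valid identifiers (or are keywords)
# with positional names of the form _<index>.
Row = namedtuple("Row", schema_field_names, rename=True)

# Keep a mapping back to the original names so nothing is lost at the boundary.
original_names = dict(zip(Row._fields, schema_field_names))
```

The mapping plays the same role as @SchemaFieldName does for a POJO: the language type gets a legal name while the schema keeps the original one.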

We do not have a solution for (2) though. I think we should allow the use
of a backslash to escape characters that otherwise have special meaning for
FieldAccessDescriptors (based on [2] this is .[]{}*).
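
A rough sketch of what backslash escaping could look like, written in Python for brevity. The function name `split_field_path` is hypothetical, and the sketch only treats the dot as a separator; the other special characters ([]{}*) would need the same handling in the real grammar.

```python
from typing import List


def split_field_path(path: str) -> List[str]:
    """Split a dotted path into field names, honoring backslash escapes."""
    names: List[str] = []
    current: List[str] = []
    chars = iter(path)
    for ch in chars:
        if ch == "\\":
            # The character after a backslash is taken literally.
            escaped = next(chars, None)
            if escaped is None:
                raise ValueError("dangling backslash at end of path")
            current.append(escaped)
        elif ch == ".":
            # An unescaped dot separates a parent from a nested field.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
    names.append("".join(current))
    return names
```

With this, "parent.child" is a nested access, while "parent\.child" names a single field that contains a dot.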

Does anyone have any objection to this proposal, or is there anything I'm
overlooking? If not, I'm happy to take the task to implement the escape
character change.

Brian

[1]
https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4

Re: Special characters in Beam Schema field names

Posted by Robert Bradshaw <ro...@google.com>.

On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> I favor allowing field names to contain any unicode character, semantically. I do not think encoding is a semantic property of a field name (or even a string in a particular programming language) so UTF-8 doesn't need to be part of it. Inputting a field name in a particular context is separable from what characters can occur in the name, and the encoding of a string when it is turned into bytes is orthogonal to what characters are in the string.

+1, I meant to say Unicode, not UTF-8.

> SQL has a good convention to allow any character (backticks, as you demonstrated), as do most unix shells / filesystems. Note again that backtick and backslash conventions are how to _input_ a field name, not the characters actually in the field name. Your example of "parent.child" is a good one, too: the dot is not part of any field name, but just a way to input a list of names to construct a path. And your later example of using backticks around the dot works perfectly if you want a dot in the field name. This is a solved problem IMO, and we just have to take a solution off the shelf.
>
> Since schemas are pretty closely related with SQL, how about just using these particular SQL conventions? I like backticks and I also like backslashes.

Makes sense to me.

> For debuggability, we need to always print a properly unparsed identifier, not just print the field name as a string. So in the example of "we use _ rather than the more natural . when concatenating field names in a nested select" I would prefer to just use a dot, for clarity, and when printing it the position of the backticks will make it totally clear that the dot is not a field separator.

If we're generating *new* field names, I'd just as soon a convention
that generates non-special ones just for ease of use.
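
To make the two options concrete, a small Python sketch under stated assumptions: `unparse_path` prints a path with backticks only where needed, and `flattened_name` is the underscore-joining convention for generated names. Both function names are hypothetical, the "plain identifier" pattern is an assumption, and doubling a backtick to escape it inside quotes is just one possible convention.

```python
import re

# Assumed definition of a name that needs no quoting.
_PLAIN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_name(name):
    """Return the name as-is if plain, otherwise wrapped in backticks."""
    if _PLAIN.fullmatch(name):
        return name
    # Assumption: a literal backtick inside a quoted name is doubled.
    return "`" + name.replace("`", "``") + "`"


def unparse_path(names):
    """Render a list of field names as an unambiguous dotted path."""
    return ".".join(quote_name(n) for n in names)


def flattened_name(names):
    """Generate a new, non-special field name for a nested selection."""
    return "_".join(names)
```

Here `unparse_path(["parent.child"])` prints with backticks, so it cannot be confused with the nested path `unparse_path(["parent", "child"])`.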

> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <ro...@google.com> wrote:
>>
>> Given the flexibility of SQL, and the diversity of upstream systems,
>> I'd lean on the side of being maximally flexible and saying a field
>> name is a utf-8 string (including whitespace?), but special characters
>> may require quoting and/or not allow some convenience (e.g. POJO
>> creation).
>>
>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bh...@google.com> wrote:
>> >
>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted) field names to contain any character. So it's currently possible for SqlTransform to produce schemas with field names containing dots and other special characters, which we can't handle properly outside of the SQL context. If we do want to have some special characters, I think we should validate that schemas don't contain them, which would limit what you can output with SqlTransform, for better or worse.
>> >
>> > > We impose limits on Beam field names, and have automatic ways of escaping or translating characters that don't match. When the Beam field name does not match the field name in other systems, we use field Options to store the "original" name so it is not lost. That way we don't have to rely on the field names always being textually identical.
>> >
>> > A good use of the new Options feature :)
>> > One of the problems I would like this thread to solve though is the possibility of using schemas and rows for the Options themselves (discussed extensively in Alex's PR [3]). If we use Options to handle special characters, we would need options on the schema of the Options (I think I said that right?) to solve it in that context.
>> >
>> > > I'm all for initial strict naming rules, that we can relax as we learn more. Additional restrictions tend to require major version changes to accommodate the backwards incompatibility.
>> >
>> > I think it may be too late to be strict though, since schemas came from SQL, and both supported SQL dialects are very permissive here. At this point it seems easier to be very permissive within Beam, and provide ways to deal with incompatibilities at the boundaries (e.g. SDKs providing ways to translate fields for language types, raising errors when a schema is incompatible for some IO, etc).
>> >
>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>> > [2] https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>> > [3] https://github.com/apache/beam/pull/10413
>> >
>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com> wrote:
>> >>
>> >> I'm all for initial strict naming rules, that we can relax as we learn more. Additional restrictions tend to require major version changes to accommodate the backwards incompatibility.
>> >>
>> >> I'd rather community provide compelling use cases for relaxations than us speculating what could be useful in the outset.
>> >>
>> >> That said, it might be a touch late for schema fields...
>> >>
>> >> It's definitely my Go Bias showing but a sensible start is to not allow fields to start with a digit. This matches most C derived languages (which includes all our SDK languages at present, except maybe for Scio...).
>> >>
>> >>
>> >>
>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>> >>>
>> >>> For completeness, here's another proposal.
>> >>>
>> >>> We impose limits on Beam field names, and have automatic ways of escaping or translating characters that don't match. When the Beam field name does not match the field name in other systems, we use field Options to store the "original" name so it is not lost. That way we don't have to rely on the field names always being textually identical.
>> >>>
>> >>> Downside here: any time we automatically munge a field name, we make select statements a bit more awkward, as the user has to put the munged field name into the select.
>> >>>
>> >>> Reuven
>> >>>
>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com> wrote:
>> >>>>>>
>> >>>>>> In Beam schemas we don't seem to have a well-defined policy around special characters (like $.[]) in field names. There's never any explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather than the more natural . when concatenating field names in a nested select [1])
>> >>>>>>
>> >>>>>> I think we should explicitly allow any special character (any valid UTF-8 character?) to be used in Beam schema field names. But in order to do this we will need to provide solutions for some edge cases. To my knowledge there are two problems that arise with some special characters in field names:
>> >>>>>>
>> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and NamedTuples in python).
>> >>>>>
>> >>>>>
>> >>>>> We already have this problem - i.e. if you name a schema field to be int, or any other reserved string. We should disambiguate.
>> >>>>
>> >>>> True, but as I point out below we have ways to deal with this problem. (2) is really the problem we need to solve.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> 2. It can make field accesses ambiguous (i.e. does `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field with that exact name or a nested field?).
>> >>>>>
>> >>>>>
>> >>>>> I still think that we should reserve _some_ special characters. I'm not sure what the use is for allowing any character to be used.
>> >>>>
>> >>>> The use would be ensuring that we don't run into compatibility issues when mapping schemas from other systems that have made different choices about which characters are special.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> We already have some precedent for (1) - Beam SQL produces field names like `$col1` for unaliased fields in query outputs, and this is allowed. If a user wants to map a schema with a field like this to a POJO, they have to first rename the incompatible field(s), or use an @SchemaFieldName annotation to map the field name. I think these are reasonable solutions.
>> >>>>>>
>> >>>>>> We do not have a solution for (2) though. I think we should allow the use of a backslash to escape characters that otherwise have special meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>> >>>>
>> >>>> I think the SQL way of handling this is to require a field name to be wrapped in some way when it contains special characters, e.g. "`some.parent.field`.`some.child.field`". We could consider that as well.
>> >>>>>>
>> >>>>>>
>> >>>>>> Does anyone have any objection to this proposal, or is there anything I'm overlooking? If not, I'm happy to take the task to implement the escape character change.
>> >>>>>>
>> >>>>>> Brian
>> >>>>>>
>> >>>>>> [1] https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>> >>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4

Re: Special characters in Beam Schema field names

Posted by Robert Burke <ro...@frantil.com>.

Ah well. This shouldn't present a problem for implementation in Go at
least, with the intent of using struct field tags. By the spec
(https://golang.org/ref/spec#Struct_types), tags are string literals
(https://golang.org/ref/spec#String_literals) and, by convention, are
space-separated key:"value" pairs, so we can specify to our heart's content
within that if users want to hardcode complex SQL result columns explicitly.

(No formal design exists for Beam Schemas in Go just yet, though I'll
produce something in the coming months. Collaboration welcome, of course!)

On Thu, Mar 19, 2020, 5:06 PM Brian Hulette <bh...@google.com> wrote:

> I'm +1 on using the SQL (quoting) convention to handle special characters
> when inputting a field name, rather than an escape character.
>
> On Thu, Mar 19, 2020 at 2:24 PM Reuven Lax <re...@google.com> wrote:
>
>> This sounds fine. We'd have to make our parser for Select clauses be a
>> bit smarter, but it shouldn't be too difficult to extend the grammar to
>> handle escape characters.
>>
>> On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> I favor allowing field names to contain any unicode character,
>>> semantically. I do not think encoding is a semantic property of a field
>>> name (or even a string in a particular programming language) so UTF-8
>>> doesn't need to be part of it. Inputting a field name in a particular
>>> context is separable from what characters can occur in the name, and the
>>> encoding of a string when it is turned into bytes is orthogonal to what
>>> characters are in the string.
>>>
>>> SQL has a good convention to allow any character (backticks, as you
>>> demonstrated), as do most unix shells / filesystems. Note again that
>>> backtick and backslash conventions are how to _input_ a field name, not the
>>> characters actually in the field name. Your example of "parent.child" is a
>>> good one, too: the dot is not part of any field name, but just a way to
>>> input a list of names to construct a path. And your later example of using
>>> backticks around the dot works perfectly if you want a dot in the field
>>> name. This is a solved problem IMO, and we just have to take a solution off
>>> the shelf.
>>>
>>> Since schemas are pretty closely related with SQL, how about just using
>>> these particular SQL conventions? I like backticks and I also like
>>> backslashes.
>>>
>>> For debuggability, we need to always print a properly unparsed
>>> identifier, not just print the field name as a string. So in the example of
>>> "we use _ rather than the more natural . when concatenating field names in
>>> a nested select" I would prefer to just use a dot, for clarity, and when
>>> printing it the position of the backticks will make it totally clear that
>>> the dot is not a field separator.
>>>
>>> Kenn
>>>
>>> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> Given the flexibility of SQL, and the diversity of upstream systems,
>>>> I'd lean on the side of being maximally flexible and saying a field
>>>> name is a utf-8 string (including whitespace?), but special characters
>>>> may require quoting and/or not allow some convenience (e.g. POJO
>>>> creation).
>>>>
>>>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>> >
>>>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow
>>>> (quoted) field names to contain any character. So it's currently possible
>>>> for SqlTransform to produce schemas with field names containing dots and
>>>> other special characters, which we can't handle properly outside of the SQL
>>>> context. If we do want to have some special characters, I think we should
>>>> validate that schemas don't contain them, which would limit what you can
>>>> output with SqlTransform, for better or worse.
>>>> >
>>>> > > We impose limits on Beam field names, and have automatic ways of
>>>> escaping or translating characters that don't match. When the Beam field
>>>> name does not match the field name in other systems, we use field Options
>>>> to store the "original" name so it is not lost. That way we don't have to
>>>> rely on the field names always being textually identical.
>>>> >
>>>> > A good use of the new Options feature :)
>>>> > One of the problems I would like this thread to solve though is the
>>>> possibility of using schemas and rows for the Options themselves (discussed
>>>> extensively in Alex's PR [3]). If we use Options to handle special
>>>> characters, we would need options on the schema of the Options (I think I
>>>> said that right?) to solve it in that context.
>>>> >
>>>> > > I'm all for initial strict naming rules, that we can relax as we
>>>> learn more. Additional restrictions tend to require major version changes
>>>> to accommodate the backwards incompatibility.
>>>> >
>>>> > I think it may be too late to be strict though, since schemas came
>>>> from SQL, and both supported SQL dialects are very permissive here. At this
>>>> point it seems easier to be very permissive within Beam, and provide ways
>>>> to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
>>>> to translate fields for language types, raising errors when a schema is
>>>> incompatible for some IO, etc).
>>>> >
>>>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>>>> > [2]
>>>> https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>>>> > [3] https://github.com/apache/beam/pull/10413
>>>> >
>>>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com>
>>>> wrote:
>>>> >>
>>>> >> I'm all for initial strict naming rules, that we can relax as we
>>>> learn more. Additional restrictions tend to require major version changes
>>>> to accommodate the backwards incompatibility.
>>>> >>
>>>> >> I'd rather community provide compelling use cases for relaxations
>>>> than us speculating what could be useful in the outset.
>>>> >>
>>>> >> That said, it might be a touch late for schema fields...
>>>> >>
>>>> >> It's definitely my Go Bias showing but a sensible start is to not
>>>> allow fields to start with a digit. This matches most C derived languages
>>>> (which includes all our SDK languages at present, except maybe for Scio...).
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>>>> >>>
>>>> >>> For completeness, here's another proposal.
>>>> >>>
>>>> >>> We impose limits on Beam field names, and have automatic ways of
>>>> escaping or translating characters that don't match. When the Beam field
>>>> name does not match the field name in other systems, we use field Options
>>>> to store the "original" name so it is not lost. That way we don't have to
>>>> rely on the field names always being textually identical.
>>>> >>>
>>>> >>> Downside here: any time we automatically munge a field name, we
>>>> make select statements a bit more awkward, as the user has to put the
>>>> munged field name into the select.
>>>> >>>
>>>> >>> Reuven
>>>> >>>
>>>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <
>>>> bhulette@google.com> wrote:
>>>> >>>>>>
>>>> >>>>>> In Beam schemas we don't seem to have a well-defined policy
>>>> around special characters (like $.[]) in field names. There's never any
>>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather
>>>> than the more natural . when concatenating field names in a nested select
>>>> [1])
>>>> >>>>>>
>>>> >>>>>> I think we should explicitly allow any special character (any
>>>> valid UTF-8 character?) to be used in Beam schema field names. But in order
>>>> to do this we will need to provide solutions for some edge cases. To my
>>>> knowledge there are two problems that arise with some special characters in
>>>> field names:
>>>> >>>>>>
>>>> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes,
>>>> and NamedTuples in python).
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> We already have this problem - i.e. if you name a schema field to
>>>> be int, or any other reserved string. We should disambiguate.
>>>> >>>>
>>>> >>>> True, but as I point out below we have ways to deal with this
>>>> problem. (2) is really the problem we need to solve.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>> 2. It can make field accesses ambiguous (i.e. does
>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>>> with that exact name or a nested field?).
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> I still think that we should reserve _some_ special characters.
>>>> I'm not sure what the use is for allowing any character to be used.
>>>> >>>>
>>>> >>>> The use would be ensuring that we don't run into compatibility
>>>> issues when mapping schemas from other systems that have made different
>>>> choices about which characters are special.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>> We already have some precedent for (1) - Beam SQL produces field
>>>> names like `$col1` for unaliased fields in query outputs, and this is
>>>> allowed. If a user wants to map a schema with a field like this to a POJO,
>>>> they have to first rename the incompatible field(s), or use an
>>>> @SchemaFieldName annotation to map the field name. I think these are
>>>> reasonable solutions.
>>>> >>>>>>
>>>> >>>>>> We do not have a solution for (2) though. I think we should
>>>> allow the use of a backslash to escape characters that otherwise have
>>>> special meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>> >>>>
>>>> >>>> I think the SQL way of handling this is to require a field name to
>>>> be wrapped in some way when it contains special characters, e.g.
>>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Does anyone have any objection to this proposal, or is there
>>>> anything I'm overlooking? If not, I'm happy to take the task to implement
>>>> the escape character change.
>>>> >>>>>>
>>>> >>>>>> Brian
>>>> >>>>>>
>>>> >>>>>> [1]
>>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>> >>>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>>
>>>

Re: Special characters in Beam Schema field names

Posted by Brian Hulette <bh...@google.com>.

I'm +1 on using the SQL (quoting) convention to handle special characters
when inputting a field name, rather than an escape character.
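
A minimal Python sketch of how the quoting convention could be parsed, assuming backticks delimit a name, dots outside backticks separate nesting levels, and a doubled backtick inside quotes stands for a literal backtick. The function name `parse_quoted_path` and the doubled-backtick rule are assumptions for illustration, not the existing FieldAccessDescriptor grammar.

```python
from typing import List


def parse_quoted_path(path: str) -> List[str]:
    """Split a path into field names; dots inside backticks are literal."""
    names: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if in_quotes:
            if ch == "`":
                if i + 1 < len(path) and path[i + 1] == "`":
                    # Doubled backtick: a literal backtick in the name.
                    current.append("`")
                    i += 1
                else:
                    in_quotes = False
            else:
                current.append(ch)
        elif ch == "`":
            in_quotes = True
        elif ch == ".":
            # Unquoted dot: end of one field name, start of the next.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated backtick quote")
    names.append("".join(current))
    return names
```

The quoted form keeps the special characters out of the name itself: the backticks are only how the name is input, not part of it.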

On Thu, Mar 19, 2020 at 2:24 PM Reuven Lax <re...@google.com> wrote:

> This sounds fine. We'd have to make our parser for Select clauses be a bit
> smarter, but it shouldn't be too difficult to extend the grammar to handle
> escape characters.
>
> On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I favor allowing field names to contain any unicode character,
>> semantically. I do not think encoding is a semantic property of a field
>> name (or even a string in a particular programming language) so UTF-8
>> doesn't need to be part of it. Inputting a field name in a particular
>> context is separable from what characters can occur in the name, and the
>> encoding of a string when it is turned into bytes is orthogonal to what
>> characters are in the string.
>>
>> SQL has a good convention to allow any character (backticks, as you
>> demonstrated), as do most unix shells / filesystems. Note again that
>> backtick and backslash conventions are how to _input_ a field name, not the
>> characters actually in the field name. Your example of "parent.child" is a
>> good one, too: the dot is not part of any field name, but just a way to
>> input a list of names to construct a path. And your later example of using
>> backticks around the dot works perfectly if you want a dot in the field
>> name. This is a solved problem IMO, and we just have to take a solution off
>> the shelf.
>>
>> Since schemas are pretty closely related with SQL, how about just using
>> these particular SQL conventions? I like backticks and I also like
>> backslashes.
>>
>> For debuggability, we need to always print a properly unparsed
>> identifier, not just print the field name as a string. So in the example of
>> "we use _ rather than the more natural . when concatenating field names in
>> a nested select" I would prefer to just use a dot, for clarity, and when
>> printing it the position of the backticks will make it totally clear that
>> the dot is not a field separator.
>>
>> Kenn
>>
>> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> Given the flexibility of SQL, and the diversity of upstream systems,
>>> I'd lean on the side of being maximally flexible and saying a field
>>> name is a utf-8 string (including whitespace?), but special characters
>>> may require quoting and/or not allow some convenience (e.g. POJO
>>> creation).
>>>
>>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bh...@google.com>
>>> wrote:
>>> >
>>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow
>>> (quoted) field names to contain any character. So it's currently possible
>>> for SqlTransform to produce schemas with field names containing dots and
>>> other special characters, which we can't handle properly outside of the SQL
>>> context. If we do want to have some special characters, I think we should
>>> validate that schemas don't contain them, which would limit what you can
>>> output with SqlTransform, for better or worse.
>>> >
>>> > > We impose limits on Beam field names, and have automatic ways of
>>> escaping or translating characters that don't match. When the Beam field
>>> name does not match the field name in other systems, we use field Options
>>> to store the "original" name so it is not lost. That way we don't have to
>>> rely on the field names always being textually identical.
>>> >
>>> > A good use of the new Options feature :)
>>> > One of the problems I would like this thread to solve though is the
>>> possibility of using schemas and rows for the Options themselves (discussed
>>> extensively in Alex's PR [3]). If we use Options to handle special
>>> characters, we would need options on the schema of the Options (I think I
>>> said that right?) to solve it in that context.
>>> >
>>> > > I'm all for initial strict naming rules, that we can relax as we
>>> learn more. Additional restrictions tend to require major version changes
>>> to accommodate the backwards incompatibility.
>>> >
>>> > I think it may be too late to be strict though, since schemas came
>>> from SQL, and both supported SQL dialects are very permissive here. At this
>>> point it seems easier to be very permissive within Beam, and provide ways
>>> to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
>>> to translate fields for language types, raising errors when a schema is
>>> incompatible for some IO, etc).
>>> >
>>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>>> > [2]
>>> https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>>> > [3] https://github.com/apache/beam/pull/10413
>>> >
>>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com>
>>> wrote:
>>> >>
>>> >> I'm all for initial strict naming rules, that we can relax as we
>>> learn more. Additional restrictions tend to require major version changes
>>> to accommodate the backwards incompatibility.
>>> >>
>>> >> I'd rather community provide compelling use cases for relaxations
>>> than us speculating what could be useful in the outset.
>>> >>
>>> >> That said, it might be a touch late for schema fields...
>>> >>
>>> >> It's definitely my Go Bias showing but a sensible start is to not
>>> allow fields to start with a digit. This matches most C derived languages
>>> (which includes all our SDK languages at present, except maybe for Scio...).
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>>> >>>
>>> >>> For completeness, here's another proposal.
>>> >>>
>>> >>> We impose limits on Beam field names, and have automatic ways of
>>> escaping or translating characters that don't match. When the Beam field
>>> name does not match the field name in other systems, we use field Options
>>> to store the "original" name so it is not lost. That way we don't have to
>>> rely on the field names always being textually identical.
>>> >>>
>>> >>> Downside here: any time we automatically munge a field name, we make
>>> select statements a bit more awkward, as the user has to put the munged
>>> field name into the select.
>>> >>>
>>> >>> Reuven
>>> >>>
>>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com>
>>> wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com>
>>> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <
>>> bhulette@google.com> wrote:
>>> >>>>>>
>>> >>>>>> In Beam schemas we don't seem to have a well-defined policy
>>> around special characters (like $.[]) in field names. There's never any
>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather
>>> than the more natural . when concatenating field names in a nested select
>>> [1])
>>> >>>>>>
>>> >>>>>> I think we should explicitly allow any special character (any
>>> valid UTF-8 character?) to be used in Beam schema field names. But in order
>>> to do this we will need to provide solutions for some edge cases. To my
>>> knowledge there are two problems that arise with some special characters in
>>> field names:
>>> >>>>>>
>>> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and
>>> NamedTuples in python).
>>> >>>>>
>>> >>>>>
>>> >>>>> We already have this problem - i.e. if you name a schema field to
>>> be int, or any other reserved string. We should disambiguate.
>>> >>>>
>>> >>>> True, but as I point out below we have ways to deal with this
>>> problem. (2) is really the problem we need to solve.
>>> >>>>>
>>> >>>>>
>>> >>>>>>
>>> >>>>>> 2. It can make field accesses ambiguous (i.e. does
>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>> with that exact name or a nested field?).
>>> >>>>>
>>> >>>>>
>>> >>>>> I still think that we should reserve _some_ special characters.
>>> I'm not sure what the use is for allowing any character to be used.
>>> >>>>
>>> >>>> The use would be ensuring that we don't run into compatibility
>>> issues when mapping schemas from other systems that have made different
>>> choices about which characters are special.
>>> >>>>>
>>> >>>>>
>>> >>>>>>
>>> >>>>>> We already have some precedent for (1) - Beam SQL produces field
>>> names like `$col1` for unaliased fields in query outputs, and this is
>>> allowed. If a user wants to map a schema with a field like this to a POJO,
>>> they have to first rename the incompatible field(s), or use an
>>> @SchemaFieldName annotation to map the field name. I think these are
>>> reasonable solutions.
>>> >>>>>>
>>> >>>>>> We do not have a solution for (2) though. I think we should allow
>>> the use of a backslash to escape characters that otherwise have special
>>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>> >>>>
>>> >>>> I think the SQL way of handling this is to require a field name to
>>> be wrapped in some way when it contains special characters, e.g.
>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Does anyone have any objection to this proposal, or is there
>>> anything I'm overlooking? If not, I'm happy to take the task to implement
>>> the escape character change.
>>> >>>>>>
>>> >>>>>> Brian
>>> >>>>>>
>>> >>>>>> [1]
>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>> >>>>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>
>>

Re: Special characters in Beam Schema field names

Posted by Reuven Lax <re...@google.com>.

This sounds fine. We'd have to make our parser for Select clauses be a bit
smarter, but it shouldn't be too difficult to extend the grammar to handle
escape characters.

On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <ke...@apache.org> wrote:

> I favor allowing field names to contain any unicode character,
> semantically. I do not think encoding is a semantic property of a field
> name (or even a string in a particular programming language) so UTF-8
> doesn't need to be part of it. Inputting a field name in a particular
> context is separable from what characters can occur in the name, and the
> encoding of a string when it is turned into bytes is orthogonal to what
> characters are in the string.
>
> SQL has a good convention to allow any character (backticks, as you
> demonstrated), as do most unix shells / filesystems. Note again that
> backtick and backslash conventions are how to _input_ a field name, not the
> characters actually in the field name. Your example of "parent.child" is a
> good one, too: the dot is not part of any field name, but just a way to
> input a list of names to construct a path. And your later example of using
> backticks around the dot works perfectly if you want a dot in the field
> name. This is a solved problem IMO, and we just have to take a solution off
> the shelf.
>
> Since schemas are pretty closely related to SQL, how about just using
> these particular SQL conventions? I like backticks and I also like
> backslashes.
>
> For debuggability, we need to always print a properly unparsed
> identifier, not just print the field name as a string. So in the example of
> "we use _ rather than the more natural . when concatenating field names in
> a nested select" I would prefer to just use a dot, for clarity, and when
> printing it the position of the backticks will make it totally clear that
> the dot is not a field separator.
>
> Kenn
>
> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Given the flexibility of SQL, and the diversity of upstream systems,
>> I'd lean on the side of being maximally flexible and saying a field
>> name is a utf-8 string (including whitespace?), but special characters
>> may require quoting and/or not allow some convenience (e.g. POJO
>> creation).
>>
>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bh...@google.com>
>> wrote:
>> >
>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow
>> (quoted) field names to contain any character. So it's currently possible
>> for SqlTransform to produce schemas with field names containing dots and
>> other special characters, which we can't handle properly outside of the SQL
>> context. If we do want to have some special characters, I think we should
>> validate that schemas don't contain them, which would limit what you can
>> output with SqlTransform, for better or worse.
>> >
>> > > We impose limits on Beam field names, and have automatic ways of
>> escaping or translating characters that don't match. When the Beam field
>> name does not match the field name in other systems, we use field Options
>> to store the "original" name so it is not lost. That way we don't have to
>> rely on the field names always being textually identical.
>> >
>> > A good use of the new Options feature :)
>> > One of the problems I would like this thread to solve though is the
>> possibility of using schemas and rows for the Options themselves (discussed
>> extensively in Alex's PR [3]). If we use Options to handle special
>> characters, we would need options on the schema of the Options (I think I
>> said that right?) to solve it in that context.
>> >
>> > > I'm all for initial strict naming rules, that we can relax as we
>> learn more. Additional restrictions tend to require major version changes
>> to accommodate the backwards incompatibility.
>> >
>> > I think it may be too late to be strict though, since schemas came from
>> SQL, and both supported SQL dialects are very permissive here. At this
>> point it seems easier to be very permissive within Beam, and provide ways
>> to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
>> to translate fields for language types, raising errors when a schema is
>> incompatible for some IO, etc).
>> >
>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>> > [2]
>> https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>> > [3] https://github.com/apache/beam/pull/10413
>> >
>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com>
>> wrote:
>> >>
>> >> I'm all for initial strict naming rules, that we can relax as we learn
>> more. Additional restrictions tend to require major version changes to
>> accommodate the backwards incompatibility.
>> >>
>> >> I'd rather the community provide compelling use cases for relaxations than
>> us speculating about what could be useful at the outset.
>> >>
>> >> That said, it might be a touch late for schema fields...
>> >>
>> >> It's definitely my Go Bias showing but a sensible start is to not
>> allow fields to start with a digit. This matches most C derived languages
>> (which includes all our SDK languages at present, except maybe for Scio...).
>> >>
>> >>
>> >>
>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>> >>>
>> >>> For completeness, here's another proposal.
>> >>>
>> >>> We impose limits on Beam field names, and have automatic ways of
>> escaping or translating characters that don't match. When the Beam field
>> name does not match the field name in other systems, we use field Options
>> to store the "original" name so it is not lost. That way we don't have to
>> rely on the field names always being textually identical.
>> >>>
>> >>> Downside here: any time we automatically munge a field name, we make
>> select statements a bit more awkward, as the user has to put the munged
>> field name into the select.
>> >>>
>> >>> Reuven
>> >>>
>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com>
>> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com>
>> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com>
>> wrote:
>> >>>>>>
>> >>>>>> In Beam schemas we don't seem to have a well-defined policy around
>> special characters (like $.[]) in field names. There's never any explicit
>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
>> more natural . when concatenating field names in a nested select [1])
>> >>>>>>
>> >>>>>> I think we should explicitly allow any special character (any
>> valid UTF-8 character?) to be used in Beam schema field names. But in order
>> to do this we will need to provide solutions for some edge cases. To my
>> knowledge there are two problems that arise with some special characters in
>> field names:
>> >>>>>>
>> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and
>> NamedTuples in python).
>> >>>>>
>> >>>>>
>> >>>>> We already have this problem - i.e. if you name a schema field to
>> be int, or any other reserved string. We should disambiguate.
>> >>>>
>> >>>> True, but as I point out below we have ways to deal with this
>> problem. (2) is really the problem we need to solve.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> 2. It can make field accesses ambiguous (i.e. does
>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>> with that exact name or a nested field?).
>> >>>>>
>> >>>>>
>> >>>>> I still think that we should reserve _some_ special characters. I'm
>> not sure what the use is for allowing any character to be used.
>> >>>>
>> >>>> The use would be ensuring that we don't run into compatibility
>> issues when mapping schemas from other systems that have made different
>> choices about which characters are special.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> We already have some precedent for (1) - Beam SQL produces field
>> names like `$col1` for unaliased fields in query outputs, and this is
>> allowed. If a user wants to map a schema with a field like this to a POJO,
>> they have to first rename the incompatible field(s), or use an
>> @SchemaFieldName annotation to map the field name. I think these are
>> reasonable solutions.
>> >>>>>>
>> >>>>>> We do not have a solution for (2) though. I think we should allow
>> the use of a backslash to escape characters that otherwise have special
>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>> >>>>
>> >>>> I think the SQL way of handling this is to require a field name to
>> be wrapped in some way when it contains special characters, e.g.
>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>> >>>>>>
>> >>>>>>
>> >>>>>> Does anyone have any objection to this proposal, or is there
>> anything I'm overlooking? If not, I'm happy to take the task to implement
>> the escape character change.
>> >>>>>>
>> >>>>>> Brian
>> >>>>>>
>> >>>>>> [1]
>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>> >>>>>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>
>

Re: Special characters in Beam Schema field names

Posted by Kenneth Knowles <ke...@apache.org>.

I favor allowing field names to contain any unicode character,
semantically. I do not think encoding is a semantic property of a field
name (or even a string in a particular programming language) so UTF-8
doesn't need to be part of it. Inputting a field name in a particular
context is separable from what characters can occur in the name, and the
encoding of a string when it is turned into bytes is orthogonal to what
characters are in the string.

SQL has a good convention to allow any character (backticks, as you
demonstrated), as do most unix shells / filesystems. Note again that
backtick and backslash conventions are how to _input_ a field name, not the
characters actually in the field name. Your example of "parent.child" is a
good one, too: the dot is not part of any field name, but just a way to
input a list of names to construct a path. And your later example of using
backticks around the dot works perfectly if you want a dot in the field
name. This is a solved problem IMO, and we just have to take a solution off
the shelf.

Since schemas are pretty closely related to SQL, how about just using
these particular SQL conventions? I like backticks and I also like
backslashes.
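
To make the input-versus-name distinction concrete, here is a small Python
sketch of the backtick convention. The function `split_backtick_path` is
hypothetical, and treating a doubled backtick inside quotes as a literal
backtick is an assumption borrowed from some SQL dialects, not something
decided in this thread.

```python
def split_backtick_path(path):
    """Parse a dotted path where backticks quote a single field name.

    Dots inside backticks are part of the name; dots outside separate names.
    Assumption: a doubled backtick inside quotes is a literal backtick.
    """
    names, current, in_quotes, i = [], [], False, 0
    while i < len(path):
        ch = path[i]
        if ch == "`":
            if in_quotes and path[i + 1 : i + 2] == "`":
                current.append("`")  # doubled backtick: literal character
                i += 2
                continue
            in_quotes = not in_quotes  # opening or closing quote
        elif ch == "." and not in_quotes:
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated backtick in field path")
    names.append("".join(current))
    return names


print(split_backtick_path("`some.parent.field`.`some.child.field`"))
print(split_backtick_path("parent.child"))
```

The backticks never end up in the returned names; they only tell the parser
where a name starts and ends.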

For debuggability, we need to always print a properly unparsed
identifier, not just print the field name as a string. So in the example of
"we use _ rather than the more natural . when concatenating field names in
a nested select" I would prefer to just use a dot, for clarity, and when
printing it the position of the backticks will make it totally clear that
the dot is not a field separator.
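
A sketch of what "printing a properly unparsed identifier" could look like,
in Python. The name `unparse_field_path` and the rule for when quoting is
needed (anything that is not a plain ASCII identifier) are assumptions made
for illustration; doubling embedded backticks is likewise an assumed
convention.

```python
import re

# Assumption: names matching this pattern can be printed without quoting.
_PLAIN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def unparse_field_path(names):
    """Render a list of field names as an unambiguous dotted path."""
    parts = []
    for name in names:
        if _PLAIN.match(name):
            parts.append(name)
        else:
            # Quote the name and double any embedded backticks.
            parts.append("`" + name.replace("`", "``") + "`")
    return ".".join(parts)


print(unparse_field_path(["parent", "child"]))  # nested access
print(unparse_field_path(["parent.child"]))     # single field with a dot
```

The two calls print different strings, so a reader of a log or error message
can tell a nested field from a single field whose name contains a dot.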

Kenn

On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <ro...@google.com> wrote:

> Given the flexibility of SQL, and the diversity of upstream systems,
> I'd lean on the side of being maximally flexible and saying a field
> name is a utf-8 string (including whitespace?), but special characters
> may require quoting and/or not allow some convenience (e.g. POJO
> creation).
>
> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bh...@google.com> wrote:
> >
> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow
> (quoted) field names to contain any character. So it's currently possible
> for SqlTransform to produce schemas with field names containing dots and
> other special characters, which we can't handle properly outside of the SQL
> context. If we do want to have some special characters, I think we should
> validate that schemas don't contain them, which would limit what you can
> output with SqlTransform, for better or worse.
> >
> > > We impose limits on Beam field names, and have automatic ways of
> escaping or translating characters that don't match. When the Beam field
> name does not match the field name in other systems, we use field Options
> to store the "original" name so it is not lost. That way we don't have to
> rely on the field names always being textually identical.
> >
> > A good use of the new Options feature :)
> > One of the problems I would like this thread to solve though is the
> possibility of using schemas and rows for the Options themselves (discussed
> extensively in Alex's PR [3]). If we use Options to handle special
> characters, we would need options on the schema of the Options (I think I
> said that right?) to solve it in that context.
> >
> > > I'm all for initial strict naming rules, that we can relax as we learn
> more. Additional restrictions tend to require major version changes to
> accommodate the backwards incompatibility.
> >
> > I think it may be too late to be strict though, since schemas came from
> SQL, and both supported SQL dialects are very permissive here. At this
> point it seems easier to be very permissive within Beam, and provide ways
> to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
> to translate fields for language types, raising errors when a schema is
> incompatible for some IO, etc).
> >
> > [1] https://calcite.apache.org/docs/reference.html#identifiers
> > [2]
> https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
> > [3] https://github.com/apache/beam/pull/10413
> >
> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com> wrote:
> >>
> >> I'm all for initial strict naming rules, that we can relax as we learn
> more. Additional restrictions tend to require major version changes to
> accommodate the backwards incompatibility.
> >>
> >> I'd rather the community provide compelling use cases for relaxations than
> us speculating about what could be useful at the outset.
> >>
> >> That said, it might be a touch late for schema fields...
> >>
> >> It's definitely my Go Bias showing but a sensible start is to not allow
> fields to start with a digit. This matches most C derived languages (which
> includes all our SDK languages at present, except maybe for Scio...).
> >>
> >>
> >>
> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
> >>>
> >>> For completeness, here's another proposal.
> >>>
> >>> We impose limits on Beam field names, and have automatic ways of
> escaping or translating characters that don't match. When the Beam field
> name does not match the field name in other systems, we use field Options
> to store the "original" name so it is not lost. That way we don't have to
> rely on the field names always being textually identical.
> >>>
> >>> Downside here: any time we automatically munge a field name, we make
> select statements a bit more awkward, as the user has to put the munged
> field name into the select.
> >>>
> >>> Reuven
> >>>
> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com>
> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com>
> wrote:
> >>>>>>
> >>>>>> In Beam schemas we don't seem to have a well-defined policy around
> special characters (like $.[]) in field names. There's never any explicit
> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
> more natural . when concatenating field names in a nested select [1])
> >>>>>>
> >>>>>> I think we should explicitly allow any special character (any valid
> UTF-8 character?) to be used in Beam schema field names. But in order to do
> this we will need to provide solutions for some edge cases. To my knowledge
> there are two problems that arise with some special characters in field
> names:
> >>>>>>
> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and
> NamedTuples in python).
> >>>>>
> >>>>>
> >>>>> We already have this problem - i.e. if you name a schema field to be
> int, or any other reserved string. We should disambiguate.
> >>>>
> >>>> True, but as I point out below we have ways to deal with this
> problem. (2) is really the problem we need to solve.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> 2. It can make field accesses ambiguous (i.e. does
> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
> with that exact name or a nested field?).
> >>>>>
> >>>>>
> >>>>> I still think that we should reserve _some_ special characters. I'm
> not sure what the use is for allowing any character to be used.
> >>>>
> >>>> The use would be ensuring that we don't run into compatibility issues
> when mapping schemas from other systems that have made different choices
> about which characters are special.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> We already have some precedent for (1) - Beam SQL produces field
> names like `$col1` for unaliased fields in query outputs, and this is
> allowed. If a user wants to map a schema with a field like this to a POJO,
> they have to first rename the incompatible field(s), or use an
> @SchemaFieldName annotation to map the field name. I think these are
> reasonable solutions.
> >>>>>>
> >>>>>> We do not have a solution for (2) though. I think we should allow
> the use of a backslash to escape characters that otherwise have special
> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
> >>>>
> >>>> I think the SQL way of handling this is to require a field name to be
> wrapped in some way when it contains special characters, e.g.
> "`some.parent.field`.`some.child.field`". We could consider that as well.
> >>>>>>
> >>>>>>
> >>>>>> Does anyone have any objection to this proposal, or is there
> anything I'm overlooking? If not, I'm happy to take the task to implement
> the escape character change.
> >>>>>>
> >>>>>> Brian
> >>>>>>
> >>>>>> [1]
> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
> >>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>

Re: Special characters in Beam Schema field names

Posted by Robert Bradshaw <ro...@google.com>.

Given the flexibility of SQL, and the diversity of upstream systems,
I'd lean on the side of being maximally flexible and saying a field
name is a utf-8 string (including whitespace?), but special characters
may require quoting and/or not allow some convenience (e.g. POJO
creation).
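
As an example of a convenience that would not be available for every name,
here is a hedged Python sketch of a check for whether a field name could be
used directly as a NamedTuple field. The helper name
`can_be_namedtuple_field` is hypothetical; the three conditions mirror the
rules `collections.namedtuple` enforces on field names.

```python
import keyword


def can_be_namedtuple_field(name):
    """Return True if `name` is usable as a namedtuple field name.

    namedtuple requires a valid identifier that is not a keyword and does
    not start with an underscore.
    """
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and not name.startswith("_")
    )


for candidate in ["user_id", "$col1", "parent.child", "class", "first name"]:
    print(candidate, can_be_namedtuple_field(candidate))
```

Names that fail the check would need a rename or an explicit mapping before
being converted to a language type, while still being legal schema field
names.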

On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bh...@google.com> wrote:
>
> Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted) field names to contain any character. So it's currently possible for SqlTransform to produce schemas with field names containing dots and other special characters, which we can't handle properly outside of the SQL context. If we do want to have some special characters, I think we should validate that schemas don't contain them, which would limit what you can output with SqlTransform, for better or worse.
>
> > We impose limits on Beam field names, and have automatic ways of escaping or translating characters that don't match. When the Beam field name does not match the field name in other systems, we use field Options to store the "original" name so it is not lost. That way we don't have to rely on the field names always being textually identical.
>
> A good use of the new Options feature :)
> One of the problems I would like this thread to solve though is the possibility of using schemas and rows for the Options themselves (discussed extensively in Alex's PR [3]). If we use Options to handle special characters, we would need options on the schema of the Options (I think I said that right?) to solve it in that context.
>
> > I'm all for initial strict naming rules, that we can relax as we learn more. Additional restrictions tend to require major version changes to accommodate the backwards incompatibility.
>
> I think it may be too late to be strict though, since schemas came from SQL, and both supported SQL dialects are very permissive here. At this point it seems easier to be very permissive within Beam, and provide ways to deal with incompatibilities at the boundaries (e.g. SDKs providing ways to translate fields for language types, raising errors when a schema is incompatible for some IO, etc).
>
> [1] https://calcite.apache.org/docs/reference.html#identifiers
> [2] https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
> [3] https://github.com/apache/beam/pull/10413
>
> On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com> wrote:
>>
>> I'm all for initial strict naming rules, that we can relax as we learn more. Additional restrictions tend to require major version changes to accommodate the backwards incompatibility.
>>
>> I'd rather the community provide compelling use cases for relaxations than us speculating about what could be useful at the outset.
>>
>> That said, it might be a touch late for schema fields...
>>
>> It's definitely my Go Bias showing but a sensible start is to not allow fields to start with a digit. This matches most C derived languages (which includes all our SDK languages at present, except maybe for Scio...).
>>
>>
>>
>> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>>>
>>> For completeness, here's another proposal.
>>>
>>> We impose limits on Beam field names, and have automatic ways of escaping or translating characters that don't match. When the Beam field name does not match the field name in other systems, we use field Options to store the "original" name so it is not lost. That way we don't have to rely on the field names always being textually identical.
>>>
>>> Downside here: any time we automatically munge a field name, we make select statements a bit more awkward, as the user has to put the munged field name into the select.
>>>
>>> Reuven
>>>
>>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com> wrote:
>>>>>>
>>>>>> In Beam schemas we don't seem to have a well-defined policy around special characters (like $.[]) in field names. There's never any explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather than the more natural . when concatenating field names in a nested select [1])
>>>>>>
>>>>>> I think we should explicitly allow any special character (any valid UTF-8 character?) to be used in Beam schema field names. But in order to do this we will need to provide solutions for some edge cases. To my knowledge there are two problems that arise with some special characters in field names:
>>>>>>
>>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and NamedTuples in python).
>>>>>
>>>>>
>>>>> We already have this problem - i.e. if you name a schema field to be int, or any other reserved string. We should disambiguate.
>>>>
>>>> True, but as I point out below we have ways to deal with this problem. (2) is really the problem we need to solve.
>>>>>
>>>>>
>>>>>>
>>>>>> 2. It can make field accesses ambiguous (i.e. does `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field with that exact name or a nested field?).
>>>>>
>>>>>
>>>>> I still think that we should reserve _some_ special characters. I'm not sure what the use is for allowing any character to be used.
>>>>
>>>> The use would be ensuring that we don't run into compatibility issues when mapping schemas from other systems that have made different choices about which characters are special.
>>>>>
>>>>>
>>>>>>
>>>>>> We already have some precedent for (1) - Beam SQL produces field names like `$col1` for unaliased fields in query outputs, and this is allowed. If a user wants to map a schema with a field like this to a POJO, they have to first rename the incompatible field(s), or use an @SchemaFieldName annotation to map the field name. I think these are reasonable solutions.
>>>>>>
>>>>>> We do not have a solution for (2) though. I think we should allow the use of a backslash to escape characters that otherwise have special meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>>
>>>> I think the SQL way of handling this is to require a field name to be wrapped in some way when it contains special characters, e.g. "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>>>>
>>>>>>
>>>>>> Does anyone have any objection to this proposal, or is there anything I'm overlooking? If not, I'm happy to take the task to implement the escape character change.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4

Re: Special characters in Beam Schema field names

Posted by Brian Hulette <bh...@google.com>.

Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted)
field names to contain any character. So it's currently possible for
SqlTransform to produce schemas with field names containing dots and other
special characters, which we can't handle properly outside of the SQL
context. If we do want to have some special characters, I think we should
validate that schemas don't contain them, which would limit what you can
output with SqlTransform, for better or worse.
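
For illustration only, the kind of validation described above might look
like the following Python sketch. The reserved set is the list of characters
from the first message (.[]{}*); the function name
`find_reserved_characters` is hypothetical and is not an existing Beam API.

```python
# Characters with special meaning in field access paths, per the thread.
RESERVED = frozenset(".[]{}*")


def find_reserved_characters(field_names):
    """Map each offending field name to the reserved characters it contains."""
    problems = {}
    for name in field_names:
        bad = sorted(set(name) & RESERVED)
        if bad:
            problems[name] = bad
    return problems


print(find_reserved_characters(["id", "$col1", "parent.child", "tags[]"]))
```

A check like this would accept `$col1` but reject a quoted SQL identifier
such as `parent.child`, which is the limitation on SqlTransform output
mentioned above.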

> We impose limits on Beam field names, and have automatic ways of escaping
> or translating characters that don't match. When the Beam field name does
> not match the field name in other systems, we use field Options to store
> the "original" name so it is not lost. That way we don't have to rely on
> the field names always being textually identical.

A good use of the new Options feature :)
One of the problems I would like this thread to solve though is the
possibility of using schemas and rows for the Options themselves (discussed
extensively in Alex's PR [3]). If we use Options to handle special
characters, we would need options on the schema of the Options (I think I
said that right?) to solve it in that context.
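
For reference, the munging proposal quoted above could be sketched roughly
as follows in Python. Everything here is an assumption made for
illustration: the function name `munge_field_name`, the rule of replacing
every character outside [A-Za-z0-9_] with an underscore, and the
`original_name` option key.

```python
import re


def munge_field_name(name):
    """Return (munged_name, options) for a field name from another system.

    Characters outside [A-Za-z0-9_] become underscores. If the name changed,
    the original is kept under a hypothetical "original_name" option.
    """
    munged = re.sub(r"[^A-Za-z0-9_]", "_", name)
    options = {"original_name": name} if munged != name else {}
    return munged, options


print(munge_field_name("parent.child"))
print(munge_field_name("id"))
```

Note that this simple rule maps both "a.b" and "a_b" to "a_b", so a real
implementation would also need a way to resolve collisions.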

> I'm all for initial strict naming rules, that we can relax as we learn
> more. Additional restrictions tend to require major version changes to
> accommodate the backwards incompatibility.

I think it may be too late to be strict though, since schemas came from
SQL, and both supported SQL dialects are very permissive here. At this
point it seems easier to be very permissive within Beam, and provide ways
to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
to translate fields for language types, raising errors when a schema is
incompatible for some IO, etc).

[1] https://calcite.apache.org/docs/reference.html#identifiers
[2]
https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
[3] https://github.com/apache/beam/pull/10413

On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <ro...@frantil.com> wrote:

> I'm all for initial strict naming rules, that we can relax as we learn
> more. Additional restrictions tend to require major version changes to
> accommodate the backwards incompatibility.
>
> I'd rather the community provide compelling use cases for relaxations than us
> speculating about what could be useful at the outset.
>
> That said, it might be a touch late for schema fields...
>
> It's definitely my Go Bias showing but a sensible start is to not allow
> fields to start with a digit. This matches most C derived languages (which
> includes all our SDK languages at present, except maybe for Scio...).
>
>
>
> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>
>> For completeness, here's another proposal.
>>
>> We impose limits on Beam field names, and have automatic ways of escaping
>> or translating characters that don't match. When the Beam field name does
>> not match the field name in other systems, we use field Options to store
>> the "original" name so it is not lost. That way we don't have to rely on
>> the field names always being textually identical.
>>
>> Downside here: any time we automatically munge a field name, we make
>> select statements a bit more awkward, as the user has to put the munged
>> field name into the select.
>>
>> Reuven
>>
>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bh...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>>
>>>>> In Beam schemas we don't seem to have a well-defined policy around
>>>>> special characters (like $.[]) in field names. There's never any explicit
>>>>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
>>>>> more natural . when concatenating field names in a nested select [1])
>>>>>
>>>>> I think we should explicitly allow any special character (any valid
>>>>> UTF-8 character?) to be used in Beam schema field names. But in order to do
>>>>> this we will need to provide solutions for some edge cases. To my knowledge
>>>>> there are two problems that arise with some special characters in field
>>>>> names:
>>>>>
>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and
>>>>> NamedTuples in python).
>>>>>
>>>>
>>>> We already have this problem - i.e. if you name a schema field to be
>>>> int, or any other reserved string. We should disambiguate.
>>>>
>>> True, but as I point out below we have ways to deal with this problem.
>>> (2) is really the problem we need to solve.
>>>
>>>>
>>>>
>>>>> 2. It can make field accesses ambiguous (i.e. does
>>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>>>> with that exact name or a nested field?).
>>>>>
>>>>
>>>> I still think that we should reserve _some_ special characters. I'm not
>>>> sure what the use is for allowing any character to be used.
>>>>
>>> The use would be ensuring that we don't run into compatibility issues
>>> when mapping schemas from other systems that have made different choices
>>> about which characters are special.
>>>
>>>>
>>>>
>>>>> We already have some precedent for (1) - Beam SQL produces field names
>>>>> like `$col1` for unaliased fields in query outputs, and this is allowed. If
>>>>> a user wants to map a schema with a field like this to a POJO, they have to
>>>>> first rename the incompatible field(s), or use an @SchemaFieldName
>>>>> annotation to map the field name. I think these are reasonable solutions.
>>>>>
>>>>> We do not have a solution for (2) though. I think we should allow the
>>>>> use of a backslash to escape characters that otherwise have special meaning
>>>>> for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>>>
>>> I think the SQL way of handling this is to require a field name to be
>>> wrapped in some way when it contains special characters, e.g.
>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>
>>>>
>>>>> Does anyone have any objection to this proposal, or is there anything
>>>>> I'm overlooking? If not, I'm happy to take the task to implement the escape
>>>>> character change.
>>>>>
>>>>> Brian
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>>>
>>>>

Re: Special characters in Beam Schema field names

Posted by Robert Burke <ro...@frantil.com>.

I'm all for initial strict naming rules, that we can relax as we learn
more. Additional restrictions tend to require major version changes to
accommodate the backwards incompatibility.

I'd rather the community provide compelling use cases for relaxations than us
speculating about what could be useful at the outset.

That said, it might be a touch late for schema fields...

It's definitely my Go bias showing, but a sensible start is to not allow
field names to start with a digit. This matches most C-derived languages
(which includes all our SDK languages at present, except maybe for Scio...).
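
A minimal sketch of what such a strict starting rule could look like. The
exact character set (ASCII letters, digits, underscore) and the helper name
is_strict_field_name are assumptions made for illustration; nothing here is
an existing Beam API or an agreed policy.

```python
import re

# Hypothetical rule, not Beam's actual policy: ASCII letters, digits and
# underscores only, and the first character must not be a digit.
_STRICT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_strict_field_name(name: str) -> bool:
    """Return True if `name` satisfies the illustrative strict rule."""
    return _STRICT_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["user_id", "1st_place", "$col1", "parent.child"]:
        print(candidate, is_strict_field_name(candidate))
```

Under this rule a name like `$col1` or `parent.child` would be rejected, so
relaxing the rule later only ever accepts more names and should not break
existing schemas.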



On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:

> For completeness, here's another proposal.
>
> We impose limits on Beam field names, and have automatic ways of escaping
> or translating characters that don't match. When the Beam field name does
> not match the field name in other systems, we use field Options to store
> the "original" name so it is not lost. That way we don't have to rely on
> the field names always being textually identical.
>
> Downside here: any time we automatically munge a field name, we make
> select statements a bit more awkward, as the user has to put the munged
> field name into the select.
>
> Reuven

Re: Special characters in Beam Schema field names

Posted by Reuven Lax <re...@google.com>.

For completeness, here's another proposal.

We impose limits on Beam field names, and have automatic ways of escaping
or translating characters that don't match. When the Beam field name does
not match the field name in other systems, we use field Options to store
the "original" name so it is not lost. That way we don't have to rely on
the field names always being textually identical.

Downside here: any time we automatically munge a field name, we make select
statements a bit more awkward, as the user has to put the munged field name
into the select.
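
A rough sketch of this idea, under stated assumptions: the Field class, the
"original_name" option key, and the replace-with-underscore rule are all
hypothetical and only stand in for whatever limits and field Options would
actually be chosen.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Field:
    # Hypothetical stand-in for a schema field; `options` mimics field Options.
    name: str
    options: dict = field(default_factory=dict)


def munge(original: str) -> Field:
    """Replace disallowed characters with '_' and remember the original name."""
    safe = re.sub(r"[^A-Za-z0-9_]", "_", original)
    # Avoid an empty name or a leading digit in the translated name.
    if safe == "" or safe[0].isdigit():
        safe = "_" + safe
    result = Field(name=safe)
    if safe != original:
        # Keep the source system's name so it is not lost.
        result.options["original_name"] = original
    return result


if __name__ == "__main__":
    print(munge("parent.child"))
    print(munge("user_id"))
```

Note that this naive translation is not injective ("a.b" and "a_b" both
become "a_b"), so a real implementation would also need a collision policy.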

Reuven


Re: Special characters in Beam Schema field names

Posted by Brian Hulette <bh...@google.com>.

On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com> wrote:

>
>
> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com>
> wrote:
>
>> In Beam schemas we don't seem to have a well-defined policy around
>> special characters (like $.[]) in field names. There's never any explicit
>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
>> more natural . when concatenating field names in a nested select [1])
>>
>> I think we should explicitly allow any special character (any valid UTF-8
>> character?) to be used in Beam schema field names. But in order to do this
>> we will need to provide solutions for some edge cases. To my knowledge
>> there are two problems that arise with some special characters in field
>> names:
>>
>> 1. They can't be mapped to language types (e.g. Java Classes, and
>> NamedTuples in python).
>>
>
> We already have this problem - i.e. if you name a schema field to be int,
> or any other reserved string. We should disambiguate.
>
True, but as I point out below we have ways to deal with this problem. (2)
is really the problem we need to solve.

>
>
>> 2. It can make field accesses ambiguous (i.e. does
>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>> with that exact name or a nested field?).
>>
>
> I still think that we should reserve _some_ special characters. I'm not
> sure what the use is for allowing any character to be used.
>
The use would be ensuring that we don't run into compatibility issues when
mapping schemas from other systems that have made different choices about
which characters are special.

>
>
>> We already have some precedent for (1) - Beam SQL produces field names
>> like `$col1` for unaliased fields in query outputs, and this is allowed. If
>> a user wants to map a schema with a field like this to a POJO, they have to
>> first rename the incompatible field(s), or use an @SchemaFieldName
>> annotation to map the field name. I think these are reasonable solutions.
>>
>> We do not have a solution for (2) though. I think we should allow the use
>> of a backslash to escape characters that otherwise have special meaning for
>> FieldAccessDescriptors (based on [2] this is .[]{}*).
>>
>
I think the SQL way of handling this is to require a field name to be
wrapped in some way when it contains special characters, e.g.
"`some.parent.field`.`some.child.field`". We could consider that as well.

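To compare the two notations, here is a small sketch of a path splitter that
accepts both a backslash escape and backtick wrapping. The function name
split_field_path and the exact rules are assumptions for illustration; this
is not the FieldSpecifierNotation grammar and it ignores the other special
characters ([]{}*).

```python
from typing import List


def split_field_path(path: str) -> List[str]:
    """Split a dotted field path into field names.

    Illustrative rules (assumptions, not Beam's grammar):
      * a backslash makes the next character literal;
      * text between backticks is literal, including dots.
    """
    parts: List[str] = []
    current: List[str] = []
    in_backticks = False
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\":
            if i + 1 >= len(path):
                raise ValueError("dangling backslash")
            current.append(path[i + 1])
            i += 2
            continue
        if ch == "`":
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            # An unquoted, unescaped dot separates nested fields.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_backticks:
        raise ValueError("unterminated backtick")
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    print(split_field_path("parent.child"))
    print(split_field_path(r"parent\.child"))
    print(split_field_path("`some.parent.field`.`some.child.field`"))
```

With either notation, "parent.child" stays a nested access, while the
escaped or wrapped form refers to a single field whose name contains a dot.
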
>
>> Does anyone have any objection to this proposal, or is there anything I'm
>> overlooking? If not, I'm happy to take the task to implement the escape
>> character change.
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>
>

Re: Special characters in Beam Schema field names

Posted by Reuven Lax <re...@google.com>.

On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bh...@google.com> wrote:

> In Beam schemas we don't seem to have a well-defined policy around special
> characters (like $.[]) in field names. There's never any explicit
> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
> more natural . when concatenating field names in a nested select [1])
>
> I think we should explicitly allow any special character (any valid UTF-8
> character?) to be used in Beam schema field names. But in order to do this
> we will need to provide solutions for some edge cases. To my knowledge
> there are two problems that arise with some special characters in field
> names:
>
> 1. They can't be mapped to language types (e.g. Java Classes, and
> NamedTuples in python).
>

We already have this problem - i.e. if you name a schema field to be int,
or any other reserved string. We should disambiguate.
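
For the Python side, a small sketch of how the standard library already
handles names that cannot be attributes. It uses collections.namedtuple with
rename=True; the field list is made up for illustration, and "class" stands
in for a reserved word because "int" is reserved in Java but not in Python.
This only shows stdlib behavior, not how the Beam Python SDK maps schemas.

```python
from collections import namedtuple

# Made-up schema field names: one with a special character, one reserved word.
schema_field_names = ["$col1", "class", "user_id"]

# rename=True replaces invalid names with positional ones (_0, _1, ...).
Row = namedtuple("Row", schema_field_names, rename=True)

# Keep a mapping back to the schema's names so nothing is lost.
attr_to_schema_name = dict(zip(Row._fields, schema_field_names))

if __name__ == "__main__":
    print(Row._fields)  # ('_0', '_1', 'user_id')
    print(attr_to_schema_name)
```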


> 2. It can make field accesses ambiguous (i.e. does
> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
> with that exact name or a nested field?).
>

I still think that we should reserve _some_ special characters. I'm not
sure what the use is for allowing any character to be used.


> We already have some precedent for (1) - Beam SQL produces field names
> like `$col1` for unaliased fields in query outputs, and this is allowed. If
> a user wants to map a schema with a field like this to a POJO, they have to
> first rename the incompatible field(s), or use an @SchemaFieldName
> annotation to map the field name. I think these are reasonable solutions.
>
> We do not have a solution for (2) though. I think we should allow the use
> of a backslash to escape characters that otherwise have special meaning for
> FieldAccessDescriptors (based on [2] this is .[]{}*).
>
> Does anyone have any objection to this proposal, or is there anything I'm
> overlooking? If not, I'm happy to take the task to implement the escape
> character change.
>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>