You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Dawid Wysakowicz <dw...@apache.org> on 2020/06/18 10:29:33 UTC

Nullability support in structured types

Hi devs,

I wanted to reach out to the Apache Calcite community about an issue
with nullability of structured types we observed in Apache Flink.

First of all let me give you a quick introduction why it is important
for us. What we want to support eventually is that we want to be able to
map POJOs to structured types.

For example, having a POJO like

class Address {
  public String street,
  public int zip, // important it is a primitive int
  public String city);
}

we want to map it a type

CREATE TYPE Address AS (
  street VARCHAR(20),
  zip INT NOT NULL,
  city VARCHAR(20) NOT NULL);

and with that type we want to create a table

CREATE TABLE Contacts (
  name VARCHAR(20) NOT NULL,
  address Address); -- Address structured type is nullable

Unfortunately as described in CALCITE-2464[1], the way it works is that
first a not null address type is created which is further converted to
nullable via TypeFactor#createTypeWithNullability. Unfortunately the way
this method works right now is that it changes all the nested fields to
nullable if the the strucuted type is nullable. Therefore we can no
longer map this type to our original pojo, because the zip field has
nullable INT type now. Which in turn makes it impossible to pass that
pojo as a parameter to a UDF:

|class UDF extends ScalarFunction { public Object eval(Address address)
{ if (address != null) { // based on the type I know address.id is NOT
NULL here int id = address.id; } } }|

We were able to circumvent that somewhat in our version of the
TypeFactory, but this causes problems when accessing fields of such
type. In most of the places in Calcite as far as I can tell the type of
an accessed nested field will simply be RelDataType#getField#getType,
but this does not work if the outer type is nullable. Take the above
example:

SELECT address.zip FROM Contacts;

This query should have a column of type INT instead of INT NOT NULL,
because if the column itself (the address) is null we can not produce a
not null zip.

I was able to make it work with few changes to Calcite classes[2]:

* SqlDotOperator - I had to change the deriveType method, this class has
the needed logic in the inferReturnType

* SqlItemOperator - the same change in deriveType

* AliasNamespace - this class is stripping down the original nullability

* FlinkRexBuilder, SqlNameMatcher - those classes are used when
converting SqlIdentifier in a query like SELECT address.zip FROM Contacts;


Here come my questions:

1. Do you have a suggestion/better way to handle the problem

2. Would you be willing to accept a contribution of the changes
mentioned above? (I was thinking of the first three classes). We would
like to use vanilla Calcite as much as possible.


I am happy to answer any questions, if I described something not clear
enough. I am also looking forward to any comments/suggestions.

Best,

Dawid

[1] https://issues.apache.org/jira/browse/CALCITE-2464

[2]
https://github.com/apache/flink/pull/12649/commits/1c33fe9cb5491f47fb19c16f32de5e6aef5089d3

Re: Nullability support in structured types

Posted by Ruben Q L <ru...@gmail.com>.

In my opinion, your propositions look like an improvement, so please go
ahead, open an issue and submit a patch.
We can discuss the details in the Jira / PR (hopefully other people will be
involved in the discussion too).

Best regards,
Ruben


Le ven. 19 juin 2020 à 09:45, Dawid Wysakowicz <dw...@apache.org> a
écrit :

> Thank you for the reply Ruben,
>
> I understand the Calcite stand on the nullability. We should be fine
> having the custom logic in our factory.
>
> Happy to hear you are willing to accept some of the changes. Your
> summary is pretty good. Nevetheless I thought I'd extend it a little bit
> and also categorize it in two groups.
>
> 1. Changes that require copying Calcite classes (this is the group, we
> would like to contribute the most)
>
>     - SqlDotOperator#deriveType
>
>     - SqlItemOperator#inferReturnType (yes, I made a mistake with the
> methods in the original email)
>
>     Those two, as you said implement the logic that is already in
> SqlDotOperator#inferReturnType
>
>     - AliasNamespace#validateImpl - here the change is about forwading
> the original StructKind and nullability. Right now AliasNamespace always
> strips those and produces NOT NULL FULLY_QUALIFIED RelRecordType.
>
> 2. Changes that we can make by extending Calcite classes. It would be
> nice to have those also in Calcite, but we would be fine having it in
> Flink only. Even if not perfect, they go through extension points.
>
>     - RexBuilder#makeFieldAccessInternal -> in the attached commit it is
> implemented in FlinkRexBuilder, this is the same logic as in
> SqlDotOperator#deriveType/inferReturnType, SqlItemOperator#inferReturnType
>
>     - SqlValidatorImpl.DeriveTypeVisitor#visit(SqlIdentifier) -> in the
> attached commit it is implemented in FlinkSqlNameMatcher(which honestly
> is quite sketchy), again we would greatly appreciate if we could add the
> logic from SqlDotOperator#deriveType/inferReturnType,
> SqlItemOperator#inferReturnType to that method.
>
> If you are fine with it. I'll open an issue and submit a patch in the
> next few days.
>
> Best,
>
> Dawid
>
>
> On 18/06/2020 14:37, Ruben Q L wrote:
> > Hi Dawid,
> >
> > thanks for starting this discussion.
> >
> > The reason behind the current behavior of RelDataTypeFactoryImpl#
> > createTypeWithNullability (*"the way this method works right now is that
> it
> > changes all the nested fields to nullable if the the structured type is
> > nullable"*) is because during the discussion for CALCITE-2464 we decided
> > this was the best way to represent SQL standard (*"According to the SQL
> > standard, nullability for struct types can be defined only for columns,
> > which translates to top level structs. Nested struct attributes are
> always
> > nullable, so in principle we could always set the nested attributes to be
> > nullable"*).
> >
> > You say that this does not work for your case, but that you were able to
> > workaround the issue with your own version of TypeFactory. I am not sure
> if
> > this is the best solution for this problem, but at least it is a
> solution.
> >
> > Regarding the changes that you propose, if I am not mistaken they could
> be
> > summarized as:
> > - SqlDotOperator#deriveType
> > - SqlItemOperator#inferReturnType (your email mentions deriveType, but
> the
> > diff shows me it is actually inferReturnType)
> > - AliasNamespace#validateImpl
> >
> > At first glance, I think those changes are fair, and could be integrated
> as
> > a contribution. As I see it, you are applying in different methods the
> > logic that was already in place for SqlDotOperator#inferReturnType.
> >
> > Best regards,
> > Ruben
> >
> >
> > Le jeu. 18 juin 2020 à 12:29, Dawid Wysakowicz <dw...@apache.org>
> a
> > écrit :
> >
> >> Hi devs,
> >>
> >> I wanted to reach out to the Apache Calcite community about an issue
> with
> >> nullability of structured types we observed in Apache Flink.
> >>
> >> First of all let me give you a quick introduction why it is important
> for
> >> us. What we want to support eventually is that we want to be able to map
> >> POJOs to structured types.
> >>
> >> For example, having a POJO like
> >> class Address {
> >>   public String street,
> >>   public int zip, // important it is a primitive int
> >>   public String city);
> >> }
> >>
> >> we want to map it a type
> >>
> >> CREATE TYPE Address AS (
> >>   street VARCHAR(20),
> >>   zip INT NOT NULL,
> >>   city VARCHAR(20) NOT NULL);
> >>
> >> and with that type we want to create a table
> >>
> >> CREATE TABLE Contacts (
> >>   name VARCHAR(20) NOT NULL,
> >>   address Address); -- Address structured type is nullable
> >>
> >> Unfortunately as described in CALCITE-2464[1], the way it works is that
> >> first a not null address type is created which is further converted to
> >> nullable via TypeFactor#createTypeWithNullability. Unfortunately the way
> >> this method works right now is that it changes all the nested fields to
> >> nullable if the the strucuted type is nullable. Therefore we can no
> longer
> >> map this type to our original pojo, because the zip field has nullable
> INT
> >> type now. Which in turn makes it impossible to pass that pojo as a
> >> parameter to a UDF:
> >>
> >> class UDF extends ScalarFunction {
> >>   public Object eval(Address address) {
> >>      if (address != null) {
> >>          // based on the type I know address.id is NOT NULL here
> >>         int id = address.id;
> >>      }
> >>   }
> >> }
> >>
> >> We were able to circumvent that somewhat in our version of the
> >> TypeFactory, but this causes problems when accessing fields of such
> type.
> >> In most of the places in Calcite as far as I can tell the type of an
> >> accessed nested field will simply be RelDataType#getField#getType, but
> this
> >> does not work if the outer type is nullable. Take the above example:
> >>
> >> SELECT address.zip FROM Contacts;
> >>
> >> This query should have a column of type INT instead of INT NOT NULL,
> >> because if the column itself (the address) is null we can not produce a
> not
> >> null zip.
> >>
> >> I was able to make it work with few changes to Calcite classes[2]:
> >>
> >> * SqlDotOperator - I had to change the deriveType method, this class has
> >> the needed logic in the inferReturnType
> >>
> >> * SqlItemOperator - the same change in deriveType
> >>
> >> * AliasNamespace - this class is stripping down the original nullability
> >>
> >> * FlinkRexBuilder, SqlNameMatcher - those classes are used when
> converting
> >> SqlIdentifier in a query like SELECT address.zip FROM Contacts;
> >>
> >>
> >> Here come my questions:
> >>
> >> 1. Do you have a suggestion/better way to handle the problem
> >>
> >> 2. Would you be willing to accept a contribution of the changes
> mentioned
> >> above? (I was thinking of the first three classes). We would like to use
> >> vanilla Calcite as much as possible.
> >>
> >>
> >> I am happy to answer any questions, if I described something not clear
> >> enough. I am also looking forward to any comments/suggestions.
> >>
> >> Best,
> >>
> >> Dawid
> >>
> >> [1] https://issues.apache.org/jira/browse/CALCITE-2464
> >>
> >> [2]
> >>
> https://github.com/apache/flink/pull/12649/commits/1c33fe9cb5491f47fb19c16f32de5e6aef5089d3
> >>
>
>

Re: Nullability support in structured types

Posted by Dawid Wysakowicz <dw...@apache.org>.

Thank you for the reply Ruben,

I understand the Calcite stand on the nullability. We should be fine
having the custom logic in our factory.

Happy to hear you are willing to accept some of the changes. Your
summary is pretty good. Nevetheless I thought I'd extend it a little bit
and also categorize it in two groups.

1. Changes that require copying Calcite classes (this is the group, we
would like to contribute the most)

    - SqlDotOperator#deriveType

    - SqlItemOperator#inferReturnType (yes, I made a mistake with the
methods in the original email)

    Those two, as you said implement the logic that is already in
SqlDotOperator#inferReturnType

    - AliasNamespace#validateImpl - here the change is about forwading
the original StructKind and nullability. Right now AliasNamespace always
strips those and produces NOT NULL FULLY_QUALIFIED RelRecordType.

2. Changes that we can make by extending Calcite classes. It would be
nice to have those also in Calcite, but we would be fine having it in
Flink only. Even if not perfect, they go through extension points.

    - RexBuilder#makeFieldAccessInternal -> in the attached commit it is
implemented in FlinkRexBuilder, this is the same logic as in
SqlDotOperator#deriveType/inferReturnType, SqlItemOperator#inferReturnType

    - SqlValidatorImpl.DeriveTypeVisitor#visit(SqlIdentifier) -> in the
attached commit it is implemented in FlinkSqlNameMatcher(which honestly
is quite sketchy), again we would greatly appreciate if we could add the
logic from SqlDotOperator#deriveType/inferReturnType,
SqlItemOperator#inferReturnType to that method.

If you are fine with it. I'll open an issue and submit a patch in the
next few days.

Best,

Dawid


On 18/06/2020 14:37, Ruben Q L wrote:
> Hi Dawid,
>
> thanks for starting this discussion.
>
> The reason behind the current behavior of RelDataTypeFactoryImpl#
> createTypeWithNullability (*"the way this method works right now is that it
> changes all the nested fields to nullable if the the structured type is
> nullable"*) is because during the discussion for CALCITE-2464 we decided
> this was the best way to represent SQL standard (*"According to the SQL
> standard, nullability for struct types can be defined only for columns,
> which translates to top level structs. Nested struct attributes are always
> nullable, so in principle we could always set the nested attributes to be
> nullable"*).
>
> You say that this does not work for your case, but that you were able to
> workaround the issue with your own version of TypeFactory. I am not sure if
> this is the best solution for this problem, but at least it is a solution.
>
> Regarding the changes that you propose, if I am not mistaken they could be
> summarized as:
> - SqlDotOperator#deriveType
> - SqlItemOperator#inferReturnType (your email mentions deriveType, but the
> diff shows me it is actually inferReturnType)
> - AliasNamespace#validateImpl
>
> At first glance, I think those changes are fair, and could be integrated as
> a contribution. As I see it, you are applying in different methods the
> logic that was already in place for SqlDotOperator#inferReturnType.
>
> Best regards,
> Ruben
>
>
> Le jeu. 18 juin 2020 à 12:29, Dawid Wysakowicz <dw...@apache.org> a
> écrit :
>
>> Hi devs,
>>
>> I wanted to reach out to the Apache Calcite community about an issue with
>> nullability of structured types we observed in Apache Flink.
>>
>> First of all let me give you a quick introduction why it is important for
>> us. What we want to support eventually is that we want to be able to map
>> POJOs to structured types.
>>
>> For example, having a POJO like
>> class Address {
>>   public String street,
>>   public int zip, // important it is a primitive int
>>   public String city);
>> }
>>
>> we want to map it a type
>>
>> CREATE TYPE Address AS (
>>   street VARCHAR(20),
>>   zip INT NOT NULL,
>>   city VARCHAR(20) NOT NULL);
>>
>> and with that type we want to create a table
>>
>> CREATE TABLE Contacts (
>>   name VARCHAR(20) NOT NULL,
>>   address Address); -- Address structured type is nullable
>>
>> Unfortunately as described in CALCITE-2464[1], the way it works is that
>> first a not null address type is created which is further converted to
>> nullable via TypeFactor#createTypeWithNullability. Unfortunately the way
>> this method works right now is that it changes all the nested fields to
>> nullable if the the strucuted type is nullable. Therefore we can no longer
>> map this type to our original pojo, because the zip field has nullable INT
>> type now. Which in turn makes it impossible to pass that pojo as a
>> parameter to a UDF:
>>
>> class UDF extends ScalarFunction {
>>   public Object eval(Address address) {
>>      if (address != null) {
>>          // based on the type I know address.id is NOT NULL here
>>         int id = address.id;
>>      }
>>   }
>> }
>>
>> We were able to circumvent that somewhat in our version of the
>> TypeFactory, but this causes problems when accessing fields of such type.
>> In most of the places in Calcite as far as I can tell the type of an
>> accessed nested field will simply be RelDataType#getField#getType, but this
>> does not work if the outer type is nullable. Take the above example:
>>
>> SELECT address.zip FROM Contacts;
>>
>> This query should have a column of type INT instead of INT NOT NULL,
>> because if the column itself (the address) is null we can not produce a not
>> null zip.
>>
>> I was able to make it work with few changes to Calcite classes[2]:
>>
>> * SqlDotOperator - I had to change the deriveType method, this class has
>> the needed logic in the inferReturnType
>>
>> * SqlItemOperator - the same change in deriveType
>>
>> * AliasNamespace - this class is stripping down the original nullability
>>
>> * FlinkRexBuilder, SqlNameMatcher - those classes are used when converting
>> SqlIdentifier in a query like SELECT address.zip FROM Contacts;
>>
>>
>> Here come my questions:
>>
>> 1. Do you have a suggestion/better way to handle the problem
>>
>> 2. Would you be willing to accept a contribution of the changes mentioned
>> above? (I was thinking of the first three classes). We would like to use
>> vanilla Calcite as much as possible.
>>
>>
>> I am happy to answer any questions, if I described something not clear
>> enough. I am also looking forward to any comments/suggestions.
>>
>> Best,
>>
>> Dawid
>>
>> [1] https://issues.apache.org/jira/browse/CALCITE-2464
>>
>> [2]
>> https://github.com/apache/flink/pull/12649/commits/1c33fe9cb5491f47fb19c16f32de5e6aef5089d3
>>

Re: Nullability support in structured types

Posted by Ruben Q L <ru...@gmail.com>.

Hi Dawid,

thanks for starting this discussion.

The reason behind the current behavior of RelDataTypeFactoryImpl#
createTypeWithNullability (*"the way this method works right now is that it
changes all the nested fields to nullable if the the structured type is
nullable"*) is because during the discussion for CALCITE-2464 we decided
this was the best way to represent SQL standard (*"According to the SQL
standard, nullability for struct types can be defined only for columns,
which translates to top level structs. Nested struct attributes are always
nullable, so in principle we could always set the nested attributes to be
nullable"*).

You say that this does not work for your case, but that you were able to
workaround the issue with your own version of TypeFactory. I am not sure if
this is the best solution for this problem, but at least it is a solution.

Regarding the changes that you propose, if I am not mistaken they could be
summarized as:
- SqlDotOperator#deriveType
- SqlItemOperator#inferReturnType (your email mentions deriveType, but the
diff shows me it is actually inferReturnType)
- AliasNamespace#validateImpl

At first glance, I think those changes are fair, and could be integrated as
a contribution. As I see it, you are applying in different methods the
logic that was already in place for SqlDotOperator#inferReturnType.

Best regards,
Ruben


Le jeu. 18 juin 2020 à 12:29, Dawid Wysakowicz <dw...@apache.org> a
écrit :

> Hi devs,
>
> I wanted to reach out to the Apache Calcite community about an issue with
> nullability of structured types we observed in Apache Flink.
>
> First of all let me give you a quick introduction why it is important for
> us. What we want to support eventually is that we want to be able to map
> POJOs to structured types.
>
> For example, having a POJO like
> class Address {
>   public String street,
>   public int zip, // important it is a primitive int
>   public String city);
> }
>
> we want to map it a type
>
> CREATE TYPE Address AS (
>   street VARCHAR(20),
>   zip INT NOT NULL,
>   city VARCHAR(20) NOT NULL);
>
> and with that type we want to create a table
>
> CREATE TABLE Contacts (
>   name VARCHAR(20) NOT NULL,
>   address Address); -- Address structured type is nullable
>
> Unfortunately as described in CALCITE-2464[1], the way it works is that
> first a not null address type is created which is further converted to
> nullable via TypeFactor#createTypeWithNullability. Unfortunately the way
> this method works right now is that it changes all the nested fields to
> nullable if the the strucuted type is nullable. Therefore we can no longer
> map this type to our original pojo, because the zip field has nullable INT
> type now. Which in turn makes it impossible to pass that pojo as a
> parameter to a UDF:
>
> class UDF extends ScalarFunction {
>   public Object eval(Address address) {
>      if (address != null) {
>          // based on the type I know address.id is NOT NULL here
>         int id = address.id;
>      }
>   }
> }
>
> We were able to circumvent that somewhat in our version of the
> TypeFactory, but this causes problems when accessing fields of such type.
> In most of the places in Calcite as far as I can tell the type of an
> accessed nested field will simply be RelDataType#getField#getType, but this
> does not work if the outer type is nullable. Take the above example:
>
> SELECT address.zip FROM Contacts;
>
> This query should have a column of type INT instead of INT NOT NULL,
> because if the column itself (the address) is null we can not produce a not
> null zip.
>
> I was able to make it work with few changes to Calcite classes[2]:
>
> * SqlDotOperator - I had to change the deriveType method, this class has
> the needed logic in the inferReturnType
>
> * SqlItemOperator - the same change in deriveType
>
> * AliasNamespace - this class is stripping down the original nullability
>
> * FlinkRexBuilder, SqlNameMatcher - those classes are used when converting
> SqlIdentifier in a query like SELECT address.zip FROM Contacts;
>
>
> Here come my questions:
>
> 1. Do you have a suggestion/better way to handle the problem
>
> 2. Would you be willing to accept a contribution of the changes mentioned
> above? (I was thinking of the first three classes). We would like to use
> vanilla Calcite as much as possible.
>
>
> I am happy to answer any questions, if I described something not clear
> enough. I am also looking forward to any comments/suggestions.
>
> Best,
>
> Dawid
>
> [1] https://issues.apache.org/jira/browse/CALCITE-2464
>
> [2]
> https://github.com/apache/flink/pull/12649/commits/1c33fe9cb5491f47fb19c16f32de5e6aef5089d3
>