You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Maximilian Michels <mx...@apache.org> on 2015/06/15 17:45:26 UTC

The null in Flink

Hi everyone,

I'm seeing a lot of null value related pull requests nowadays, like these:

https://github.com/apache/flink/pull/780
https://github.com/apache/flink/pull/831
https://github.com/apache/flink/pull/834

It used to be the case that null values were simply not supported by Flink.
Recently, Flink supports null values for some components. Now I'm wondering
what the current state of null values in Flink is. While ignoring null
values might be a good for not crashing your programs, null values are
generally a bad way of signaling empty values for which better strategies
are available. My intuition would be that it is a bit evil to support them
in DataSets.

Just to give an idea what null values could cause in Flink: DataSet.count()
returns the number of elements of all values in a Dataset (null or not)
while #834 would ignore null values and aggregate the DataSet without them.

Best,
Max

Re: The null in Flink

Posted by Aljoscha Krettek <al...@apache.org>.

Yes, I agree, these are the steps to take. Thanks for creating the issues. 
> On 08 Dec 2015, at 07:18, Li, Chengxiang <ch...@intel.com> wrote:
> 
> Chengxiang

RE: The null in Flink

Posted by "Li, Chengxiang" <ch...@intel.com>.

Hi, 
Summary of our discussion about NULL value handling in FLink:
1. Read from data source to Table/Row DataSet directly is necessary for NULL value handling.
2. NULL value representation in Row object, this would change its binary data layout, so we would need new Row Serializer/Comparator(and its dependency) which aware of this new binary data layout. Tuple and Case Class serializer/Comparator should remain the same.
3. NULL value handling in operations. We would follow the SQL standard as default, but these are not concluded yet, any more input would be welcomed. 

I've created an umbrella JIRA(https://issues.apache.org/jira/browse/FLINK-3139) for this, the following subtasks based on the previous 3 aspects would be created as well, so anyone interested could contribute and comment on all subtasks. And we could also move discussion on specified issues to JIRA system.

Thanks
Chengxiang 

-----Original Message-----
From: Li, Chengxiang [mailto:chengxiang.li@intel.com] 
Sent: Thursday, December 3, 2015 4:43 PM
To: dev@flink.apache.org
Subject: RE: The null in Flink

Hi, Stephan
Treat UNKOWN as FALSE may works if the Boolean expression is used in filter operation, but for other operations, such as select and groupBy, it does not make sense anymore, we should need UNKNOWN value(or unified as NULL) to distinguish with TRUE/FALSE .

Thanks
Chengxiang 

-----Original Message-----
From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of Stephan Ewen
Sent: Wednesday, December 2, 2015 6:27 PM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi Chenliang!

I have to dig into this again, it was a while back. I think (vaguely) the reason why this worked was that in the end (at the root of a tree that is a logical expression) if the result is UNKNOWN, it is treated like FALSE.

For example a predicate like "WHERE t.a > 10 && t.b == 'pending' ". If one boolean atom is UNKNOWN, the other is TRUE, the whole term becomes UNKNOWN and the row is filtered out (as if the predicate was false) - the result of the query contains no rows where predicate results are UNKNOWN.

Stephan



On Tue, Dec 1, 2015 at 4:09 AM, Li, Chengxiang <ch...@intel.com>
wrote:

> Stephen,
> For the 3rd topic, you mentioned that "If the boolean expressions are 
> monotonous (have no NOT), then the UNKNOWN value can be the same as 
> FALSE ", as UNKNOWN means it could be TRUE as well, does it a proper 
> way to handle it just as FALSE?
>
> Aljoscha,
> I agree with you, Table can only be transformed from Tuple/Case Class 
> DataSet now, and Tuple/Case Class does not allow null field value, so 
> read files from data source to Row DataSet is necessary for NULL value handling.
>
> -----Original Message-----
> From: Aljoscha Krettek [mailto:aljoscha@apache.org]
> Sent: Friday, November 27, 2015 6:41 PM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Oh, this is probably the Jira for what I mentioned:
> https://issues.apache.org/jira/browse/FLINK-2988
>
> > On 27 Nov 2015, at 11:02, Aljoscha Krettek <al...@apache.org> wrote:
> >
> > Hi,
> > just some information. The Table API code generator already has
> preliminary support for generating code that is NULL-aware. So for 
> example if you have expressions like 1 + NULL the result would also be null.
> >
> > I think one of the missing pieces is a way to get data that contains
> null values into the system. For example, right now the expected way 
> to read csv files is via tuples and they don’t support null values. I 
> think we need a way to directly read CSV files into a Row DataSet (or Table).
> >
> > Cheers,
> > Aljoscha
> >> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
> >>
> >> Hi!
> >>
> >> Thanks for the good discussion! Here are some thoughts from my side:
> >>
> >> 1)
> >> I would agree with Chengxiang that it helps to have as much NULL 
> >> handling in the table API as possible, since most SQL constructs 
> >> will be permitted there are well.
> >>
> >> 2)
> >> A question that I have is whether we want to actually follow the 
> >> SQL standard exactly. There is a lot of criticism on NULL in the 
> >> SQL standard, and there have been many good proposals for more 
> >> meaningful semantics (for example differentiate between the 
> >> meanings "value missing", "value unknown", "value not applicable", etc).
> >>
> >> Going with the SQL way is easiest and makes SQL addition on top of 
> >> the table API much easier. Also, there is only one type of NULL, 
> >> meaning that null-values can be encoded efficiently in bitmaps.
> >> Further more, the fact that the Table API users have the power of a 
> >> programming language at hand (rather than the limited set of SQL 
> >> operators), they should be able to easily define their own 
> >> constants for special meanings like "value not applicable" or so.
> >>
> >> Just curious if anyone has experience with some of the other 
> >> null-sematic proposals that have been around.
> >>
> >> 3)
> >> One comment concerning the three-value-logic for boolean expressions:
> >>
> >> A while back, I worked on a SQL engine, and we were able to not 
> >> implement three-value logic with trick. If I recall correctly, it 
> >> was
> like this:
> >>
> >> If the boolean expressions are monotonous (have no NOT), then the 
> >> UNKNOWN value can be the same as FALSE. So the query planner had to 
> >> rewrite all expression trees to have no NOT, which means pushing 
> >> the NOT down into the leaf comparison operations (for example push 
> >> NOT into
> == to become !=).
> >> These leaf comparison operators needed to be NULL aware to return 
> >> FALSE on comparisons with NULL.
> >>
> >>
> >> Greetings,
> >> Stephan
> >>
> >>
> >> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang 
> >> <ch...@intel.com>
> >> wrote:
> >>
> >>> Thanks, Timo.
> >>> We may put the NULL related function support to SQL API, but for 
> >>> Scalar expression and Boolean expression, it already been 
> >>> supported in Table API, without NULL value handling support, query 
> >>> with Scalar expression and Boolean expression would fail while 
> >>> encounter NULL
> value.
> >>>
> >>> Thanks
> >>> Chengxiang
> >>>
> >>> -----Original Message-----
> >>> From: Timo Walther [mailto:twalthr@apache.org]
> >>> Sent: Wednesday, November 25, 2015 7:33 PM
> >>> To: dev@flink.apache.org
> >>> Subject: Re: The null in Flink
> >>>
> >>> Hi Chengxiang,
> >>>
> >>> I totally agree that the Table API should fully support NULL values.
> >>> The Table API is a logical API and therefore we should be as close 
> >>> to ANSI SQL as possible. Rows need to be nullable in the near future.
> >>>
> >>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to 
> >>> much like SQL magic. I think all other SQL magic (DBMS specific 
> >>> corner cases) should be handled by the SQL API on top of the Table API.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 25.11.2015 11:31, Li, Chengxiang wrote:
> >>>> Hi
> >>>> In this mail list, there are some discussions about null value 
> >>>> handling
> >>> in Flink, and I saw several related JIRAs as well(like FLINK-2203, 
> >>> FLINK-2210), but unfortunately, got reverted due to immature 
> >>> design, and no further action since then. I would like to pick 
> >>> this topic up here, as it's quite an important part of data 
> >>> analysis and many
> features depend on it.
> >>> Hopefully, through a plenary discussion, we can generate an 
> >>> acceptable solution and move forward. Stephan has explained very 
> >>> clearly about how and why Flink handle "Null values in the 
> >>> Programming Language APIs", so I mainly talk about the second part 
> >>> of "Null values in the high-level
> >>> (logical) APIs ".
> >>>>
> >>>> 1. Why should Flink support Null values handling in Table API?
> >>>>     i.  Data source may miss column value in many cases, if no 
> >>>> Null
> >>> values handling in Table API, user need to write an extra ETL to 
> >>> handle missing values manually.
> >>>>     ii. Some Table API operators generate Null values on their 
> >>>> own,
> >>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values 
> >>> handling in Table API is the prerequisite of these features.
> >>>>
> >>>> 2. The semantic of Null value handling in Table API.
> >>>> Fortunately, there are already mature DBMS  standards we can 
> >>>> follow for
> >>> Null value handling, I list several semantic of Null value 
> >>> handling
> here.
> >>> To be noted that, this may not cover all the cases, and the 
> >>> semantics may vary in different DBMSs, so it should totally open 
> >>> to
> discuss.
> >>>>     I,  NULL compare. In ascending order, NULL is smaller than 
> >>>> any
> >>> other value, and NULL == NULL return false.
> >>>>     ii. NULL exists in GroupBy Key, all NULL values are grouped 
> >>>> as a
> >>> single group.
> >>>>     iii. NULL exists in Aggregate columns, ignore NULL in 
> >>>> aggregation
> >>> function.
> >>>>                iv. NULL exists in both side Join key, refer to 
> >>>> #i,
> >>> NULL == NULL return false, no output for NULL Join key.
> >>>>                v.  NULL in Scalar expression, expression within
> >>> NULL(eg. 1 + NULL) return NULL.
> >>>>                vi. NULL in Boolean expression, add an extra result:
> >>> UNKNOWN, more semantic for Boolean expression in reference #1.
> >>>>                vii. More related function support, like COALESCE, 
> >>>> NVL,
> >>> NANVL, and so on.
> >>>>
> >>>> 3. NULL value storage in Table API.
> >>>>  Just set null to Row field value. To mark NULL value in 
> >>>> serialized
> >>> binary record data, normally it use extra flag for each field to 
> >>> mark whether its value is NULL, which would change the data layout 
> >>> of Row object. So any logic that access serialized Row data 
> >>> directly should updated to sync with new data layout, for example, 
> >>> many methods in RowComparator.
> >>>>
> >>>> Reference:
> >>>> 1. Nulls: Nothing to worry about:
> >>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql
> >>> -0
> >>> 97727.html
> >>> .
> >>>> 2. Null related functions:
> >>>> https://oracle-base.com/articles/misc/null-related-functions
> >>>>
> >>>> -----Original Message-----
> >>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On 
> >>>> Behalf Of Stephan Ewen
> >>>> Sent: Thursday, June 18, 2015 8:43 AM
> >>>> To: dev@flink.apache.org
> >>>> Subject: Re: The null in Flink
> >>>>
> >>>> Hi!
> >>>>
> >>>> I think we actually have two discussions here, both of them important:
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 1) Null values in the Programming Language APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> Fields in composite types may simply be null pointers.
> >>>>
> >>>> In object types:
> >>>>  - primitives members are naturally non-nullable
> >>>>  - all other members are nullable
> >>>>
> >>>> => If you want to avoid the overhead of nullability, go with 
> >>>> primitive
> >>> types.
> >>>>
> >>>> In Tuples, and derives types (Scala case classes):
> >>>>  - Fields are non-nullable.
> >>>>
> >>>> => The reason here is that we initially decided to keep tuples as 
> >>>> a very
> >>> fast data type. Because tuples cannot hold primitives in 
> >>> Java/Scala, we would not have a way to make fast non-nullable 
> >>> fields. The performance of nullable fields affects the 
> >>> key-operations, especially
> on normalized keys.
> >>>> We can work around that with some effort, but have not one it so far.
> >>>>
> >>>> => In Scala, the Option types is a natural way of elegantly 
> >>>> working
> >>> around that.
> >>>>
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 2) Null values in the high-level (logial) APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> This is mainly what Ted was referring to, if I understood him
> correctly.
> >>>>
> >>>> Here, we need to figure out what form of semantical null values 
> >>>> in the
> >>> Table API and later, in SQL.
> >>>>
> >>>> Besides deciding what semantics to follow here in the logical 
> >>>> APIs, we
> >>> need to decide what these values confert to/from when switching 
> >>> between logical/physical APIs.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning 
> >>>> <te...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels 
> >>>>> <mx...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Just to give an idea what null values could cause in Flink:
> >>>>> DataSet.count()
> >>>>>> returns the number of elements of all values in a Dataset (null 
> >>>>>> or
> >>>>>> not) while #834 would ignore null values and aggregate the 
> >>>>>> DataSet without
> >>>>> them.
> >>>>> Compare R's na.action.
> >>>>>
> >>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
> >>>>>
> >>>
> >>>
> >
>
>

RE: The null in Flink

Posted by "Li, Chengxiang" <ch...@intel.com>.

Hi, Stephan
Treat UNKOWN as FALSE may works if the Boolean expression is used in filter operation, but for other operations, such as select and groupBy, it does not make sense anymore, we should need UNKNOWN value(or unified as NULL) to distinguish with TRUE/FALSE .

Thanks
Chengxiang 

-----Original Message-----
From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of Stephan Ewen
Sent: Wednesday, December 2, 2015 6:27 PM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi Chenliang!

I have to dig into this again, it was a while back. I think (vaguely) the reason why this worked was that in the end (at the root of a tree that is a logical expression) if the result is UNKNOWN, it is treated like FALSE.

For example a predicate like "WHERE t.a > 10 && t.b == 'pending' ". If one boolean atom is UNKNOWN, the other is TRUE, the whole term becomes UNKNOWN and the row is filtered out (as if the predicate was false) - the result of the query contains no rows where predicate results are UNKNOWN.

Stephan



On Tue, Dec 1, 2015 at 4:09 AM, Li, Chengxiang <ch...@intel.com>
wrote:

> Stephen,
> For the 3rd topic, you mentioned that "If the boolean expressions are 
> monotonous (have no NOT), then the UNKNOWN value can be the same as 
> FALSE ", as UNKNOWN means it could be TRUE as well, does it a proper 
> way to handle it just as FALSE?
>
> Aljoscha,
> I agree with you, Table can only be transformed from Tuple/Case Class 
> DataSet now, and Tuple/Case Class does not allow null field value, so 
> read files from data source to Row DataSet is necessary for NULL value handling.
>
> -----Original Message-----
> From: Aljoscha Krettek [mailto:aljoscha@apache.org]
> Sent: Friday, November 27, 2015 6:41 PM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Oh, this is probably the Jira for what I mentioned:
> https://issues.apache.org/jira/browse/FLINK-2988
>
> > On 27 Nov 2015, at 11:02, Aljoscha Krettek <al...@apache.org> wrote:
> >
> > Hi,
> > just some information. The Table API code generator already has
> preliminary support for generating code that is NULL-aware. So for 
> example if you have expressions like 1 + NULL the result would also be null.
> >
> > I think one of the missing pieces is a way to get data that contains
> null values into the system. For example, right now the expected way 
> to read csv files is via tuples and they don’t support null values. I 
> think we need a way to directly read CSV files into a Row DataSet (or Table).
> >
> > Cheers,
> > Aljoscha
> >> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
> >>
> >> Hi!
> >>
> >> Thanks for the good discussion! Here are some thoughts from my side:
> >>
> >> 1)
> >> I would agree with Chengxiang that it helps to have as much NULL 
> >> handling in the table API as possible, since most SQL constructs 
> >> will be permitted there are well.
> >>
> >> 2)
> >> A question that I have is whether we want to actually follow the 
> >> SQL standard exactly. There is a lot of criticism on NULL in the 
> >> SQL standard, and there have been many good proposals for more 
> >> meaningful semantics (for example differentiate between the 
> >> meanings "value missing", "value unknown", "value not applicable", etc).
> >>
> >> Going with the SQL way is easiest and makes SQL addition on top of 
> >> the table API much easier. Also, there is only one type of NULL, 
> >> meaning that null-values can be encoded efficiently in bitmaps.
> >> Further more, the fact that the Table API users have the power of a 
> >> programming language at hand (rather than the limited set of SQL 
> >> operators), they should be able to easily define their own 
> >> constants for special meanings like "value not applicable" or so.
> >>
> >> Just curious if anyone has experience with some of the other 
> >> null-sematic proposals that have been around.
> >>
> >> 3)
> >> One comment concerning the three-value-logic for boolean expressions:
> >>
> >> A while back, I worked on a SQL engine, and we were able to not 
> >> implement three-value logic with trick. If I recall correctly, it 
> >> was
> like this:
> >>
> >> If the boolean expressions are monotonous (have no NOT), then the 
> >> UNKNOWN value can be the same as FALSE. So the query planner had to 
> >> rewrite all expression trees to have no NOT, which means pushing 
> >> the NOT down into the leaf comparison operations (for example push 
> >> NOT into
> == to become !=).
> >> These leaf comparison operators needed to be NULL aware to return 
> >> FALSE on comparisons with NULL.
> >>
> >>
> >> Greetings,
> >> Stephan
> >>
> >>
> >> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang 
> >> <ch...@intel.com>
> >> wrote:
> >>
> >>> Thanks, Timo.
> >>> We may put the NULL related function support to SQL API, but for 
> >>> Scalar expression and Boolean expression, it already been 
> >>> supported in Table API, without NULL value handling support, query 
> >>> with Scalar expression and Boolean expression would fail while 
> >>> encounter NULL
> value.
> >>>
> >>> Thanks
> >>> Chengxiang
> >>>
> >>> -----Original Message-----
> >>> From: Timo Walther [mailto:twalthr@apache.org]
> >>> Sent: Wednesday, November 25, 2015 7:33 PM
> >>> To: dev@flink.apache.org
> >>> Subject: Re: The null in Flink
> >>>
> >>> Hi Chengxiang,
> >>>
> >>> I totally agree that the Table API should fully support NULL values.
> >>> The Table API is a logical API and therefore we should be as close 
> >>> to ANSI SQL as possible. Rows need to be nullable in the near future.
> >>>
> >>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to 
> >>> much like SQL magic. I think all other SQL magic (DBMS specific 
> >>> corner cases) should be handled by the SQL API on top of the Table API.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 25.11.2015 11:31, Li, Chengxiang wrote:
> >>>> Hi
> >>>> In this mail list, there are some discussions about null value 
> >>>> handling
> >>> in Flink, and I saw several related JIRAs as well(like FLINK-2203, 
> >>> FLINK-2210), but unfortunately, got reverted due to immature 
> >>> design, and no further action since then. I would like to pick 
> >>> this topic up here, as it's quite an important part of data 
> >>> analysis and many
> features depend on it.
> >>> Hopefully, through a plenary discussion, we can generate an 
> >>> acceptable solution and move forward. Stephan has explained very 
> >>> clearly about how and why Flink handle "Null values in the 
> >>> Programming Language APIs", so I mainly talk about the second part 
> >>> of "Null values in the high-level
> >>> (logical) APIs ".
> >>>>
> >>>> 1. Why should Flink support Null values handling in Table API?
> >>>>     i.  Data source may miss column value in many cases, if no 
> >>>> Null
> >>> values handling in Table API, user need to write an extra ETL to 
> >>> handle missing values manually.
> >>>>     ii. Some Table API operators generate Null values on their 
> >>>> own,
> >>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values 
> >>> handling in Table API is the prerequisite of these features.
> >>>>
> >>>> 2. The semantic of Null value handling in Table API.
> >>>> Fortunately, there are already mature DBMS  standards we can 
> >>>> follow for
> >>> Null value handling, I list several semantic of Null value 
> >>> handling
> here.
> >>> To be noted that, this may not cover all the cases, and the 
> >>> semantics may vary in different DBMSs, so it should totally open 
> >>> to
> discuss.
> >>>>     I,  NULL compare. In ascending order, NULL is smaller than 
> >>>> any
> >>> other value, and NULL == NULL return false.
> >>>>     ii. NULL exists in GroupBy Key, all NULL values are grouped 
> >>>> as a
> >>> single group.
> >>>>     iii. NULL exists in Aggregate columns, ignore NULL in 
> >>>> aggregation
> >>> function.
> >>>>                iv. NULL exists in both side Join key, refer to 
> >>>> #i,
> >>> NULL == NULL return false, no output for NULL Join key.
> >>>>                v.  NULL in Scalar expression, expression within
> >>> NULL(eg. 1 + NULL) return NULL.
> >>>>                vi. NULL in Boolean expression, add an extra result:
> >>> UNKNOWN, more semantic for Boolean expression in reference #1.
> >>>>                vii. More related function support, like COALESCE, 
> >>>> NVL,
> >>> NANVL, and so on.
> >>>>
> >>>> 3. NULL value storage in Table API.
> >>>>  Just set null to Row field value. To mark NULL value in 
> >>>> serialized
> >>> binary record data, normally it use extra flag for each field to 
> >>> mark whether its value is NULL, which would change the data layout 
> >>> of Row object. So any logic that access serialized Row data 
> >>> directly should updated to sync with new data layout, for example, 
> >>> many methods in RowComparator.
> >>>>
> >>>> Reference:
> >>>> 1. Nulls: Nothing to worry about:
> >>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql
> >>> -0
> >>> 97727.html
> >>> .
> >>>> 2. Null related functions:
> >>>> https://oracle-base.com/articles/misc/null-related-functions
> >>>>
> >>>> -----Original Message-----
> >>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On 
> >>>> Behalf Of Stephan Ewen
> >>>> Sent: Thursday, June 18, 2015 8:43 AM
> >>>> To: dev@flink.apache.org
> >>>> Subject: Re: The null in Flink
> >>>>
> >>>> Hi!
> >>>>
> >>>> I think we actually have two discussions here, both of them important:
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 1) Null values in the Programming Language APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> Fields in composite types may simply be null pointers.
> >>>>
> >>>> In object types:
> >>>>  - primitives members are naturally non-nullable
> >>>>  - all other members are nullable
> >>>>
> >>>> => If you want to avoid the overhead of nullability, go with 
> >>>> primitive
> >>> types.
> >>>>
> >>>> In Tuples, and derives types (Scala case classes):
> >>>>  - Fields are non-nullable.
> >>>>
> >>>> => The reason here is that we initially decided to keep tuples as 
> >>>> a very
> >>> fast data type. Because tuples cannot hold primitives in 
> >>> Java/Scala, we would not have a way to make fast non-nullable 
> >>> fields. The performance of nullable fields affects the 
> >>> key-operations, especially
> on normalized keys.
> >>>> We can work around that with some effort, but have not one it so far.
> >>>>
> >>>> => In Scala, the Option types is a natural way of elegantly 
> >>>> working
> >>> around that.
> >>>>
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 2) Null values in the high-level (logial) APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> This is mainly what Ted was referring to, if I understood him
> correctly.
> >>>>
> >>>> Here, we need to figure out what form of semantical null values 
> >>>> in the
> >>> Table API and later, in SQL.
> >>>>
> >>>> Besides deciding what semantics to follow here in the logical 
> >>>> APIs, we
> >>> need to decide what these values confert to/from when switching 
> >>> between logical/physical APIs.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning 
> >>>> <te...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels 
> >>>>> <mx...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Just to give an idea what null values could cause in Flink:
> >>>>> DataSet.count()
> >>>>>> returns the number of elements of all values in a Dataset (null 
> >>>>>> or
> >>>>>> not) while #834 would ignore null values and aggregate the 
> >>>>>> DataSet without
> >>>>> them.
> >>>>> Compare R's na.action.
> >>>>>
> >>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
> >>>>>
> >>>
> >>>
> >
>
>

Re: The null in Flink

Posted by Stephan Ewen <se...@apache.org>.

Hi Chenliang!

I have to dig into this again, it was a while back. I think (vaguely) the
reason why this worked was that in the end (at the root of a tree that is a
logical expression) if the result is UNKNOWN, it is treated like FALSE.

For example a predicate like "WHERE t.a > 10 && t.b == 'pending' ". If one
boolean atom is UNKNOWN, the other is TRUE, the whole term becomes UNKNOWN
and the row is filtered out (as if the predicate was false) - the result of
the query contains no rows where predicate results are UNKNOWN.

Stephan



On Tue, Dec 1, 2015 at 4:09 AM, Li, Chengxiang <ch...@intel.com>
wrote:

> Stephen,
> For the 3rd topic, you mentioned that "If the boolean expressions are
> monotonous (have no NOT), then the UNKNOWN value can be the same as FALSE
> ", as UNKNOWN means it could be TRUE as well, does it a proper way to
> handle it just as FALSE?
>
> Aljoscha,
> I agree with you, Table can only be transformed from Tuple/Case Class
> DataSet now, and Tuple/Case Class does not allow null field value, so read
> files from data source to Row DataSet is necessary for NULL value handling.
>
> -----Original Message-----
> From: Aljoscha Krettek [mailto:aljoscha@apache.org]
> Sent: Friday, November 27, 2015 6:41 PM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Oh, this is probably the Jira for what I mentioned:
> https://issues.apache.org/jira/browse/FLINK-2988
>
> > On 27 Nov 2015, at 11:02, Aljoscha Krettek <al...@apache.org> wrote:
> >
> > Hi,
> > just some information. The Table API code generator already has
> preliminary support for generating code that is NULL-aware. So for example
> if you have expressions like 1 + NULL the result would also be null.
> >
> > I think one of the missing pieces is a way to get data that contains
> null values into the system. For example, right now the expected way to
> read csv files is via tuples and they don’t support null values. I think we
> need a way to directly read CSV files into a Row DataSet (or Table).
> >
> > Cheers,
> > Aljoscha
> >> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
> >>
> >> Hi!
> >>
> >> Thanks for the good discussion! Here are some thoughts from my side:
> >>
> >> 1)
> >> I would agree with Chengxiang that it helps to have as much NULL
> >> handling in the table API as possible, since most SQL constructs will
> >> be permitted there are well.
> >>
> >> 2)
> >> A question that I have is whether we want to actually follow the SQL
> >> standard exactly. There is a lot of criticism on NULL in the SQL
> >> standard, and there have been many good proposals for more meaningful
> >> semantics (for example differentiate between the meanings "value
> >> missing", "value unknown", "value not applicable", etc).
> >>
> >> Going with the SQL way is easiest and makes SQL addition on top of
> >> the table API much easier. Also, there is only one type of NULL,
> >> meaning that null-values can be encoded efficiently in bitmaps.
> >> Further more, the fact that the Table API users have the power of a
> >> programming language at hand (rather than the limited set of SQL
> >> operators), they should be able to easily define their own constants
> >> for special meanings like "value not applicable" or so.
> >>
> >> Just curious if anyone has experience with some of the other
> >> null-sematic proposals that have been around.
> >>
> >> 3)
> >> One comment concerning the three-value-logic for boolean expressions:
> >>
> >> A while back, I worked on a SQL engine, and we were able to not
> >> implement three-value logic with trick. If I recall correctly, it was
> like this:
> >>
> >> If the boolean expressions are monotonous (have no NOT), then the
> >> UNKNOWN value can be the same as FALSE. So the query planner had to
> >> rewrite all expression trees to have no NOT, which means pushing the
> >> NOT down into the leaf comparison operations (for example push NOT into
> == to become !=).
> >> These leaf comparison operators needed to be NULL aware to return
> >> FALSE on comparisons with NULL.
> >>
> >>
> >> Greetings,
> >> Stephan
> >>
> >>
> >> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang
> >> <ch...@intel.com>
> >> wrote:
> >>
> >>> Thanks, Timo.
> >>> We may put the NULL related function support to SQL API, but for
> >>> Scalar expression and Boolean expression, it already been supported
> >>> in Table API, without NULL value handling support, query with Scalar
> >>> expression and Boolean expression would fail while encounter NULL
> value.
> >>>
> >>> Thanks
> >>> Chengxiang
> >>>
> >>> -----Original Message-----
> >>> From: Timo Walther [mailto:twalthr@apache.org]
> >>> Sent: Wednesday, November 25, 2015 7:33 PM
> >>> To: dev@flink.apache.org
> >>> Subject: Re: The null in Flink
> >>>
> >>> Hi Chengxiang,
> >>>
> >>> I totally agree that the Table API should fully support NULL values.
> >>> The Table API is a logical API and therefore we should be as close
> >>> to ANSI SQL as possible. Rows need to be nullable in the near future.
> >>>
> >>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to
> >>> much like SQL magic. I think all other SQL magic (DBMS specific
> >>> corner cases) should be handled by the SQL API on top of the Table API.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 25.11.2015 11:31, Li, Chengxiang wrote:
> >>>> Hi
> >>>> In this mail list, there are some discussions about null value
> >>>> handling
> >>> in Flink, and I saw several related JIRAs as well(like FLINK-2203,
> >>> FLINK-2210), but unfortunately, got reverted due to immature design,
> >>> and no further action since then. I would like to pick this topic up
> >>> here, as it's quite an important part of data analysis and many
> features depend on it.
> >>> Hopefully, through a plenary discussion, we can generate an
> >>> acceptable solution and move forward. Stephan has explained very
> >>> clearly about how and why Flink handle "Null values in the
> >>> Programming Language APIs", so I mainly talk about the second part
> >>> of "Null values in the high-level
> >>> (logical) APIs ".
> >>>>
> >>>> 1. Why should Flink support Null values handling in Table API?
> >>>>     i.  Data source may miss column value in many cases, if no Null
> >>> values handling in Table API, user need to write an extra ETL to
> >>> handle missing values manually.
> >>>>     ii. Some Table API operators generate Null values on their own,
> >>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values
> >>> handling in Table API is the prerequisite of these features.
> >>>>
> >>>> 2. The semantic of Null value handling in Table API.
> >>>> Fortunately, there are already mature DBMS  standards we can follow
> >>>> for
> >>> Null value handling, I list several semantic of Null value handling
> here.
> >>> To be noted that, this may not cover all the cases, and the
> >>> semantics may vary in different DBMSs, so it should totally open to
> discuss.
> >>>>     I,  NULL compare. In ascending order, NULL is smaller than any
> >>> other value, and NULL == NULL return false.
> >>>>     ii. NULL exists in GroupBy Key, all NULL values are grouped as
> >>>> a
> >>> single group.
> >>>>     iii. NULL exists in Aggregate columns, ignore NULL in
> >>>> aggregation
> >>> function.
> >>>>                iv. NULL exists in both side Join key, refer to #i,
> >>> NULL == NULL return false, no output for NULL Join key.
> >>>>                v.  NULL in Scalar expression, expression within
> >>> NULL(eg. 1 + NULL) return NULL.
> >>>>                vi. NULL in Boolean expression, add an extra result:
> >>> UNKNOWN, more semantic for Boolean expression in reference #1.
> >>>>                vii. More related function support, like COALESCE,
> >>>> NVL,
> >>> NANVL, and so on.
> >>>>
> >>>> 3. NULL value storage in Table API.
> >>>>  Just set null to Row field value. To mark NULL value in serialized
> >>> binary record data, normally it use extra flag for each field to
> >>> mark whether its value is NULL, which would change the data layout
> >>> of Row object. So any logic that access serialized Row data directly
> >>> should updated to sync with new data layout, for example, many
> >>> methods in RowComparator.
> >>>>
> >>>> Reference:
> >>>> 1. Nulls: Nothing to worry about:
> >>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-0
> >>> 97727.html
> >>> .
> >>>> 2. Null related functions:
> >>>> https://oracle-base.com/articles/misc/null-related-functions
> >>>>
> >>>> -----Original Message-----
> >>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On
> >>>> Behalf Of Stephan Ewen
> >>>> Sent: Thursday, June 18, 2015 8:43 AM
> >>>> To: dev@flink.apache.org
> >>>> Subject: Re: The null in Flink
> >>>>
> >>>> Hi!
> >>>>
> >>>> I think we actually have two discussions here, both of them important:
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 1) Null values in the Programming Language APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> Fields in composite types may simply be null pointers.
> >>>>
> >>>> In object types:
> >>>>  - primitives members are naturally non-nullable
> >>>>  - all other members are nullable
> >>>>
> >>>> => If you want to avoid the overhead of nullability, go with
> >>>> primitive
> >>> types.
> >>>>
> >>>> In Tuples, and derives types (Scala case classes):
> >>>>  - Fields are non-nullable.
> >>>>
> >>>> => The reason here is that we initially decided to keep tuples as a
> >>>> very
> >>> fast data type. Because tuples cannot hold primitives in Java/Scala,
> >>> we would not have a way to make fast non-nullable fields. The
> >>> performance of nullable fields affects the key-operations, especially
> on normalized keys.
> >>>> We can work around that with some effort, but have not one it so far.
> >>>>
> >>>> => In Scala, the Option types is a natural way of elegantly working
> >>> around that.
> >>>>
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 2) Null values in the high-level (logial) APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> This is mainly what Ted was referring to, if I understood him
> correctly.
> >>>>
> >>>> Here, we need to figure out what form of semantical null values in
> >>>> the
> >>> Table API and later, in SQL.
> >>>>
> >>>> Besides deciding what semantics to follow here in the logical APIs,
> >>>> we
> >>> need to decide what these values confert to/from when switching
> >>> between logical/physical APIs.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning
> >>>> <te...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels
> >>>>> <mx...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Just to give an idea what null values could cause in Flink:
> >>>>> DataSet.count()
> >>>>>> returns the number of elements of all values in a Dataset (null
> >>>>>> or
> >>>>>> not) while #834 would ignore null values and aggregate the
> >>>>>> DataSet without
> >>>>> them.
> >>>>> Compare R's na.action.
> >>>>>
> >>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
> >>>>>
> >>>
> >>>
> >
>
>

RE: The null in Flink

Posted by "Li, Chengxiang" <ch...@intel.com>.

Stephen,
For the 3rd topic, you mentioned that "If the boolean expressions are monotonous (have no NOT), then the UNKNOWN value can be the same as FALSE ", as UNKNOWN means it could be TRUE as well, does it a proper way to handle it just as FALSE?

Aljoscha, 
I agree with you, Table can only be transformed from Tuple/Case Class DataSet now, and Tuple/Case Class does not allow null field value, so read files from data source to Row DataSet is necessary for NULL value handling. 

-----Original Message-----
From: Aljoscha Krettek [mailto:aljoscha@apache.org] 
Sent: Friday, November 27, 2015 6:41 PM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Oh, this is probably the Jira for what I mentioned: https://issues.apache.org/jira/browse/FLINK-2988

> On 27 Nov 2015, at 11:02, Aljoscha Krettek <al...@apache.org> wrote:
> 
> Hi,
> just some information. The Table API code generator already has preliminary support for generating code that is NULL-aware. So for example if you have expressions like 1 + NULL the result would also be null.
> 
> I think one of the missing pieces is a way to get data that contains null values into the system. For example, right now the expected way to read csv files is via tuples and they don’t support null values. I think we need a way to directly read CSV files into a Row DataSet (or Table).
> 
> Cheers,
> Aljoscha
>> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
>> 
>> Hi!
>> 
>> Thanks for the good discussion! Here are some thoughts from my side:
>> 
>> 1)
>> I would agree with Chengxiang that it helps to have as much NULL 
>> handling in the table API as possible, since most SQL constructs will 
>> be permitted there are well.
>> 
>> 2)
>> A question that I have is whether we want to actually follow the SQL 
>> standard exactly. There is a lot of criticism on NULL in the SQL 
>> standard, and there have been many good proposals for more meaningful 
>> semantics (for example differentiate between the meanings "value 
>> missing", "value unknown", "value not applicable", etc).
>> 
>> Going with the SQL way is easiest and makes SQL addition on top of 
>> the table API much easier. Also, there is only one type of NULL, 
>> meaning that null-values can be encoded efficiently in bitmaps. 
>> Further more, the fact that the Table API users have the power of a 
>> programming language at hand (rather than the limited set of SQL 
>> operators), they should be able to easily define their own constants 
>> for special meanings like "value not applicable" or so.
>> 
>> Just curious if anyone has experience with some of the other 
>> null-sematic proposals that have been around.
>> 
>> 3)
>> One comment concerning the three-value-logic for boolean expressions:
>> 
>> A while back, I worked on a SQL engine, and we were able to not 
>> implement three-value logic with trick. If I recall correctly, it was like this:
>> 
>> If the boolean expressions are monotonous (have no NOT), then the 
>> UNKNOWN value can be the same as FALSE. So the query planner had to 
>> rewrite all expression trees to have no NOT, which means pushing the 
>> NOT down into the leaf comparison operations (for example push NOT into == to become !=).
>> These leaf comparison operators needed to be NULL aware to return 
>> FALSE on comparisons with NULL.
>> 
>> 
>> Greetings,
>> Stephan
>> 
>> 
>> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang 
>> <ch...@intel.com>
>> wrote:
>> 
>>> Thanks, Timo.
>>> We may put the NULL related function support to SQL API, but for 
>>> Scalar expression and Boolean expression, it already been supported 
>>> in Table API, without NULL value handling support, query with Scalar 
>>> expression and Boolean expression would fail while encounter NULL value.
>>> 
>>> Thanks
>>> Chengxiang
>>> 
>>> -----Original Message-----
>>> From: Timo Walther [mailto:twalthr@apache.org]
>>> Sent: Wednesday, November 25, 2015 7:33 PM
>>> To: dev@flink.apache.org
>>> Subject: Re: The null in Flink
>>> 
>>> Hi Chengxiang,
>>> 
>>> I totally agree that the Table API should fully support NULL values. 
>>> The Table API is a logical API and therefore we should be as close 
>>> to ANSI SQL as possible. Rows need to be nullable in the near future.
>>> 
>>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to 
>>> much like SQL magic. I think all other SQL magic (DBMS specific 
>>> corner cases) should be handled by the SQL API on top of the Table API.
>>> 
>>> Regards,
>>> Timo
>>> 
>>> 
>>> On 25.11.2015 11:31, Li, Chengxiang wrote:
>>>> Hi
>>>> In this mail list, there are some discussions about null value 
>>>> handling
>>> in Flink, and I saw several related JIRAs as well(like FLINK-2203, 
>>> FLINK-2210), but unfortunately, got reverted due to immature design, 
>>> and no further action since then. I would like to pick this topic up 
>>> here, as it's quite an important part of data analysis and many features depend on it.
>>> Hopefully, through a plenary discussion, we can generate an 
>>> acceptable solution and move forward. Stephan has explained very 
>>> clearly about how and why Flink handle "Null values in the 
>>> Programming Language APIs", so I mainly talk about the second part 
>>> of "Null values in the high-level
>>> (logical) APIs ".
>>>> 
>>>> 1. Why should Flink support Null values handling in Table API?
>>>>     i.  Data source may miss column value in many cases, if no Null
>>> values handling in Table API, user need to write an extra ETL to 
>>> handle missing values manually.
>>>>     ii. Some Table API operators generate Null values on their own,
>>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values 
>>> handling in Table API is the prerequisite of these features.
>>>> 
>>>> 2. The semantic of Null value handling in Table API.
>>>> Fortunately, there are already mature DBMS  standards we can follow 
>>>> for
>>> Null value handling, I list several semantic of Null value handling here.
>>> To be noted that, this may not cover all the cases, and the 
>>> semantics may vary in different DBMSs, so it should totally open to discuss.
>>>>     I,  NULL compare. In ascending order, NULL is smaller than any
>>> other value, and NULL == NULL return false.
>>>>     ii. NULL exists in GroupBy Key, all NULL values are grouped as 
>>>> a
>>> single group.
>>>>     iii. NULL exists in Aggregate columns, ignore NULL in 
>>>> aggregation
>>> function.
>>>>                iv. NULL exists in both side Join key, refer to #i,
>>> NULL == NULL return false, no output for NULL Join key.
>>>>                v.  NULL in Scalar expression, expression within
>>> NULL(eg. 1 + NULL) return NULL.
>>>>                vi. NULL in Boolean expression, add an extra result:
>>> UNKNOWN, more semantic for Boolean expression in reference #1.
>>>>                vii. More related function support, like COALESCE, 
>>>> NVL,
>>> NANVL, and so on.
>>>> 
>>>> 3. NULL value storage in Table API.
>>>>  Just set null to Row field value. To mark NULL value in serialized
>>> binary record data, normally it use extra flag for each field to 
>>> mark whether its value is NULL, which would change the data layout 
>>> of Row object. So any logic that access serialized Row data directly 
>>> should updated to sync with new data layout, for example, many 
>>> methods in RowComparator.
>>>> 
>>>> Reference:
>>>> 1. Nulls: Nothing to worry about:
>>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-0
>>> 97727.html
>>> .
>>>> 2. Null related functions:
>>>> https://oracle-base.com/articles/misc/null-related-functions
>>>> 
>>>> -----Original Message-----
>>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On 
>>>> Behalf Of Stephan Ewen
>>>> Sent: Thursday, June 18, 2015 8:43 AM
>>>> To: dev@flink.apache.org
>>>> Subject: Re: The null in Flink
>>>> 
>>>> Hi!
>>>> 
>>>> I think we actually have two discussions here, both of them important:
>>>> 
>>>> --------------------------------------------------------------
>>>> 1) Null values in the Programming Language APIs
>>>> --------------------------------------------------------------
>>>> 
>>>> Fields in composite types may simply be null pointers.
>>>> 
>>>> In object types:
>>>>  - primitives members are naturally non-nullable
>>>>  - all other members are nullable
>>>> 
>>>> => If you want to avoid the overhead of nullability, go with 
>>>> primitive
>>> types.
>>>> 
>>>> In Tuples, and derives types (Scala case classes):
>>>>  - Fields are non-nullable.
>>>> 
>>>> => The reason here is that we initially decided to keep tuples as a 
>>>> very
>>> fast data type. Because tuples cannot hold primitives in Java/Scala, 
>>> we would not have a way to make fast non-nullable fields. The 
>>> performance of nullable fields affects the key-operations, especially on normalized keys.
>>>> We can work around that with some effort, but have not one it so far.
>>>> 
>>>> => In Scala, the Option types is a natural way of elegantly working
>>> around that.
>>>> 
>>>> 
>>>> --------------------------------------------------------------
>>>> 2) Null values in the high-level (logial) APIs
>>>> --------------------------------------------------------------
>>>> 
>>>> This is mainly what Ted was referring to, if I understood him correctly.
>>>> 
>>>> Here, we need to figure out what form of semantical null values in 
>>>> the
>>> Table API and later, in SQL.
>>>> 
>>>> Besides deciding what semantics to follow here in the logical APIs, 
>>>> we
>>> need to decide what these values confert to/from when switching 
>>> between logical/physical APIs.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning 
>>>> <te...@gmail.com>
>>> wrote:
>>>> 
>>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels 
>>>>> <mx...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Just to give an idea what null values could cause in Flink:
>>>>> DataSet.count()
>>>>>> returns the number of elements of all values in a Dataset (null 
>>>>>> or
>>>>>> not) while #834 would ignore null values and aggregate the 
>>>>>> DataSet without
>>>>> them.
>>>>> Compare R's na.action.
>>>>> 
>>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>>>>> 
>>> 
>>> 
>

Re: The null in Flink

Posted by Aljoscha Krettek <al...@apache.org>.

Oh, this is probably the Jira for what I mentioned: https://issues.apache.org/jira/browse/FLINK-2988

> On 27 Nov 2015, at 11:02, Aljoscha Krettek <al...@apache.org> wrote:
> 
> Hi,
> just some information. The Table API code generator already has preliminary support for generating code that is NULL-aware. So for example if you have expressions like 1 + NULL the result would also be null.
> 
> I think one of the missing pieces is a way to get data that contains null values into the system. For example, right now the expected way to read csv files is via tuples and they don’t support null values. I think we need a way to directly read CSV files into a Row DataSet (or Table).
> 
> Cheers,
> Aljoscha
>> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
>> 
>> Hi!
>> 
>> Thanks for the good discussion! Here are some thoughts from my side:
>> 
>> 1)
>> I would agree with Chengxiang that it helps to have as much NULL handling
>> in the table API as possible, since most SQL constructs will be permitted
>> there are well.
>> 
>> 2)
>> A question that I have is whether we want to actually follow the SQL
>> standard exactly. There is a lot of criticism on NULL in the SQL standard,
>> and there have been many good proposals for more meaningful semantics (for
>> example differentiate between the meanings "value missing", "value
>> unknown", "value not applicable", etc).
>> 
>> Going with the SQL way is easiest and makes SQL addition on top of the
>> table API much easier. Also, there is only one type of NULL, meaning that
>> null-values can be encoded efficiently in bitmaps. Further more, the fact
>> that the Table API users have the power of a programming language at hand
>> (rather than the limited set of SQL operators), they should be able to
>> easily define their own constants for special meanings like "value not
>> applicable" or so.
>> 
>> Just curious if anyone has experience with some of the other null-sematic
>> proposals that have been around.
>> 
>> 3)
>> One comment concerning the three-value-logic for boolean expressions:
>> 
>> A while back, I worked on a SQL engine, and we were able to not implement
>> three-value logic with trick. If I recall correctly, it was like this:
>> 
>> If the boolean expressions are monotonous (have no NOT), then the UNKNOWN
>> value can be the same as FALSE. So the query planner had to rewrite all
>> expression trees to have no NOT, which means pushing the NOT down into the
>> leaf comparison operations (for example push NOT into == to become !=).
>> These leaf comparison operators needed to be NULL aware to return FALSE on
>> comparisons with NULL.
>> 
>> 
>> Greetings,
>> Stephan
>> 
>> 
>> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang <ch...@intel.com>
>> wrote:
>> 
>>> Thanks, Timo.
>>> We may put the NULL related function support to SQL API, but for Scalar
>>> expression and Boolean expression, it already been supported in Table API,
>>> without NULL value handling support, query with Scalar expression and
>>> Boolean expression would fail while encounter NULL value.
>>> 
>>> Thanks
>>> Chengxiang
>>> 
>>> -----Original Message-----
>>> From: Timo Walther [mailto:twalthr@apache.org]
>>> Sent: Wednesday, November 25, 2015 7:33 PM
>>> To: dev@flink.apache.org
>>> Subject: Re: The null in Flink
>>> 
>>> Hi Chengxiang,
>>> 
>>> I totally agree that the Table API should fully support NULL values. The
>>> Table API is a logical API and therefore we should be as close to ANSI SQL
>>> as possible. Rows need to be nullable in the near future.
>>> 
>>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much
>>> like SQL magic. I think all other SQL magic (DBMS specific corner cases)
>>> should be handled by the SQL API on top of the Table API.
>>> 
>>> Regards,
>>> Timo
>>> 
>>> 
>>> On 25.11.2015 11:31, Li, Chengxiang wrote:
>>>> Hi
>>>> In this mail list, there are some discussions about null value handling
>>> in Flink, and I saw several related JIRAs as well(like FLINK-2203,
>>> FLINK-2210), but unfortunately, got reverted due to immature design, and no
>>> further action since then. I would like to pick this topic up here, as it's
>>> quite an important part of data analysis and many features depend on it.
>>> Hopefully, through a plenary discussion, we can generate an acceptable
>>> solution and move forward. Stephan has explained very clearly about how and
>>> why Flink handle "Null values in the Programming Language APIs", so I
>>> mainly talk about the second part of "Null values in the high-level
>>> (logical) APIs ".
>>>> 
>>>> 1. Why should Flink support Null values handling in Table API?
>>>>     i.  Data source may miss column value in many cases, if no Null
>>> values handling in Table API, user need to write an extra ETL to handle
>>> missing values manually.
>>>>     ii. Some Table API operators generate Null values on their own,
>>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling
>>> in Table API is the prerequisite of these features.
>>>> 
>>>> 2. The semantic of Null value handling in Table API.
>>>> Fortunately, there are already mature DBMS  standards we can follow for
>>> Null value handling, I list several semantic of Null value handling here.
>>> To be noted that, this may not cover all the cases, and the semantics may
>>> vary in different DBMSs, so it should totally open to discuss.
>>>>     I,  NULL compare. In ascending order, NULL is smaller than any
>>> other value, and NULL == NULL return false.
>>>>     ii. NULL exists in GroupBy Key, all NULL values are grouped as a
>>> single group.
>>>>     iii. NULL exists in Aggregate columns, ignore NULL in aggregation
>>> function.
>>>>                iv. NULL exists in both side Join key, refer to #i,
>>> NULL == NULL return false, no output for NULL Join key.
>>>>                v.  NULL in Scalar expression, expression within
>>> NULL(eg. 1 + NULL) return NULL.
>>>>                vi. NULL in Boolean expression, add an extra result:
>>> UNKNOWN, more semantic for Boolean expression in reference #1.
>>>>                vii. More related function support, like COALESCE, NVL,
>>> NANVL, and so on.
>>>> 
>>>> 3. NULL value storage in Table API.
>>>>  Just set null to Row field value. To mark NULL value in serialized
>>> binary record data, normally it use extra flag for each field to mark
>>> whether its value is NULL, which would change the data layout of Row
>>> object. So any logic that access serialized Row data directly should
>>> updated to sync with new data layout, for example, many methods in
>>> RowComparator.
>>>> 
>>>> Reference:
>>>> 1. Nulls: Nothing to worry about:
>>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html
>>> .
>>>> 2. Null related functions:
>>>> https://oracle-base.com/articles/misc/null-related-functions
>>>> 
>>>> -----Original Message-----
>>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf
>>>> Of Stephan Ewen
>>>> Sent: Thursday, June 18, 2015 8:43 AM
>>>> To: dev@flink.apache.org
>>>> Subject: Re: The null in Flink
>>>> 
>>>> Hi!
>>>> 
>>>> I think we actually have two discussions here, both of them important:
>>>> 
>>>> --------------------------------------------------------------
>>>> 1) Null values in the Programming Language APIs
>>>> --------------------------------------------------------------
>>>> 
>>>> Fields in composite types may simply be null pointers.
>>>> 
>>>> In object types:
>>>>  - primitives members are naturally non-nullable
>>>>  - all other members are nullable
>>>> 
>>>> => If you want to avoid the overhead of nullability, go with primitive
>>> types.
>>>> 
>>>> In Tuples, and derives types (Scala case classes):
>>>>  - Fields are non-nullable.
>>>> 
>>>> => The reason here is that we initially decided to keep tuples as a very
>>> fast data type. Because tuples cannot hold primitives in Java/Scala, we
>>> would not have a way to make fast non-nullable fields. The performance of
>>> nullable fields affects the key-operations, especially on normalized keys.
>>>> We can work around that with some effort, but have not one it so far.
>>>> 
>>>> => In Scala, the Option types is a natural way of elegantly working
>>> around that.
>>>> 
>>>> 
>>>> --------------------------------------------------------------
>>>> 2) Null values in the high-level (logial) APIs
>>>> --------------------------------------------------------------
>>>> 
>>>> This is mainly what Ted was referring to, if I understood him correctly.
>>>> 
>>>> Here, we need to figure out what form of semantical null values in the
>>> Table API and later, in SQL.
>>>> 
>>>> Besides deciding what semantics to follow here in the logical APIs, we
>>> need to decide what these values confert to/from when switching between
>>> logical/physical APIs.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>> 
>>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Just to give an idea what null values could cause in Flink:
>>>>> DataSet.count()
>>>>>> returns the number of elements of all values in a Dataset (null or
>>>>>> not) while #834 would ignore null values and aggregate the DataSet
>>>>>> without
>>>>> them.
>>>>> Compare R's na.action.
>>>>> 
>>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>>>>> 
>>> 
>>> 
>

Re: The null in Flink

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,
just some information. The Table API code generator already has preliminary support for generating code that is NULL-aware. So for example if you have expressions like 1 + NULL the result would also be null.

I think one of the missing pieces is a way to get data that contains null values into the system. For example, right now the expected way to read csv files is via tuples and they don’t support null values. I think we need a way to directly read CSV files into a Row DataSet (or Table).

Cheers,
Aljoscha
> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi!
> 
> Thanks for the good discussion! Here are some thoughts from my side:
> 
> 1)
> I would agree with Chengxiang that it helps to have as much NULL handling
> in the table API as possible, since most SQL constructs will be permitted
> there are well.
> 
> 2)
> A question that I have is whether we want to actually follow the SQL
> standard exactly. There is a lot of criticism on NULL in the SQL standard,
> and there have been many good proposals for more meaningful semantics (for
> example differentiate between the meanings "value missing", "value
> unknown", "value not applicable", etc).
> 
> Going with the SQL way is easiest and makes SQL addition on top of the
> table API much easier. Also, there is only one type of NULL, meaning that
> null-values can be encoded efficiently in bitmaps. Further more, the fact
> that the Table API users have the power of a programming language at hand
> (rather than the limited set of SQL operators), they should be able to
> easily define their own constants for special meanings like "value not
> applicable" or so.
> 
> Just curious if anyone has experience with some of the other null-sematic
> proposals that have been around.
> 
> 3)
> One comment concerning the three-value-logic for boolean expressions:
> 
> A while back, I worked on a SQL engine, and we were able to not implement
> three-value logic with trick. If I recall correctly, it was like this:
> 
> If the boolean expressions are monotonous (have no NOT), then the UNKNOWN
> value can be the same as FALSE. So the query planner had to rewrite all
> expression trees to have no NOT, which means pushing the NOT down into the
> leaf comparison operations (for example push NOT into == to become !=).
> These leaf comparison operators needed to be NULL aware to return FALSE on
> comparisons with NULL.
> 
> 
> Greetings,
> Stephan
> 
> 
> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang <ch...@intel.com>
> wrote:
> 
>> Thanks, Timo.
>> We may put the NULL related function support to SQL API, but for Scalar
>> expression and Boolean expression, it already been supported in Table API,
>> without NULL value handling support, query with Scalar expression and
>> Boolean expression would fail while encounter NULL value.
>> 
>> Thanks
>> Chengxiang
>> 
>> -----Original Message-----
>> From: Timo Walther [mailto:twalthr@apache.org]
>> Sent: Wednesday, November 25, 2015 7:33 PM
>> To: dev@flink.apache.org
>> Subject: Re: The null in Flink
>> 
>> Hi Chengxiang,
>> 
>> I totally agree that the Table API should fully support NULL values. The
>> Table API is a logical API and therefore we should be as close to ANSI SQL
>> as possible. Rows need to be nullable in the near future.
>> 
>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much
>> like SQL magic. I think all other SQL magic (DBMS specific corner cases)
>> should be handled by the SQL API on top of the Table API.
>> 
>> Regards,
>> Timo
>> 
>> 
>> On 25.11.2015 11:31, Li, Chengxiang wrote:
>>> Hi
>>> In this mail list, there are some discussions about null value handling
>> in Flink, and I saw several related JIRAs as well(like FLINK-2203,
>> FLINK-2210), but unfortunately, got reverted due to immature design, and no
>> further action since then. I would like to pick this topic up here, as it's
>> quite an important part of data analysis and many features depend on it.
>> Hopefully, through a plenary discussion, we can generate an acceptable
>> solution and move forward. Stephan has explained very clearly about how and
>> why Flink handle "Null values in the Programming Language APIs", so I
>> mainly talk about the second part of "Null values in the high-level
>> (logical) APIs ".
>>> 
>>> 1. Why should Flink support Null values handling in Table API?
>>>      i.  Data source may miss column value in many cases, if no Null
>> values handling in Table API, user need to write an extra ETL to handle
>> missing values manually.
>>>      ii. Some Table API operators generate Null values on their own,
>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling
>> in Table API is the prerequisite of these features.
>>> 
>>> 2. The semantic of Null value handling in Table API.
>>> Fortunately, there are already mature DBMS  standards we can follow for
>> Null value handling, I list several semantic of Null value handling here.
>> To be noted that, this may not cover all the cases, and the semantics may
>> vary in different DBMSs, so it should totally open to discuss.
>>>      I,  NULL compare. In ascending order, NULL is smaller than any
>> other value, and NULL == NULL return false.
>>>      ii. NULL exists in GroupBy Key, all NULL values are grouped as a
>> single group.
>>>      iii. NULL exists in Aggregate columns, ignore NULL in aggregation
>> function.
>>>                 iv. NULL exists in both side Join key, refer to #i,
>> NULL == NULL return false, no output for NULL Join key.
>>>                 v.  NULL in Scalar expression, expression within
>> NULL(eg. 1 + NULL) return NULL.
>>>                 vi. NULL in Boolean expression, add an extra result:
>> UNKNOWN, more semantic for Boolean expression in reference #1.
>>>                 vii. More related function support, like COALESCE, NVL,
>> NANVL, and so on.
>>> 
>>> 3. NULL value storage in Table API.
>>>   Just set null to Row field value. To mark NULL value in serialized
>> binary record data, normally it use extra flag for each field to mark
>> whether its value is NULL, which would change the data layout of Row
>> object. So any logic that access serialized Row data directly should
>> updated to sync with new data layout, for example, many methods in
>> RowComparator.
>>> 
>>> Reference:
>>> 1. Nulls: Nothing to worry about:
>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html
>> .
>>> 2. Null related functions:
>>> https://oracle-base.com/articles/misc/null-related-functions
>>> 
>>> -----Original Message-----
>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf
>>> Of Stephan Ewen
>>> Sent: Thursday, June 18, 2015 8:43 AM
>>> To: dev@flink.apache.org
>>> Subject: Re: The null in Flink
>>> 
>>> Hi!
>>> 
>>> I think we actually have two discussions here, both of them important:
>>> 
>>> --------------------------------------------------------------
>>> 1) Null values in the Programming Language APIs
>>> --------------------------------------------------------------
>>> 
>>> Fields in composite types may simply be null pointers.
>>> 
>>> In object types:
>>>   - primitives members are naturally non-nullable
>>>   - all other members are nullable
>>> 
>>> => If you want to avoid the overhead of nullability, go with primitive
>> types.
>>> 
>>> In Tuples, and derives types (Scala case classes):
>>>   - Fields are non-nullable.
>>> 
>>> => The reason here is that we initially decided to keep tuples as a very
>> fast data type. Because tuples cannot hold primitives in Java/Scala, we
>> would not have a way to make fast non-nullable fields. The performance of
>> nullable fields affects the key-operations, especially on normalized keys.
>>> We can work around that with some effort, but have not one it so far.
>>> 
>>> => In Scala, the Option types is a natural way of elegantly working
>> around that.
>>> 
>>> 
>>> --------------------------------------------------------------
>>> 2) Null values in the high-level (logial) APIs
>>> --------------------------------------------------------------
>>> 
>>> This is mainly what Ted was referring to, if I understood him correctly.
>>> 
>>> Here, we need to figure out what form of semantical null values in the
>> Table API and later, in SQL.
>>> 
>>> Besides deciding what semantics to follow here in the logical APIs, we
>> need to decide what these values confert to/from when switching between
>> logical/physical APIs.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>> 
>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
>>>> wrote:
>>>> 
>>>>> Just to give an idea what null values could cause in Flink:
>>>> DataSet.count()
>>>>> returns the number of elements of all values in a Dataset (null or
>>>>> not) while #834 would ignore null values and aggregate the DataSet
>>>>> without
>>>> them.
>>>> Compare R's na.action.
>>>> 
>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>>>> 
>> 
>>

Re: The null in Flink

Posted by Stephan Ewen <se...@apache.org>.

Hi!

Thanks for the good discussion! Here are some thoughts from my side:

1)
I would agree with Chengxiang that it helps to have as much NULL handling
in the table API as possible, since most SQL constructs will be permitted
there are well.

2)
A question that I have is whether we want to actually follow the SQL
standard exactly. There is a lot of criticism on NULL in the SQL standard,
and there have been many good proposals for more meaningful semantics (for
example differentiate between the meanings "value missing", "value
unknown", "value not applicable", etc).

Going with the SQL way is easiest and makes SQL addition on top of the
table API much easier. Also, there is only one type of NULL, meaning that
null-values can be encoded efficiently in bitmaps. Further more, the fact
that the Table API users have the power of a programming language at hand
(rather than the limited set of SQL operators), they should be able to
easily define their own constants for special meanings like "value not
applicable" or so.

Just curious if anyone has experience with some of the other null-sematic
proposals that have been around.

3)
One comment concerning the three-value-logic for boolean expressions:

A while back, I worked on a SQL engine, and we were able to not implement
three-value logic with trick. If I recall correctly, it was like this:

If the boolean expressions are monotonous (have no NOT), then the UNKNOWN
value can be the same as FALSE. So the query planner had to rewrite all
expression trees to have no NOT, which means pushing the NOT down into the
leaf comparison operations (for example push NOT into == to become !=).
These leaf comparison operators needed to be NULL aware to return FALSE on
comparisons with NULL.


Greetings,
Stephan


On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang <ch...@intel.com>
wrote:

> Thanks, Timo.
> We may put the NULL related function support to SQL API, but for Scalar
> expression and Boolean expression, it already been supported in Table API,
> without NULL value handling support, query with Scalar expression and
> Boolean expression would fail while encounter NULL value.
>
> Thanks
> Chengxiang
>
> -----Original Message-----
> From: Timo Walther [mailto:twalthr@apache.org]
> Sent: Wednesday, November 25, 2015 7:33 PM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Hi Chengxiang,
>
> I totally agree that the Table API should fully support NULL values. The
> Table API is a logical API and therefore we should be as close to ANSI SQL
> as possible. Rows need to be nullable in the near future.
>
> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much
> like SQL magic. I think all other SQL magic (DBMS specific corner cases)
> should be handled by the SQL API on top of the Table API.
>
> Regards,
> Timo
>
>
> On 25.11.2015 11:31, Li, Chengxiang wrote:
> > Hi
> > In this mail list, there are some discussions about null value handling
> in Flink, and I saw several related JIRAs as well(like FLINK-2203,
> FLINK-2210), but unfortunately, got reverted due to immature design, and no
> further action since then. I would like to pick this topic up here, as it's
> quite an important part of data analysis and many features depend on it.
> Hopefully, through a plenary discussion, we can generate an acceptable
> solution and move forward. Stephan has explained very clearly about how and
> why Flink handle "Null values in the Programming Language APIs", so I
> mainly talk about the second part of "Null values in the high-level
> (logical) APIs ".
> >
> > 1. Why should Flink support Null values handling in Table API?
> >       i.  Data source may miss column value in many cases, if no Null
> values handling in Table API, user need to write an extra ETL to handle
> missing values manually.
> >       ii. Some Table API operators generate Null values on their own,
> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling
> in Table API is the prerequisite of these features.
> >
> > 2. The semantic of Null value handling in Table API.
> > Fortunately, there are already mature DBMS  standards we can follow for
> Null value handling, I list several semantic of Null value handling here.
> To be noted that, this may not cover all the cases, and the semantics may
> vary in different DBMSs, so it should totally open to discuss.
> >       I,  NULL compare. In ascending order, NULL is smaller than any
> other value, and NULL == NULL return false.
> >       ii. NULL exists in GroupBy Key, all NULL values are grouped as a
> single group.
> >       iii. NULL exists in Aggregate columns, ignore NULL in aggregation
> function.
> >                  iv. NULL exists in both side Join key, refer to #i,
> NULL == NULL return false, no output for NULL Join key.
> >                  v.  NULL in Scalar expression, expression within
> NULL(eg. 1 + NULL) return NULL.
> >                  vi. NULL in Boolean expression, add an extra result:
> UNKNOWN, more semantic for Boolean expression in reference #1.
> >                  vii. More related function support, like COALESCE, NVL,
> NANVL, and so on.
> >
> > 3. NULL value storage in Table API.
> >    Just set null to Row field value. To mark NULL value in serialized
> binary record data, normally it use extra flag for each field to mark
> whether its value is NULL, which would change the data layout of Row
> object. So any logic that access serialized Row data directly should
> updated to sync with new data layout, for example, many methods in
> RowComparator.
> >
> > Reference:
> > 1. Nulls: Nothing to worry about:
> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html
> .
> > 2. Null related functions:
> > https://oracle-base.com/articles/misc/null-related-functions
> >
> > -----Original Message-----
> > From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf
> > Of Stephan Ewen
> > Sent: Thursday, June 18, 2015 8:43 AM
> > To: dev@flink.apache.org
> > Subject: Re: The null in Flink
> >
> > Hi!
> >
> > I think we actually have two discussions here, both of them important:
> >
> > --------------------------------------------------------------
> > 1) Null values in the Programming Language APIs
> > --------------------------------------------------------------
> >
> > Fields in composite types may simply be null pointers.
> >
> > In object types:
> >    - primitives members are naturally non-nullable
> >    - all other members are nullable
> >
> > => If you want to avoid the overhead of nullability, go with primitive
> types.
> >
> > In Tuples, and derives types (Scala case classes):
> >    - Fields are non-nullable.
> >
> > => The reason here is that we initially decided to keep tuples as a very
> fast data type. Because tuples cannot hold primitives in Java/Scala, we
> would not have a way to make fast non-nullable fields. The performance of
> nullable fields affects the key-operations, especially on normalized keys.
> > We can work around that with some effort, but have not one it so far.
> >
> > => In Scala, the Option types is a natural way of elegantly working
> around that.
> >
> >
> > --------------------------------------------------------------
> > 2) Null values in the high-level (logial) APIs
> > --------------------------------------------------------------
> >
> > This is mainly what Ted was referring to, if I understood him correctly.
> >
> > Here, we need to figure out what form of semantical null values in the
> Table API and later, in SQL.
> >
> > Besides deciding what semantics to follow here in the logical APIs, we
> need to decide what these values confert to/from when switching between
> logical/physical APIs.
> >
> >
> >
> >
> >
> >
> > On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
> >> wrote:
> >>
> >>> Just to give an idea what null values could cause in Flink:
> >> DataSet.count()
> >>> returns the number of elements of all values in a Dataset (null or
> >>> not) while #834 would ignore null values and aggregate the DataSet
> >>> without
> >> them.
> >> Compare R's na.action.
> >>
> >> http://www.ats.ucla.edu/stat/r/faq/missing.htm
> >>
>
>

RE: The null in Flink

Posted by "Li, Chengxiang" <ch...@intel.com>.

Thanks, Timo. 
We may put the NULL related function support to SQL API, but for Scalar expression and Boolean expression, it already been supported in Table API, without NULL value handling support, query with Scalar expression and Boolean expression would fail while encounter NULL value.

Thanks
Chengxiang 

-----Original Message-----
From: Timo Walther [mailto:twalthr@apache.org] 
Sent: Wednesday, November 25, 2015 7:33 PM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi Chengxiang,

I totally agree that the Table API should fully support NULL values. The Table API is a logical API and therefore we should be as close to ANSI SQL as possible. Rows need to be nullable in the near future.

2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much like SQL magic. I think all other SQL magic (DBMS specific corner cases) should be handled by the SQL API on top of the Table API.

Regards,
Timo


On 25.11.2015 11:31, Li, Chengxiang wrote:
> Hi
> In this mail list, there are some discussions about null value handling in Flink, and I saw several related JIRAs as well(like FLINK-2203, FLINK-2210), but unfortunately, got reverted due to immature design, and no further action since then. I would like to pick this topic up here, as it's quite an important part of data analysis and many features depend on it. Hopefully, through a plenary discussion, we can generate an acceptable solution and move forward. Stephan has explained very clearly about how and why Flink handle "Null values in the Programming Language APIs", so I mainly talk about the second part of "Null values in the high-level (logical) APIs ".
>
> 1. Why should Flink support Null values handling in Table API?
> 	i.  Data source may miss column value in many cases, if no Null values handling in Table API, user need to write an extra ETL to handle missing values manually.
> 	ii. Some Table API operators generate Null values on their own, like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling in Table API is the prerequisite of these features.
>
> 2. The semantic of Null value handling in Table API.
> Fortunately, there are already mature DBMS  standards we can follow for Null value handling, I list several semantic of Null value handling here. To be noted that, this may not cover all the cases, and the semantics may vary in different DBMSs, so it should totally open to discuss.
> 	I,  NULL compare. In ascending order, NULL is smaller than any other value, and NULL == NULL return false.
> 	ii. NULL exists in GroupBy Key, all NULL values are grouped as a single group.
> 	iii. NULL exists in Aggregate columns, ignore NULL in aggregation function.
>                  iv. NULL exists in both side Join key, refer to #i, NULL == NULL return false, no output for NULL Join key.
>                  v.  NULL in Scalar expression, expression within NULL(eg. 1 + NULL) return NULL.
>                  vi. NULL in Boolean expression, add an extra result: UNKNOWN, more semantic for Boolean expression in reference #1.
>                  vii. More related function support, like COALESCE, NVL, NANVL, and so on.
>
> 3. NULL value storage in Table API.
>    Just set null to Row field value. To mark NULL value in serialized binary record data, normally it use extra flag for each field to mark whether its value is NULL, which would change the data layout of Row object. So any logic that access serialized Row data directly should updated to sync with new data layout, for example, many methods in RowComparator.
>
> Reference:
> 1. Nulls: Nothing to worry about: http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html.
> 2. Null related functions: 
> https://oracle-base.com/articles/misc/null-related-functions
>
> -----Original Message-----
> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf 
> Of Stephan Ewen
> Sent: Thursday, June 18, 2015 8:43 AM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Hi!
>
> I think we actually have two discussions here, both of them important:
>
> --------------------------------------------------------------
> 1) Null values in the Programming Language APIs
> --------------------------------------------------------------
>
> Fields in composite types may simply be null pointers.
>
> In object types:
>    - primitives members are naturally non-nullable
>    - all other members are nullable
>
> => If you want to avoid the overhead of nullability, go with primitive types.
>
> In Tuples, and derives types (Scala case classes):
>    - Fields are non-nullable.
>
> => The reason here is that we initially decided to keep tuples as a very fast data type. Because tuples cannot hold primitives in Java/Scala, we would not have a way to make fast non-nullable fields. The performance of nullable fields affects the key-operations, especially on normalized keys.
> We can work around that with some effort, but have not one it so far.
>
> => In Scala, the Option types is a natural way of elegantly working around that.
>
>
> --------------------------------------------------------------
> 2) Null values in the high-level (logial) APIs
> --------------------------------------------------------------
>
> This is mainly what Ted was referring to, if I understood him correctly.
>
> Here, we need to figure out what form of semantical null values in the Table API and later, in SQL.
>
> Besides deciding what semantics to follow here in the logical APIs, we need to decide what these values confert to/from when switching between logical/physical APIs.
>
>
>
>
>
>
> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>>> Just to give an idea what null values could cause in Flink:
>> DataSet.count()
>>> returns the number of elements of all values in a Dataset (null or
>>> not) while #834 would ignore null values and aggregate the DataSet 
>>> without
>> them.
>> Compare R's na.action.
>>
>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>>

Re: The null in Flink

Posted by Timo Walther <tw...@apache.org>.

Hi Chengxiang,

I totally agree that the Table API should fully support NULL values. The 
Table API is a logical API and therefore we should be as close to ANSI 
SQL as possible. Rows need to be nullable in the near future.

2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much 
like SQL magic. I think all other SQL magic (DBMS specific corner cases) 
should be handled by the SQL API on top of the Table API.

Regards,
Timo


On 25.11.2015 11:31, Li, Chengxiang wrote:
> Hi
> In this mail list, there are some discussions about null value handling in Flink, and I saw several related JIRAs as well(like FLINK-2203, FLINK-2210), but unfortunately, got reverted due to immature design, and no further action since then. I would like to pick this topic up here, as it's quite an important part of data analysis and many features depend on it. Hopefully, through a plenary discussion, we can generate an acceptable solution and move forward. Stephan has explained very clearly about how and why Flink handle "Null values in the Programming Language APIs", so I mainly talk about the second part of "Null values in the high-level (logical) APIs ".
>
> 1. Why should Flink support Null values handling in Table API?
> 	i.  Data source may miss column value in many cases, if no Null values handling in Table API, user need to write an extra ETL to handle missing values manually.
> 	ii. Some Table API operators generate Null values on their own, like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling in Table API is the prerequisite of these features.
>
> 2. The semantic of Null value handling in Table API.
> Fortunately, there are already mature DBMS  standards we can follow for Null value handling, I list several semantic of Null value handling here. To be noted that, this may not cover all the cases, and the semantics may vary in different DBMSs, so it should totally open to discuss.
> 	I,  NULL compare. In ascending order, NULL is smaller than any other value, and NULL == NULL return false.
> 	ii. NULL exists in GroupBy Key, all NULL values are grouped as a single group.
> 	iii. NULL exists in Aggregate columns, ignore NULL in aggregation function.
>                  iv. NULL exists in both side Join key, refer to #i, NULL == NULL return false, no output for NULL Join key.
>                  v.  NULL in Scalar expression, expression within NULL(eg. 1 + NULL) return NULL.
>                  vi. NULL in Boolean expression, add an extra result: UNKNOWN, more semantic for Boolean expression in reference #1.
>                  vii. More related function support, like COALESCE, NVL, NANVL, and so on.
>
> 3. NULL value storage in Table API.
>    Just set null to Row field value. To mark NULL value in serialized binary record data, normally it use extra flag for each field to mark whether its value is NULL, which would change the data layout of Row object. So any logic that access serialized Row data directly should updated to sync with new data layout, for example, many methods in RowComparator.
>
> Reference:
> 1. Nulls: Nothing to worry about: http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html.
> 2. Null related functions: https://oracle-base.com/articles/misc/null-related-functions
>
> -----Original Message-----
> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of Stephan Ewen
> Sent: Thursday, June 18, 2015 8:43 AM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Hi!
>
> I think we actually have two discussions here, both of them important:
>
> --------------------------------------------------------------
> 1) Null values in the Programming Language APIs
> --------------------------------------------------------------
>
> Fields in composite types may simply be null pointers.
>
> In object types:
>    - primitives members are naturally non-nullable
>    - all other members are nullable
>
> => If you want to avoid the overhead of nullability, go with primitive types.
>
> In Tuples, and derives types (Scala case classes):
>    - Fields are non-nullable.
>
> => The reason here is that we initially decided to keep tuples as a very fast data type. Because tuples cannot hold primitives in Java/Scala, we would not have a way to make fast non-nullable fields. The performance of nullable fields affects the key-operations, especially on normalized keys.
> We can work around that with some effort, but have not one it so far.
>
> => In Scala, the Option types is a natural way of elegantly working around that.
>
>
> --------------------------------------------------------------
> 2) Null values in the high-level (logial) APIs
> --------------------------------------------------------------
>
> This is mainly what Ted was referring to, if I understood him correctly.
>
> Here, we need to figure out what form of semantical null values in the Table API and later, in SQL.
>
> Besides deciding what semantics to follow here in the logical APIs, we need to decide what these values confert to/from when switching between logical/physical APIs.
>
>
>
>
>
>
> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>>> Just to give an idea what null values could cause in Flink:
>> DataSet.count()
>>> returns the number of elements of all values in a Dataset (null or
>>> not) while #834 would ignore null values and aggregate the DataSet
>>> without
>> them.
>> Compare R's na.action.
>>
>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>>

RE: The null in Flink

Posted by "Li, Chengxiang" <ch...@intel.com>.

Hi
In this mail list, there are some discussions about null value handling in Flink, and I saw several related JIRAs as well(like FLINK-2203, FLINK-2210), but unfortunately, got reverted due to immature design, and no further action since then. I would like to pick this topic up here, as it's quite an important part of data analysis and many features depend on it. Hopefully, through a plenary discussion, we can generate an acceptable solution and move forward. Stephan has explained very clearly about how and why Flink handle "Null values in the Programming Language APIs", so I mainly talk about the second part of "Null values in the high-level (logical) APIs ".

1. Why should Flink support Null values handling in Table API?
	i.  Data source may miss column value in many cases, if no Null values handling in Table API, user need to write an extra ETL to handle missing values manually.
	ii. Some Table API operators generate Null values on their own, like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling in Table API is the prerequisite of these features.

2. The semantic of Null value handling in Table API.
Fortunately, there are already mature DBMS  standards we can follow for Null value handling, I list several semantic of Null value handling here. To be noted that, this may not cover all the cases, and the semantics may vary in different DBMSs, so it should totally open to discuss.
	I,  NULL compare. In ascending order, NULL is smaller than any other value, and NULL == NULL return false. 
	ii. NULL exists in GroupBy Key, all NULL values are grouped as a single group.
	iii. NULL exists in Aggregate columns, ignore NULL in aggregation function.
                iv. NULL exists in both side Join key, refer to #i, NULL == NULL return false, no output for NULL Join key.
                v.  NULL in Scalar expression, expression within NULL(eg. 1 + NULL) return NULL. 
                vi. NULL in Boolean expression, add an extra result: UNKNOWN, more semantic for Boolean expression in reference #1.
                vii. More related function support, like COALESCE, NVL, NANVL, and so on.

3. NULL value storage in Table API.
  Just set null to Row field value. To mark NULL value in serialized binary record data, normally it use extra flag for each field to mark whether its value is NULL, which would change the data layout of Row object. So any logic that access serialized Row data directly should updated to sync with new data layout, for example, many methods in RowComparator.

Reference:
1. Nulls: Nothing to worry about: http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html.
2. Null related functions: https://oracle-base.com/articles/misc/null-related-functions

-----Original Message-----
From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of Stephan Ewen
Sent: Thursday, June 18, 2015 8:43 AM
To: dev@flink.apache.org
Subject: Re: The null in Flink

Hi!

I think we actually have two discussions here, both of them important:

--------------------------------------------------------------
1) Null values in the Programming Language APIs
--------------------------------------------------------------

Fields in composite types may simply be null pointers.

In object types:
  - primitives members are naturally non-nullable
  - all other members are nullable

=> If you want to avoid the overhead of nullability, go with primitive types.

In Tuples, and derives types (Scala case classes):
  - Fields are non-nullable.

=> The reason here is that we initially decided to keep tuples as a very fast data type. Because tuples cannot hold primitives in Java/Scala, we would not have a way to make fast non-nullable fields. The performance of nullable fields affects the key-operations, especially on normalized keys.
We can work around that with some effort, but have not one it so far.

=> In Scala, the Option types is a natural way of elegantly working around that.

--------------------------------------------------------------
2) Null values in the high-level (logial) APIs
--------------------------------------------------------------

This is mainly what Ted was referring to, if I understood him correctly.

Here, we need to figure out what form of semantical null values in the Table API and later, in SQL.

Besides deciding what semantics to follow here in the logical APIs, we need to decide what these values confert to/from when switching between logical/physical APIs.

On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
> wrote:
>
> > Just to give an idea what null values could cause in Flink:
> DataSet.count()
> > returns the number of elements of all values in a Dataset (null or 
> > not) while #834 would ignore null values and aggregate the DataSet 
> > without
> them.
> >
>
> Compare R's na.action.
>
> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>

Re: The null in Flink

Posted by Stephan Ewen <se...@apache.org>.

Hi!

I think we actually have two discussions here, both of them important:

--------------------------------------------------------------
1) Null values in the Programming Language APIs
--------------------------------------------------------------

Fields in composite types may simply be null pointers.

In object types:
  - primitives members are naturally non-nullable
  - all other members are nullable

=> If you want to avoid the overhead of nullability, go with primitive
types.

In Tuples, and derives types (Scala case classes):
  - Fields are non-nullable.

=> The reason here is that we initially decided to keep tuples as a very
fast data type. Because tuples cannot hold primitives in Java/Scala, we
would not have a way to make fast non-nullable fields. The performance of
nullable fields affects the key-operations, especially on normalized keys.
We can work around that with some effort, but have not one it so far.

=> In Scala, the Option types is a natural way of elegantly working around
that.

--------------------------------------------------------------
2) Null values in the high-level (logial) APIs
--------------------------------------------------------------

This is mainly what Ted was referring to, if I understood him correctly.

Here, we need to figure out what form of semantical null values in the
Table API and later, in SQL.

Besides deciding what semantics to follow here in the logical APIs, we need
to decide what these values confert to/from when switching between
logical/physical APIs.

On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org>
> wrote:
>
> > Just to give an idea what null values could cause in Flink:
> DataSet.count()
> > returns the number of elements of all values in a Dataset (null or not)
> > while #834 would ignore null values and aggregate the DataSet without
> them.
> >
>
> Compare R's na.action.
>
> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>

Re: The null in Flink

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org> wrote:

> Just to give an idea what null values could cause in Flink: DataSet.count()
> returns the number of elements of all values in a Dataset (null or not)
> while #834 would ignore null values and aggregate the DataSet without them.
>

Compare R's na.action.

http://www.ats.ucla.edu/stat/r/faq/missing.htm

Re: The null in Flink

Posted by Ted Dunning <te...@gmail.com>.

The example of SQL is obviously dominating thoughts of NULL, but I think
that the example of R is probably better in terms of how things can work
fairly well.  NULL is a key concept and very helpful in a number of
settings.  With R's fairly simple functional nature it is easy to filter
data and most functions have options to adjust treatment of null.  R also
has the value NA which is intended for the representation of missing
values.  R also supports NaN (not-a-number) for the result of undefined
numerical operations.

My own thought is based on the fact that I am rarely surprised by NULL, NA
and NaN in R and often surprised by NULL in SQL.  I don't have strong
conclusions from that other than to think that R did something right and
SQL did something wrong (from my point of view).

None of this matters if compatibility with SQL is the primary requirement.
In that case, I say just do it.

On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <mx...@apache.org> wrote:

> Hi everyone,
>
> I'm seeing a lot of null value related pull requests nowadays, like these:
>
> https://github.com/apache/flink/pull/780
> https://github.com/apache/flink/pull/831
> https://github.com/apache/flink/pull/834
>
> It used to be the case that null values were simply not supported by Flink.
> Recently, Flink supports null values for some components. Now I'm wondering
> what the current state of null values in Flink is. While ignoring null
> values might be a good for not crashing your programs, null values are
> generally a bad way of signaling empty values for which better strategies
> are available. My intuition would be that it is a bit evil to support them
> in DataSets.
>
> Just to give an idea what null values could cause in Flink: DataSet.count()
> returns the number of elements of all values in a Dataset (null or not)
> while #834 would ignore null values and aggregate the DataSet without them.
>
> Best,
> Max
>