You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iceberg.apache.org by Shardul Mahadik <sh...@gmail.com> on 2020/09/17 21:08:32 UTC

SQL compatibility of Iceberg Expressions

Hi all,

I noticed that Iceberg's predicates are not compatible with SQL predicates
when it comes to handling NULL values. In SQL, if any of the operands of a
scalar comparison predicate is NULL, then the resultant truth value of the
predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a NULL in SQL and
not FALSE. If such predicates are used as filters, the resultant output
will be different for Iceberg v/s SQL. e.g. `.filter(notEqual(column,
'x'))` in Iceberg will return rows excluding ‘x’ but including NULL. The
same thing in Presto SQL `WHERE column != 'x'` will return rows excluding
both ‘x’ and NULL. So essentially, Iceberg can return more rows than
required when an engine pushes down these predicates, however the engines
will filter out these additional rows, so everything seems good. But
modules like iceberg-data and iceberg-mr which rely solely on Iceberg's
expression evaluators for filtering will return the additional rows. Should
we change the behavior of Iceberg expressions to be more SQL-like or should
we keep this behavior and document the differences when compared with SQL?

This also has some implications on predicate pushdown e.g. ORC follows SQL
semantics and if we try to push down Iceberg predicates, simply converting
Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient as it does
not return NULLs contrary to what Iceberg expects.

Thanks,
Shardul

Re: SQL compatibility of Iceberg Expressions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

If we can translate to an equivalent expression, then I think it is better
to do that. Iceberg expressions don't support SQL semantics and it would be
a big project to convert them. Afterward, we would just have more
complicated expressions that are harder to maintain because SQL semantics
are counter-intuitive. If we can translate into expressions that are easier
to maintain and have understandable semantics, then I think we will be
better off in the long term. And we won't have to rewrite so much and find
a work around for the notEqual <-> not(equal) rewrite problem.

We should still make it easy to choose the SQL semantics with new factory
methods. I think we will need `sqlNotNull`, `sqlIn`, and `isDistinctFrom`.
Any others?

For ORC, we will want to rewrite. But for Parquet we don't currently push
any filters down through the Parquet API. Instead we have our own filter
implementations in Iceberg that use Iceberg expressions directly.

On Fri, Sep 18, 2020 at 4:38 PM Owen O'Malley <ow...@gmail.com>
wrote:

> No, you can translate these expressions, but you have to evaluate the
> entire expression. For example:
>
> "col1 = 'x' and col2 in (1,2)" becomes col1 = 'x' and col2 in (1,2)
> "not(col1 = 'x' and col2 in (1,2))" becomes (col1 != 'x' or col2 not in
> (1,2)) and col1 is not null and col2 is not null
>
> Furthermore, ORC does (and Parquet should) already use these semantics.
> Therefore, you'll end up translating on both sides:
>
> hive        -\
> presto      --+-> Iceberg -+--> ORC
> spark sql   -/              \-> Parquet
>
> Since the non-sql use cases have fewer pushdown predicates, having a
> translation on that side seems less error-prone.
>
> .. Owen
>
>
> On Fri, Sep 18, 2020 at 10:54 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Are you saying that we can't fix this by rewriting expressions to
>> translate from SQL to more natural semantics?
>>
>> On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <ow...@gmail.com>
>> wrote:
>>
>>> In the SQL world, the second point isn't right. It is still the case
>>> that not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well,
>>> three valued logic) in SQL is just strange relative to programming
>>> languages:
>>>
>>>    - null *=* "x" -> null
>>>    - null *is distinct from* "x" -> true
>>>    - *not*(null) -> null
>>>    - null *and* true -> null
>>>    - null *or* false -> null
>>>
>>> We absolutely need a null safe equals function (<=> or "is distinct
>>> from") also, which is what we currently have as equals. So we really need
>>> to have the logical operators also treat null as a special value.
>>>
>>> .. Owen
>>>
>>>
>>>
>>> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> It would be nice to avoid the problem by changing the semantics of
>>>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
>>>> reasons.
>>>>
>>>> First, I think that API users creating expressions directly expect the
>>>> current behavior. It would be surprising to a user if a notEqual
>>>> expression didn’t return nulls. People integrating Iceberg in SQL engines
>>>> will be more aware of SQL semantics, especially if the behavior we choose
>>>> is documented. I think for API uses, the current behavior is better.
>>>>
>>>> Second, some evaluations require expressions to be rewritten without
>>>> not, so we have a utility that pushes not down an expression tree to
>>>> the leaf predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
>>>> "x"). If we were to change the semantics of notEqual, then this
>>>> rewrite would no longer be valid. If col is null then it is not equal
>>>> to x, and negating that result is true. But notEqual would give a
>>>> different answer so we can’t rewrite it.
>>>>
>>>> We can work around the rewrite problem by adding
>>>> Expressions.sqlNotEqual method for engines to call that has the SQL
>>>> semantics by returning and(notEqual("col", "x"), notNull("col")).
>>>>
>>>> For pushdown, we should add tests for these cases and rewrite
>>>> expressions to account for the difference. Iceberg should push notEqual("col",
>>>> "x") to ORC as SQL (col != 'x' or col is null). Presto can similarly
>>>> translate col != 'x' to and(notEqual("col", "x"), notNull("col").
>>>>
>>>> rb
>>>>
>>>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <ow...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think that we should follow the SQL semantics to prevent surprises
>>>>> when SQL engines integrate with Iceberg.
>>>>>
>>>>> .. Owen
>>>>>
>>>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>>>>> shardulsmahadik@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I noticed that Iceberg's predicates are not compatible with SQL
>>>>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>>>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>>>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>>>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>>>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>>>>> more rows than required when an engine pushes down these predicates,
>>>>>> however the engines will filter out these additional rows, so everything
>>>>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>>>>> on Iceberg's expression evaluators for filtering will return the additional
>>>>>> rows. Should we change the behavior of Iceberg expressions to be more
>>>>>> SQL-like or should we keep this behavior and document the differences when
>>>>>> compared with SQL?
>>>>>>
>>>>>> This also has some implications on predicate pushdown e.g. ORC
>>>>>> follows SQL semantics and if we try to push down Iceberg predicates, simply
>>>>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>>>>> as it does not return NULLs contrary to what Iceberg expects.
>>>>>>
>>>>>> Thanks,
>>>>>> Shardul
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: SQL compatibility of Iceberg Expressions

Posted by Owen O'Malley <ow...@gmail.com>.

No, you can translate these expressions, but you have to evaluate the
entire expression. For example:

"col1 = 'x' and col2 in (1,2)" becomes col1 = 'x' and col2 in (1,2)
"not(col1 = 'x' and col2 in (1,2))" becomes (col1 != 'x' or col2 not in
(1,2)) and col1 is not null and col2 is not null

Furthermore, ORC does (and Parquet should) already use these semantics.
Therefore, you'll end up translating on both sides:

hive        -\
presto      --+-> Iceberg -+--> ORC
spark sql   -/              \-> Parquet

Since the non-sql use cases have fewer pushdown predicates, having a
translation on that side seems less error-prone.

.. Owen


On Fri, Sep 18, 2020 at 10:54 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Are you saying that we can't fix this by rewriting expressions to
> translate from SQL to more natural semantics?
>
> On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <ow...@gmail.com>
> wrote:
>
>> In the SQL world, the second point isn't right. It is still the case that
>> not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
>> valued logic) in SQL is just strange relative to programming languages:
>>
>>    - null *=* "x" -> null
>>    - null *is distinct from* "x" -> true
>>    - *not*(null) -> null
>>    - null *and* true -> null
>>    - null *or* false -> null
>>
>> We absolutely need a null safe equals function (<=> or "is distinct
>> from") also, which is what we currently have as equals. So we really need
>> to have the logical operators also treat null as a special value.
>>
>> .. Owen
>>
>>
>>
>> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> It would be nice to avoid the problem by changing the semantics of
>>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
>>> reasons.
>>>
>>> First, I think that API users creating expressions directly expect the
>>> current behavior. It would be surprising to a user if a notEqual
>>> expression didn’t return nulls. People integrating Iceberg in SQL engines
>>> will be more aware of SQL semantics, especially if the behavior we choose
>>> is documented. I think for API uses, the current behavior is better.
>>>
>>> Second, some evaluations require expressions to be rewritten without not,
>>> so we have a utility that pushes not down an expression tree to the
>>> leaf predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
>>> "x"). If we were to change the semantics of notEqual, then this rewrite
>>> would no longer be valid. If col is null then it is not equal to x, and
>>> negating that result is true. But notEqual would give a different
>>> answer so we can’t rewrite it.
>>>
>>> We can work around the rewrite problem by adding Expressions.sqlNotEqual
>>> method for engines to call that has the SQL semantics by returning and(notEqual("col",
>>> "x"), notNull("col")).
>>>
>>> For pushdown, we should add tests for these cases and rewrite
>>> expressions to account for the difference. Iceberg should push notEqual("col",
>>> "x") to ORC as SQL (col != 'x' or col is null). Presto can similarly
>>> translate col != 'x' to and(notEqual("col", "x"), notNull("col").
>>>
>>> rb
>>>
>>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <ow...@gmail.com>
>>> wrote:
>>>
>>>> I think that we should follow the SQL semantics to prevent surprises
>>>> when SQL engines integrate with Iceberg.
>>>>
>>>> .. Owen
>>>>
>>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>>>> shardulsmahadik@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I noticed that Iceberg's predicates are not compatible with SQL
>>>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>>>> more rows than required when an engine pushes down these predicates,
>>>>> however the engines will filter out these additional rows, so everything
>>>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>>>> on Iceberg's expression evaluators for filtering will return the additional
>>>>> rows. Should we change the behavior of Iceberg expressions to be more
>>>>> SQL-like or should we keep this behavior and document the differences when
>>>>> compared with SQL?
>>>>>
>>>>> This also has some implications on predicate pushdown e.g. ORC follows
>>>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>>>> as it does not return NULLs contrary to what Iceberg expects.
>>>>>
>>>>> Thanks,
>>>>> Shardul
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: SQL compatibility of Iceberg Expressions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Are you saying that we can't fix this by rewriting expressions to translate
from SQL to more natural semantics?

On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <ow...@gmail.com>
wrote:

> In the SQL world, the second point isn't right. It is still the case that
> not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
> valued logic) in SQL is just strange relative to programming languages:
>
>    - null *=* "x" -> null
>    - null *is distinct from* "x" -> true
>    - *not*(null) -> null
>    - null *and* true -> null
>    - null *or* false -> null
>
> We absolutely need a null safe equals function (<=> or "is distinct from")
> also, which is what we currently have as equals. So we really need to have
> the logical operators also treat null as a special value.
>
> .. Owen
>
>
>
> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> It would be nice to avoid the problem by changing the semantics of
>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
>> reasons.
>>
>> First, I think that API users creating expressions directly expect the
>> current behavior. It would be surprising to a user if a notEqual
>> expression didn’t return nulls. People integrating Iceberg in SQL engines
>> will be more aware of SQL semantics, especially if the behavior we choose
>> is documented. I think for API uses, the current behavior is better.
>>
>> Second, some evaluations require expressions to be rewritten without not,
>> so we have a utility that pushes not down an expression tree to the leaf
>> predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
>> "x"). If we were to change the semantics of notEqual, then this rewrite
>> would no longer be valid. If col is null then it is not equal to x, and
>> negating that result is true. But notEqual would give a different answer
>> so we can’t rewrite it.
>>
>> We can work around the rewrite problem by adding Expressions.sqlNotEqual
>> method for engines to call that has the SQL semantics by returning and(notEqual("col",
>> "x"), notNull("col")).
>>
>> For pushdown, we should add tests for these cases and rewrite expressions
>> to account for the difference. Iceberg should push notEqual("col", "x")
>> to ORC as SQL (col != 'x' or col is null). Presto can similarly
>> translate col != 'x' to and(notEqual("col", "x"), notNull("col").
>>
>> rb
>>
>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <ow...@gmail.com>
>> wrote:
>>
>>> I think that we should follow the SQL semantics to prevent surprises
>>> when SQL engines integrate with Iceberg.
>>>
>>> .. Owen
>>>
>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>>> shardulsmahadik@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I noticed that Iceberg's predicates are not compatible with SQL
>>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>>> more rows than required when an engine pushes down these predicates,
>>>> however the engines will filter out these additional rows, so everything
>>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>>> on Iceberg's expression evaluators for filtering will return the additional
>>>> rows. Should we change the behavior of Iceberg expressions to be more
>>>> SQL-like or should we keep this behavior and document the differences when
>>>> compared with SQL?
>>>>
>>>> This also has some implications on predicate pushdown e.g. ORC follows
>>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>>> as it does not return NULLs contrary to what Iceberg expects.
>>>>
>>>> Thanks,
>>>> Shardul
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: SQL compatibility of Iceberg Expressions

Posted by Owen O'Malley <ow...@gmail.com>.

In the SQL world, the second point isn't right. It is still the case that
not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
valued logic) in SQL is just strange relative to programming languages:

   - null *=* "x" -> null
   - null *is distinct from* "x" -> true
   - *not*(null) -> null
   - null *and* true -> null
   - null *or* false -> null

We absolutely need a null safe equals function (<=> or "is distinct from")
also, which is what we currently have as equals. So we really need to have
the logical operators also treat null as a special value.

.. Owen



On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> It would be nice to avoid the problem by changing the semantics of
> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
> reasons.
>
> First, I think that API users creating expressions directly expect the
> current behavior. It would be surprising to a user if a notEqual
> expression didn’t return nulls. People integrating Iceberg in SQL engines
> will be more aware of SQL semantics, especially if the behavior we choose
> is documented. I think for API uses, the current behavior is better.
>
> Second, some evaluations require expressions to be rewritten without not,
> so we have a utility that pushes not down an expression tree to the leaf
> predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
> "x"). If we were to change the semantics of notEqual, then this rewrite
> would no longer be valid. If col is null then it is not equal to x, and
> negating that result is true. But notEqual would give a different answer
> so we can’t rewrite it.
>
> We can work around the rewrite problem by adding Expressions.sqlNotEqual
> method for engines to call that has the SQL semantics by returning and(notEqual("col",
> "x"), notNull("col")).
>
> For pushdown, we should add tests for these cases and rewrite expressions
> to account for the difference. Iceberg should push notEqual("col", "x")
> to ORC as SQL (col != 'x' or col is null). Presto can similarly translate col
> != 'x' to and(notEqual("col", "x"), notNull("col").
>
> rb
>
> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <ow...@gmail.com>
> wrote:
>
>> I think that we should follow the SQL semantics to prevent surprises when
>> SQL engines integrate with Iceberg.
>>
>> .. Owen
>>
>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>> shardulsmahadik@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I noticed that Iceberg's predicates are not compatible with SQL
>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>> more rows than required when an engine pushes down these predicates,
>>> however the engines will filter out these additional rows, so everything
>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>> on Iceberg's expression evaluators for filtering will return the additional
>>> rows. Should we change the behavior of Iceberg expressions to be more
>>> SQL-like or should we keep this behavior and document the differences when
>>> compared with SQL?
>>>
>>> This also has some implications on predicate pushdown e.g. ORC follows
>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>> as it does not return NULLs contrary to what Iceberg expects.
>>>
>>> Thanks,
>>> Shardul
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: SQL compatibility of Iceberg Expressions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

It would be nice to avoid the problem by changing the semantics of
Iceberg’s notNull, but I don’t think that’s a good idea for 2 main reasons.

First, I think that API users creating expressions directly expect the
current behavior. It would be surprising to a user if a notEqual expression
didn’t return nulls. People integrating Iceberg in SQL engines will be more
aware of SQL semantics, especially if the behavior we choose is documented.
I think for API uses, the current behavior is better.

Second, some evaluations require expressions to be rewritten without not,
so we have a utility that pushes not down an expression tree to the leaf
predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
"x"). If we were to change the semantics of notEqual, then this rewrite
would no longer be valid. If col is null then it is not equal to x, and
negating that result is true. But notEqual would give a different answer so
we can’t rewrite it.

We can work around the rewrite problem by adding Expressions.sqlNotEqual
method for engines to call that has the SQL semantics by returning
and(notEqual("col",
"x"), notNull("col")).

For pushdown, we should add tests for these cases and rewrite expressions
to account for the difference. Iceberg should push notEqual("col", "x") to
ORC as SQL (col != 'x' or col is null). Presto can similarly translate col
!= 'x' to and(notEqual("col", "x"), notNull("col").

rb

On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <ow...@gmail.com>
wrote:

> I think that we should follow the SQL semantics to prevent surprises when
> SQL engines integrate with Iceberg.
>
> .. Owen
>
> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <sh...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I noticed that Iceberg's predicates are not compatible with SQL
>> predicates when it comes to handling NULL values. In SQL, if any of the
>> operands of a scalar comparison predicate is NULL, then the resultant truth
>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>> resultant output will be different for Iceberg v/s SQL. e.g.
>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>> more rows than required when an engine pushes down these predicates,
>> however the engines will filter out these additional rows, so everything
>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>> on Iceberg's expression evaluators for filtering will return the additional
>> rows. Should we change the behavior of Iceberg expressions to be more
>> SQL-like or should we keep this behavior and document the differences when
>> compared with SQL?
>>
>> This also has some implications on predicate pushdown e.g. ORC follows
>> SQL semantics and if we try to push down Iceberg predicates, simply
>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>> as it does not return NULLs contrary to what Iceberg expects.
>>
>> Thanks,
>> Shardul
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: SQL compatibility of Iceberg Expressions

Posted by Owen O'Malley <ow...@gmail.com>.

I think that we should follow the SQL semantics to prevent surprises when
SQL engines integrate with Iceberg.

.. Owen

On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <sh...@gmail.com>
wrote:

> Hi all,
>
> I noticed that Iceberg's predicates are not compatible with SQL predicates
> when it comes to handling NULL values. In SQL, if any of the operands of a
> scalar comparison predicate is NULL, then the resultant truth value of the
> predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a NULL in SQL and
> not FALSE. If such predicates are used as filters, the resultant output
> will be different for Iceberg v/s SQL. e.g. `.filter(notEqual(column,
> 'x'))` in Iceberg will return rows excluding ‘x’ but including NULL. The
> same thing in Presto SQL `WHERE column != 'x'` will return rows excluding
> both ‘x’ and NULL. So essentially, Iceberg can return more rows than
> required when an engine pushes down these predicates, however the engines
> will filter out these additional rows, so everything seems good. But
> modules like iceberg-data and iceberg-mr which rely solely on Iceberg's
> expression evaluators for filtering will return the additional rows. Should
> we change the behavior of Iceberg expressions to be more SQL-like or should
> we keep this behavior and document the differences when compared with SQL?
>
> This also has some implications on predicate pushdown e.g. ORC follows SQL
> semantics and if we try to push down Iceberg predicates, simply converting
> Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient as it does
> not return NULLs contrary to what Iceberg expects.
>
> Thanks,
> Shardul
>