You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Alexander Shoshin <Al...@epam.com> on 2016/11/01 15:49:00 UTC

RE: Problems with abstract syntax tree

Julian, thank you for help.

I had a wrong picture of NULL values processing. So, it looks like there is some problem in my planner rules.
As for the AST, I was confused by the wrong Flink "explain()" function description :)


Regards,
Alexander

-----Original Message-----
From: Julian Hyde [mailto:jhyde@apache.org] 
Sent: Monday, October 31, 2016 10:43 PM
To: dev@calcite.apache.org
Subject: Re: Problems with abstract syntax tree

The behavior of NOT IN in SQL is complicated when there are NULL values around. In particular, if one "word" value from the sub-query is null, then the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' null)" would evaluate to UNKNOWN for every row.)

It is valid to deduce that "word" in the sub-query is never null, because of the "WHERE word = 'hello'" condition. I would have hoped that a constant reduction could do that, and then maybe the CASE expression can be simplified.

By the way, to be pedantic, what we are talking about here is the RelNode tree, the relational algebra, which comes out of the SqlToRelConverter. The AST is the SqlNode tree, which comes out of the parser and goes into the SqlToRelConverter.

On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin <Al...@epam.com> wrote:
> Hello, everybody.
>
> Trying to resolve an Apache Flink issue I got some troubles with Calcite. Can you help me to understand is there a problem in Calcite or just in wrong settings passed to Calcite functions?
>
> I have a simple table "Words" with one column named "word" and a query with NOT IN operator:
> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM Words WHERE word = 'hello')"
>
> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and then transformed to a relational tree by org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>
> As a result I see the following abstract syntax tree
> LogicalProject(word=[$0])
>   LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, IS NULL($3), null, <($2, $1), null, false))])
>     LogicalJoin(condition=[=($3, $4)], joinType=[left])
>       LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>         LogicalJoin(condition=[true], joinType=[inner])
>           EnumerableTableScan(table=[[Words]])
>           LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>             LogicalProject($f0=[$0], $f1=[true])
>               LogicalProject(word=[$0])
>                 LogicalFilter(condition=[=($0, 'hello')])
>                   EnumerableTableScan(table=[[Words]])
>       LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>         LogicalProject($f0=[$0], $f1=[true])
>           LogicalProject(word=[$0])
>             LogicalFilter(condition=[=($0, 'hello')])
>               EnumerableTableScan(table=[[Words]])
>
> which fails later during query plan optimization (while calling org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>
> I think it might be because of a very complex abstract syntax tree generated by Calcite. Shouldn't it be more simple? This one looks good for me:
> LogicalProject(word=[$0])
>   LogicalFilter(condition=[IS NULL($2)])
>     LogicalJoin(condition=[=($0, $1)], joinType=[left])
>       EnumerableTableScan(table=[[Words]])
>       LogicalProject($f0=[$0], $f1=[true])
>         LogicalProject(word=[$0])
>           LogicalFilter(condition=[=($0, 'hello')])
>             EnumerableTableScan(table=[[Words]])
>
> And when I write a query using LEFT OUTER JOIN to receive this syntax tree - the optimization works fine. And the query execution result is the same as must be in case of using NOT IN. So am I wrong with a supposition about bad abstract syntax tree or not? I will be glad to receive any comments.
>
> Regards,
> Alexander

Re: Problems with abstract syntax tree

Posted by Vineet Garg <vg...@hortonworks.com>.

I logged in jira CALCITE-1483 for the query in question. I’ll try to find more examples.




On 11/7/16, 9:17 PM, "Vineet Garg" <vg...@hortonworks.com> wrote:

>Hi Julian,
>
>Apologies for not responding earlier.
>
>I understand that planner rules sometime produces a plan that is sub-optimal. My concern was about planner rules producing a plan consisting of an expression (literal null constant in this case) with null type i.e. SqlTypeName.NULL. I was wondering if this might be a bug on Calcite side. But it looks like Calcite has a concept of null data type and this seems to be expected. 
>
>Vineet
>
>
>
>On 11/3/16, 12:14 PM, "Julian Hyde" <jh...@apache.org> wrote:
>
>>Vineet,
>>
>>In case you forgot, can you please log that JIRA case? If we have a lengthy design discussion without creating an action item, we are wasting everyone’s time.
>>
>>Julian
>>
>>> On Nov 1, 2016, at 11:00 AM, Julian Hyde <jh...@apache.org> wrote:
>>> 
>>> Alexander & Vineet,
>>> 
>>> One further comment about “NOT IN”. SQL in general is fairly close to relational algebra, but “NOT IN” is one of the places where the gap is widest. “NOT IN” is difficult in general to execute efficiently, because of the problem of NULL values (at Oracle, we always recommended to users to rewrite as NOT EXISTS if possible). The gap between SQL and relational algebra is apparent when a short SQL query becomes a complex RelNode tree.
>>> 
>>> There is a silver lining: the RelNode tree, being relational algebra, has well-behaved semantics. Once you’re in RelNode land, you can freely apply transformation rules to make it efficient.
>>> 
>>> Vineet,
>>> 
>>> If the planner rules produce a plan that is sub-optimal I wouldn’t call it a bug but a missing feature. (It would be a bug if the planner over-reached and created a plan that gave wrong results, so I always tend to be conservative about adding rules.)
>>> 
>>> Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few redundant projects and filters are no problem, and can be easily removed later with rules that reduce constants and propagate predicates throughout the tree. But for the general case of NOT IN, we have to add a self-join to deal with the possibility that the key has NULL values. After constant reduction has kicked in and we have realized that NULL key values are not possible, it is not easy to remove that self-join.
>>> 
>>> Here is a very simple query where this happens:
>>> 
>>> sqlline> !connect jdbc:calcite:model=core/src/test/resources/hsqldb-model.json sa ""
>>> sqlline> !set outputformat csv
>>> sqlline> explain plan for select * from scott.emp where deptno not in (
>>>> select deptno from scott.dept where deptno = 20);
>>> 'PLAN'
>>> 'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14, $t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], proj#0..7=[{exprs}], $condition=[$t21])
>>>  EnumerableJoin(condition=[=($7, $10)], joinType=[left])
>>>    EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4], MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0], ck=[$t1])
>>>      EnumerableJoin(condition=[true], joinType=[inner])
>>>        JdbcToEnumerableConverter
>>>          JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
>>>            JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>>>              JdbcTableScan(table=[[SCOTT, DEPT]])
>>>        JdbcToEnumerableConverter
>>>          JdbcTableScan(table=[[SCOTT, EMP]])
>>>    JdbcToEnumerableConverter
>>>      JdbcAggregate(group=[{0, 1}])
>>>        JdbcProject(DEPTNO=[$0], i=[true])
>>>          JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>>>            JdbcTableScan(table=[[SCOTT, DEPT]])
>>> '
>>> 1 row selected (0.067 seconds)
>>> 
>>> Note that there are two scans of DEPT, but one is sufficient because DEPTNO is never null. In the JdbcAggregate, c always equals ck, and therefore the CASE can be simplified, and therefore the scan of DEPT that produces c and ck can be dropped, but Calcite rules cannot deduce that fact.
>>> 
>>> Can you please log a JIRA case for this? See if you can find some other queries (maybe using IN rather than NOT IN, or whose key columns are not so obviously NOT NULL) and include these in the JIRA case also.
>>> 
>>> I doubt we can fix using a planner rule. The best solution may be to use RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add the join.
>>> 
>>> Julian
>>> 
>>> 
>>>> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <Al...@epam.com> wrote:
>>>> 
>>>> Julian, thank you for help.
>>>> 
>>>> I had a wrong picture of NULL values processing. So, it looks like there is some problem in my planner rules.
>>>> As for the AST, I was confused by the wrong Flink "explain()" function description :)
>>>> 
>>>> 
>>>> Regards,
>>>> Alexander
>>>> 
>>>> -----Original Message-----
>>>> From: Julian Hyde [mailto:jhyde@apache.org] 
>>>> Sent: Monday, October 31, 2016 10:43 PM
>>>> To: dev@calcite.apache.org
>>>> Subject: Re: Problems with abstract syntax tree
>>>> 
>>>> The behavior of NOT IN in SQL is complicated when there are NULL values around. In particular, if one "word" value from the sub-query is null, then the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' null)" would evaluate to UNKNOWN for every row.)
>>>> 
>>>> It is valid to deduce that "word" in the sub-query is never null, because of the "WHERE word = 'hello'" condition. I would have hoped that a constant reduction could do that, and then maybe the CASE expression can be simplified.
>>>> 
>>>> By the way, to be pedantic, what we are talking about here is the RelNode tree, the relational algebra, which comes out of the SqlToRelConverter. The AST is the SqlNode tree, which comes out of the parser and goes into the SqlToRelConverter.
>>>> 
>>>> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin <Al...@epam.com> wrote:
>>>>> Hello, everybody.
>>>>> 
>>>>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. Can you help me to understand is there a problem in Calcite or just in wrong settings passed to Calcite functions?
>>>>> 
>>>>> I have a simple table "Words" with one column named "word" and a query with NOT IN operator:
>>>>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM Words WHERE word = 'hello')"
>>>>> 
>>>>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and then transformed to a relational tree by org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>>>>> 
>>>>> As a result I see the following abstract syntax tree
>>>>> LogicalProject(word=[$0])
>>>>> LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, IS NULL($3), null, <($2, $1), null, false))])
>>>>>   LogicalJoin(condition=[=($3, $4)], joinType=[left])
>>>>>     LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>>>>>       LogicalJoin(condition=[true], joinType=[inner])
>>>>>         EnumerableTableScan(table=[[Words]])
>>>>>         LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>>>>>           LogicalProject($f0=[$0], $f1=[true])
>>>>>             LogicalProject(word=[$0])
>>>>>               LogicalFilter(condition=[=($0, 'hello')])
>>>>>                 EnumerableTableScan(table=[[Words]])
>>>>>     LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>>>>>       LogicalProject($f0=[$0], $f1=[true])
>>>>>         LogicalProject(word=[$0])
>>>>>           LogicalFilter(condition=[=($0, 'hello')])
>>>>>             EnumerableTableScan(table=[[Words]])
>>>>> 
>>>>> which fails later during query plan optimization (while calling org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>>>>> 
>>>>> I think it might be because of a very complex abstract syntax tree generated by Calcite. Shouldn't it be more simple? This one looks good for me:
>>>>> LogicalProject(word=[$0])
>>>>> LogicalFilter(condition=[IS NULL($2)])
>>>>>   LogicalJoin(condition=[=($0, $1)], joinType=[left])
>>>>>     EnumerableTableScan(table=[[Words]])
>>>>>     LogicalProject($f0=[$0], $f1=[true])
>>>>>       LogicalProject(word=[$0])
>>>>>         LogicalFilter(condition=[=($0, 'hello')])
>>>>>           EnumerableTableScan(table=[[Words]])
>>>>> 
>>>>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree - the optimization works fine. And the query execution result is the same as must be in case of using NOT IN. So am I wrong with a supposition about bad abstract syntax tree or not? I will be glad to receive any comments.
>>>>> 
>>>>> Regards,
>>>>> Alexander
>>> 
>>
>>

Re: Problems with abstract syntax tree

Posted by Vineet Garg <vg...@hortonworks.com>.

Hi Julian,

Apologies for not responding earlier.

I understand that planner rules sometime produces a plan that is sub-optimal. My concern was about planner rules producing a plan consisting of an expression (literal null constant in this case) with null type i.e. SqlTypeName.NULL. I was wondering if this might be a bug on Calcite side. But it looks like Calcite has a concept of null data type and this seems to be expected. 

Vineet



On 11/3/16, 12:14 PM, "Julian Hyde" <jh...@apache.org> wrote:

>Vineet,
>
>In case you forgot, can you please log that JIRA case? If we have a lengthy design discussion without creating an action item, we are wasting everyone’s time.
>
>Julian
>
>> On Nov 1, 2016, at 11:00 AM, Julian Hyde <jh...@apache.org> wrote:
>> 
>> Alexander & Vineet,
>> 
>> One further comment about “NOT IN”. SQL in general is fairly close to relational algebra, but “NOT IN” is one of the places where the gap is widest. “NOT IN” is difficult in general to execute efficiently, because of the problem of NULL values (at Oracle, we always recommended to users to rewrite as NOT EXISTS if possible). The gap between SQL and relational algebra is apparent when a short SQL query becomes a complex RelNode tree.
>> 
>> There is a silver lining: the RelNode tree, being relational algebra, has well-behaved semantics. Once you’re in RelNode land, you can freely apply transformation rules to make it efficient.
>> 
>> Vineet,
>> 
>> If the planner rules produce a plan that is sub-optimal I wouldn’t call it a bug but a missing feature. (It would be a bug if the planner over-reached and created a plan that gave wrong results, so I always tend to be conservative about adding rules.)
>> 
>> Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few redundant projects and filters are no problem, and can be easily removed later with rules that reduce constants and propagate predicates throughout the tree. But for the general case of NOT IN, we have to add a self-join to deal with the possibility that the key has NULL values. After constant reduction has kicked in and we have realized that NULL key values are not possible, it is not easy to remove that self-join.
>> 
>> Here is a very simple query where this happens:
>> 
>> sqlline> !connect jdbc:calcite:model=core/src/test/resources/hsqldb-model.json sa ""
>> sqlline> !set outputformat csv
>> sqlline> explain plan for select * from scott.emp where deptno not in (
>>> select deptno from scott.dept where deptno = 20);
>> 'PLAN'
>> 'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14, $t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], proj#0..7=[{exprs}], $condition=[$t21])
>>  EnumerableJoin(condition=[=($7, $10)], joinType=[left])
>>    EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4], MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0], ck=[$t1])
>>      EnumerableJoin(condition=[true], joinType=[inner])
>>        JdbcToEnumerableConverter
>>          JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
>>            JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>>              JdbcTableScan(table=[[SCOTT, DEPT]])
>>        JdbcToEnumerableConverter
>>          JdbcTableScan(table=[[SCOTT, EMP]])
>>    JdbcToEnumerableConverter
>>      JdbcAggregate(group=[{0, 1}])
>>        JdbcProject(DEPTNO=[$0], i=[true])
>>          JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>>            JdbcTableScan(table=[[SCOTT, DEPT]])
>> '
>> 1 row selected (0.067 seconds)
>> 
>> Note that there are two scans of DEPT, but one is sufficient because DEPTNO is never null. In the JdbcAggregate, c always equals ck, and therefore the CASE can be simplified, and therefore the scan of DEPT that produces c and ck can be dropped, but Calcite rules cannot deduce that fact.
>> 
>> Can you please log a JIRA case for this? See if you can find some other queries (maybe using IN rather than NOT IN, or whose key columns are not so obviously NOT NULL) and include these in the JIRA case also.
>> 
>> I doubt we can fix using a planner rule. The best solution may be to use RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add the join.
>> 
>> Julian
>> 
>> 
>>> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <Al...@epam.com> wrote:
>>> 
>>> Julian, thank you for help.
>>> 
>>> I had a wrong picture of NULL values processing. So, it looks like there is some problem in my planner rules.
>>> As for the AST, I was confused by the wrong Flink "explain()" function description :)
>>> 
>>> 
>>> Regards,
>>> Alexander
>>> 
>>> -----Original Message-----
>>> From: Julian Hyde [mailto:jhyde@apache.org] 
>>> Sent: Monday, October 31, 2016 10:43 PM
>>> To: dev@calcite.apache.org
>>> Subject: Re: Problems with abstract syntax tree
>>> 
>>> The behavior of NOT IN in SQL is complicated when there are NULL values around. In particular, if one "word" value from the sub-query is null, then the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' null)" would evaluate to UNKNOWN for every row.)
>>> 
>>> It is valid to deduce that "word" in the sub-query is never null, because of the "WHERE word = 'hello'" condition. I would have hoped that a constant reduction could do that, and then maybe the CASE expression can be simplified.
>>> 
>>> By the way, to be pedantic, what we are talking about here is the RelNode tree, the relational algebra, which comes out of the SqlToRelConverter. The AST is the SqlNode tree, which comes out of the parser and goes into the SqlToRelConverter.
>>> 
>>> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin <Al...@epam.com> wrote:
>>>> Hello, everybody.
>>>> 
>>>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. Can you help me to understand is there a problem in Calcite or just in wrong settings passed to Calcite functions?
>>>> 
>>>> I have a simple table "Words" with one column named "word" and a query with NOT IN operator:
>>>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM Words WHERE word = 'hello')"
>>>> 
>>>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and then transformed to a relational tree by org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>>>> 
>>>> As a result I see the following abstract syntax tree
>>>> LogicalProject(word=[$0])
>>>> LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, IS NULL($3), null, <($2, $1), null, false))])
>>>>   LogicalJoin(condition=[=($3, $4)], joinType=[left])
>>>>     LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>>>>       LogicalJoin(condition=[true], joinType=[inner])
>>>>         EnumerableTableScan(table=[[Words]])
>>>>         LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>>>>           LogicalProject($f0=[$0], $f1=[true])
>>>>             LogicalProject(word=[$0])
>>>>               LogicalFilter(condition=[=($0, 'hello')])
>>>>                 EnumerableTableScan(table=[[Words]])
>>>>     LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>>>>       LogicalProject($f0=[$0], $f1=[true])
>>>>         LogicalProject(word=[$0])
>>>>           LogicalFilter(condition=[=($0, 'hello')])
>>>>             EnumerableTableScan(table=[[Words]])
>>>> 
>>>> which fails later during query plan optimization (while calling org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>>>> 
>>>> I think it might be because of a very complex abstract syntax tree generated by Calcite. Shouldn't it be more simple? This one looks good for me:
>>>> LogicalProject(word=[$0])
>>>> LogicalFilter(condition=[IS NULL($2)])
>>>>   LogicalJoin(condition=[=($0, $1)], joinType=[left])
>>>>     EnumerableTableScan(table=[[Words]])
>>>>     LogicalProject($f0=[$0], $f1=[true])
>>>>       LogicalProject(word=[$0])
>>>>         LogicalFilter(condition=[=($0, 'hello')])
>>>>           EnumerableTableScan(table=[[Words]])
>>>> 
>>>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree - the optimization works fine. And the query execution result is the same as must be in case of using NOT IN. So am I wrong with a supposition about bad abstract syntax tree or not? I will be glad to receive any comments.
>>>> 
>>>> Regards,
>>>> Alexander
>> 
>
>

Re: Problems with abstract syntax tree

Posted by Julian Hyde <jh...@apache.org>.

Vineet,

In case you forgot, can you please log that JIRA case? If we have a lengthy design discussion without creating an action item, we are wasting everyone’s time.

Julian

> On Nov 1, 2016, at 11:00 AM, Julian Hyde <jh...@apache.org> wrote:
> 
> Alexander & Vineet,
> 
> One further comment about “NOT IN”. SQL in general is fairly close to relational algebra, but “NOT IN” is one of the places where the gap is widest. “NOT IN” is difficult in general to execute efficiently, because of the problem of NULL values (at Oracle, we always recommended to users to rewrite as NOT EXISTS if possible). The gap between SQL and relational algebra is apparent when a short SQL query becomes a complex RelNode tree.
> 
> There is a silver lining: the RelNode tree, being relational algebra, has well-behaved semantics. Once you’re in RelNode land, you can freely apply transformation rules to make it efficient.
> 
> Vineet,
> 
> If the planner rules produce a plan that is sub-optimal I wouldn’t call it a bug but a missing feature. (It would be a bug if the planner over-reached and created a plan that gave wrong results, so I always tend to be conservative about adding rules.)
> 
> Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few redundant projects and filters are no problem, and can be easily removed later with rules that reduce constants and propagate predicates throughout the tree. But for the general case of NOT IN, we have to add a self-join to deal with the possibility that the key has NULL values. After constant reduction has kicked in and we have realized that NULL key values are not possible, it is not easy to remove that self-join.
> 
> Here is a very simple query where this happens:
> 
> sqlline> !connect jdbc:calcite:model=core/src/test/resources/hsqldb-model.json sa ""
> sqlline> !set outputformat csv
> sqlline> explain plan for select * from scott.emp where deptno not in (
>> select deptno from scott.dept where deptno = 20);
> 'PLAN'
> 'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14, $t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], proj#0..7=[{exprs}], $condition=[$t21])
>  EnumerableJoin(condition=[=($7, $10)], joinType=[left])
>    EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4], MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0], ck=[$t1])
>      EnumerableJoin(condition=[true], joinType=[inner])
>        JdbcToEnumerableConverter
>          JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
>            JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>              JdbcTableScan(table=[[SCOTT, DEPT]])
>        JdbcToEnumerableConverter
>          JdbcTableScan(table=[[SCOTT, EMP]])
>    JdbcToEnumerableConverter
>      JdbcAggregate(group=[{0, 1}])
>        JdbcProject(DEPTNO=[$0], i=[true])
>          JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>            JdbcTableScan(table=[[SCOTT, DEPT]])
> '
> 1 row selected (0.067 seconds)
> 
> Note that there are two scans of DEPT, but one is sufficient because DEPTNO is never null. In the JdbcAggregate, c always equals ck, and therefore the CASE can be simplified, and therefore the scan of DEPT that produces c and ck can be dropped, but Calcite rules cannot deduce that fact.
> 
> Can you please log a JIRA case for this? See if you can find some other queries (maybe using IN rather than NOT IN, or whose key columns are not so obviously NOT NULL) and include these in the JIRA case also.
> 
> I doubt we can fix using a planner rule. The best solution may be to use RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add the join.
> 
> Julian
> 
> 
>> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <Al...@epam.com> wrote:
>> 
>> Julian, thank you for help.
>> 
>> I had a wrong picture of NULL values processing. So, it looks like there is some problem in my planner rules.
>> As for the AST, I was confused by the wrong Flink "explain()" function description :)
>> 
>> 
>> Regards,
>> Alexander
>> 
>> -----Original Message-----
>> From: Julian Hyde [mailto:jhyde@apache.org] 
>> Sent: Monday, October 31, 2016 10:43 PM
>> To: dev@calcite.apache.org
>> Subject: Re: Problems with abstract syntax tree
>> 
>> The behavior of NOT IN in SQL is complicated when there are NULL values around. In particular, if one "word" value from the sub-query is null, then the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' null)" would evaluate to UNKNOWN for every row.)
>> 
>> It is valid to deduce that "word" in the sub-query is never null, because of the "WHERE word = 'hello'" condition. I would have hoped that a constant reduction could do that, and then maybe the CASE expression can be simplified.
>> 
>> By the way, to be pedantic, what we are talking about here is the RelNode tree, the relational algebra, which comes out of the SqlToRelConverter. The AST is the SqlNode tree, which comes out of the parser and goes into the SqlToRelConverter.
>> 
>> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin <Al...@epam.com> wrote:
>>> Hello, everybody.
>>> 
>>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. Can you help me to understand is there a problem in Calcite or just in wrong settings passed to Calcite functions?
>>> 
>>> I have a simple table "Words" with one column named "word" and a query with NOT IN operator:
>>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM Words WHERE word = 'hello')"
>>> 
>>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and then transformed to a relational tree by org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>>> 
>>> As a result I see the following abstract syntax tree
>>> LogicalProject(word=[$0])
>>> LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, IS NULL($3), null, <($2, $1), null, false))])
>>>   LogicalJoin(condition=[=($3, $4)], joinType=[left])
>>>     LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>>>       LogicalJoin(condition=[true], joinType=[inner])
>>>         EnumerableTableScan(table=[[Words]])
>>>         LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>>>           LogicalProject($f0=[$0], $f1=[true])
>>>             LogicalProject(word=[$0])
>>>               LogicalFilter(condition=[=($0, 'hello')])
>>>                 EnumerableTableScan(table=[[Words]])
>>>     LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>>>       LogicalProject($f0=[$0], $f1=[true])
>>>         LogicalProject(word=[$0])
>>>           LogicalFilter(condition=[=($0, 'hello')])
>>>             EnumerableTableScan(table=[[Words]])
>>> 
>>> which fails later during query plan optimization (while calling org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>>> 
>>> I think it might be because of a very complex abstract syntax tree generated by Calcite. Shouldn't it be more simple? This one looks good for me:
>>> LogicalProject(word=[$0])
>>> LogicalFilter(condition=[IS NULL($2)])
>>>   LogicalJoin(condition=[=($0, $1)], joinType=[left])
>>>     EnumerableTableScan(table=[[Words]])
>>>     LogicalProject($f0=[$0], $f1=[true])
>>>       LogicalProject(word=[$0])
>>>         LogicalFilter(condition=[=($0, 'hello')])
>>>           EnumerableTableScan(table=[[Words]])
>>> 
>>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree - the optimization works fine. And the query execution result is the same as must be in case of using NOT IN. So am I wrong with a supposition about bad abstract syntax tree or not? I will be glad to receive any comments.
>>> 
>>> Regards,
>>> Alexander
>

Re: Problems with abstract syntax tree

Posted by Julian Hyde <jh...@apache.org>.

Alexander & Vineet,

One further comment about “NOT IN”. SQL in general is fairly close to relational algebra, but “NOT IN” is one of the places where the gap is widest. “NOT IN” is difficult in general to execute efficiently, because of the problem of NULL values (at Oracle, we always recommended to users to rewrite as NOT EXISTS if possible). The gap between SQL and relational algebra is apparent when a short SQL query becomes a complex RelNode tree.

There is a silver lining: the RelNode tree, being relational algebra, has well-behaved semantics. Once you’re in RelNode land, you can freely apply transformation rules to make it efficient.

Vineet,

If the planner rules produce a plan that is sub-optimal I wouldn’t call it a bug but a missing feature. (It would be a bug if the planner over-reached and created a plan that gave wrong results, so I always tend to be conservative about adding rules.)

Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few redundant projects and filters are no problem, and can be easily removed later with rules that reduce constants and propagate predicates throughout the tree. But for the general case of NOT IN, we have to add a self-join to deal with the possibility that the key has NULL values. After constant reduction has kicked in and we have realized that NULL key values are not possible, it is not easy to remove that self-join.

Here is a very simple query where this happens:

sqlline> !connect jdbc:calcite:model=core/src/test/resources/hsqldb-model.json sa ""
sqlline> !set outputformat csv
sqlline> explain plan for select * from scott.emp where deptno not in (
       > select deptno from scott.dept where deptno = 20);
'PLAN'
'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14, $t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], proj#0..7=[{exprs}], $condition=[$t21])
  EnumerableJoin(condition=[=($7, $10)], joinType=[left])
    EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4], MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0], ck=[$t1])
      EnumerableJoin(condition=[true], joinType=[inner])
        JdbcToEnumerableConverter
          JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
            JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
              JdbcTableScan(table=[[SCOTT, DEPT]])
        JdbcToEnumerableConverter
          JdbcTableScan(table=[[SCOTT, EMP]])
    JdbcToEnumerableConverter
      JdbcAggregate(group=[{0, 1}])
        JdbcProject(DEPTNO=[$0], i=[true])
          JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
            JdbcTableScan(table=[[SCOTT, DEPT]])
'
1 row selected (0.067 seconds)

Note that there are two scans of DEPT, but one is sufficient because DEPTNO is never null. In the JdbcAggregate, c always equals ck, and therefore the CASE can be simplified, and therefore the scan of DEPT that produces c and ck can be dropped, but Calcite rules cannot deduce that fact.

Can you please log a JIRA case for this? See if you can find some other queries (maybe using IN rather than NOT IN, or whose key columns are not so obviously NOT NULL) and include these in the JIRA case also.

I doubt we can fix using a planner rule. The best solution may be to use RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add the join.

Julian


> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <Al...@epam.com> wrote:
> 
> Julian, thank you for help.
> 
> I had a wrong picture of NULL values processing. So, it looks like there is some problem in my planner rules.
> As for the AST, I was confused by the wrong Flink "explain()" function description :)
> 
> 
> Regards,
> Alexander
> 
> -----Original Message-----
> From: Julian Hyde [mailto:jhyde@apache.org] 
> Sent: Monday, October 31, 2016 10:43 PM
> To: dev@calcite.apache.org
> Subject: Re: Problems with abstract syntax tree
> 
> The behavior of NOT IN in SQL is complicated when there are NULL values around. In particular, if one "word" value from the sub-query is null, then the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' null)" would evaluate to UNKNOWN for every row.)
> 
> It is valid to deduce that "word" in the sub-query is never null, because of the "WHERE word = 'hello'" condition. I would have hoped that a constant reduction could do that, and then maybe the CASE expression can be simplified.
> 
> By the way, to be pedantic, what we are talking about here is the RelNode tree, the relational algebra, which comes out of the SqlToRelConverter. The AST is the SqlNode tree, which comes out of the parser and goes into the SqlToRelConverter.
> 
> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin <Al...@epam.com> wrote:
>> Hello, everybody.
>> 
>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. Can you help me to understand is there a problem in Calcite or just in wrong settings passed to Calcite functions?
>> 
>> I have a simple table "Words" with one column named "word" and a query with NOT IN operator:
>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM Words WHERE word = 'hello')"
>> 
>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and then transformed to a relational tree by org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>> 
>> As a result I see the following abstract syntax tree
>> LogicalProject(word=[$0])
>>  LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, IS NULL($3), null, <($2, $1), null, false))])
>>    LogicalJoin(condition=[=($3, $4)], joinType=[left])
>>      LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>>        LogicalJoin(condition=[true], joinType=[inner])
>>          EnumerableTableScan(table=[[Words]])
>>          LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>>            LogicalProject($f0=[$0], $f1=[true])
>>              LogicalProject(word=[$0])
>>                LogicalFilter(condition=[=($0, 'hello')])
>>                  EnumerableTableScan(table=[[Words]])
>>      LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>>        LogicalProject($f0=[$0], $f1=[true])
>>          LogicalProject(word=[$0])
>>            LogicalFilter(condition=[=($0, 'hello')])
>>              EnumerableTableScan(table=[[Words]])
>> 
>> which fails later during query plan optimization (while calling org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>> 
>> I think it might be because of a very complex abstract syntax tree generated by Calcite. Shouldn't it be more simple? This one looks good for me:
>> LogicalProject(word=[$0])
>>  LogicalFilter(condition=[IS NULL($2)])
>>    LogicalJoin(condition=[=($0, $1)], joinType=[left])
>>      EnumerableTableScan(table=[[Words]])
>>      LogicalProject($f0=[$0], $f1=[true])
>>        LogicalProject(word=[$0])
>>          LogicalFilter(condition=[=($0, 'hello')])
>>            EnumerableTableScan(table=[[Words]])
>> 
>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree - the optimization works fine. And the query execution result is the same as must be in case of using NOT IN. So am I wrong with a supposition about bad abstract syntax tree or not? I will be glad to receive any comments.
>> 
>> Regards,
>> Alexander