You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Vineet Garg <vg...@apache.org> on 2019/04/16 00:14:51 UTC

How to rewrite <>ANY subqueries

Hello folks,

I am currently stuck on a problem and I wonder if  the SQL folks here could
help me make progress.

There is a bug in CALCITE where subqueries with =ANY and <>ANY are
transformed incorrectly (CALCITE-2986
<https://issues.apache.org/jira/browse/CALCITE-2986>). I was able to figure
out the transformation for =ANY but I am unable to figure out how to
rewrite <>ANY clause.
I came across some sources online suggesting <>ANY could be rewritten using
EXISTS or NOT IN but I fail to see how. Using EXISTS will not work because
EXISTS only produce TRUE/FALSE but ANY requires three valued logic. NOT IN
is equivalent to column<>value1 AND column<>value2 but <>ANY is equivalent
to column<>value1 OR column<>value2 so I am not sure how these are
equivalent either.

There is some discussion on CALCITE-2986 regarding using LEFT OUTER JOIN,
SEMI JOIN or Nested loop join but I don't believe any of them will produce
correct semantics.

Does anyone know what is the correct rewrite for <>ANY?

P.S. Can someone review pull request to fix =ANY rewrite? - #1161
<https://github.com/apache/calcite/pull/1161>

Thanks,
Vineet Garg

Re: Re: How to rewrite <>ANY subqueries

Posted by Haisheng Yuan <h....@alibaba-inc.com>.

There are 2 kinds of subquery context, value context and filter context.
1. Value Context, e.g.
select R.i in (select T.i from T) from R;
2. Filter Context, e.g.
select * from R where R.i in (select T.i from T);

We are talking about subquery in value context here, for filter context it is quite
straightforward to rewrite into semi join or semi apply.

Let's generalize the value context subquery to <op> ANY () or <op> ALL (), e.g.
select (select T.i <op> ANY|ALL (select R.i from R)) from T;

For each tuple from T, we expect the subquery evaluation result to be one of the
following 3 values {TRUE, FALSE, NULL (UNKNOWN)}. This depends on the 
comparison result of T.i with all the qualifying tuples from R.

Let's first define the following:
A: set of comparison results of T.i <op> R.i
B: ANY subquery result
C: ALL subquery  result
D: count aggregate, the number of rows in the subquery that evaluate to NULL or true
E: sum aggregate, the number of rows in the subquery that evaluate to NULL

            A                      B             C            D           E 
------------------      --------     -------    ---------  ---------
  {}                             false         true           0        null
  {false}                     false         false          0        null
  {null}                       null           false           1         1
  {true}                       true          true            1         0
  {null, false}              null           false           1         1
  {null, true}               true           null             2        1
  {false, true}             true           false           1        0
  {null, false, true}      true          false            2        1

For example, for an = ANY subquery:
   - if T.i=1 and R.i = {1,2,3}, return TRUE
   - if T.i=1 and R.i = {2,3}, return FALSE
   - if T.i=1 and R.i = {2,3, NULL}, return NULL
   - if T.i=1 and R.i = {}, return FALSE
   - if T.i=1 and R.i = {NULL}, return NULL

We can decorrelate the correlated subquery as follows:
----------------------------------------------
Project (if (D = 0) then FALSE else if (D > E) then TRUE else NULL)
  +- GroupBy[T.i, rn], D=count(C0), E=sum(C1)
          +- Project (C1: case when (T.i <op> R.i) is null then 1 else 0 end)
                +- Left Outer Apply
                      +- Window (rn: row_number() over())
                       |      +- TableScan on T
                      +- Filter ((T.i <op> R.i) IS NOT FALSE)
                            +- Project (C0: constant(TRUE))
                                   +- TableScan on R
----------------------------------------------

For the Filter ((T.i <op> R.i) IS NOT FALSE), rather than Filter (T.i <op> R.i IS TRUE),
because we want the tuple from R even when T.i or R.i is NULL.

Note that there is a Window on top of T, because we want to give every tuple from T
a unique identifier, so that we can group by it after the outer apply. If we know that
T.i is distinct, the Window operator can be removed, because T.i is already the unique
identifier. For PostgreSQL, we don't need the Window either, because every physical
tuple has a unique identifier, ctid (system column). So we are just using row number
to mock the ctid tuple identifier.

If R.i is distinct and the subquery is equality comparison, we can safely remove
GroupBy and Window, because we won't worry about multiple tuples from R that
can match T.

If the subquery has no outer references, we might be able to move the top aggregate
down to the inner relation of outer apply, providing it is just an IN or NOT IN subquery.

For ALL subqueries, those can be transformed into the equivalence of ANY subquery,
provided that there is corresponding inverse operator:
x <op> ALL (SQ)  <==> NOT(x <inverse-op> ANY (SQ))

Thanks ~
Haisheng Yuan
------------------------------------------------------------------
发件人：Julian Hyde<jh...@apache.org>
日　期：2019年04月16日 09:34:44
收件人：dev@calcite.apache.org<de...@calcite.apache.org>
主　题：Re: How to rewrite <>ANY subqueries

If you are lucky, it will be sufficient to use the the same
intermediate values that are used to execute NOT IN:
* does the value occur in the sub-query?
* how many values does the sub-query return?
* how many null values does the sub-query return?

If you know that x is not one of the values returned by the sub-query,
then the result is either false or unknown, depending on the number of
nulls.

If you know that x is one of the values returned by the sub-query,
then the result is either true or false, depending on whether there is
1 not-null value or more than one.

Write a good test of tests, write a truth table, and keep changing the
logic until all of the tests pass.

Julian

On Mon, Apr 15, 2019 at 5:15 PM Vineet Garg <vg...@apache.org> wrote:
>
> Hello folks,
>
> I am currently stuck on a problem and I wonder if  the SQL folks here could
> help me make progress.
>
> There is a bug in CALCITE where subqueries with =ANY and <>ANY are
> transformed incorrectly (CALCITE-2986
> <https://issues.apache.org/jira/browse/CALCITE-2986>). I was able to figure
> out the transformation for =ANY but I am unable to figure out how to
> rewrite <>ANY clause.
> I came across some sources online suggesting <>ANY could be rewritten using
> EXISTS or NOT IN but I fail to see how. Using EXISTS will not work because
> EXISTS only produce TRUE/FALSE but ANY requires three valued logic. NOT IN
> is equivalent to column<>value1 AND column<>value2 but <>ANY is equivalent
> to column<>value1 OR column<>value2 so I am not sure how these are
> equivalent either.
>
> There is some discussion on CALCITE-2986 regarding using LEFT OUTER JOIN,
> SEMI JOIN or Nested loop join but I don't believe any of them will produce
> correct semantics.
>
> Does anyone know what is the correct rewrite for <>ANY?
>
> P.S. Can someone review pull request to fix =ANY rewrite? - #1161
> <https://github.com/apache/calcite/pull/1161>
>
> Thanks,
> Vineet Garg

Re: How to rewrite <>ANY subqueries

Posted by Julian Hyde <jh...@apache.org>.

If you are lucky, it will be sufficient to use the the same
intermediate values that are used to execute NOT IN:
* does the value occur in the sub-query?
* how many values does the sub-query return?
* how many null values does the sub-query return?

If you know that x is not one of the values returned by the sub-query,
then the result is either false or unknown, depending on the number of
nulls.

If you know that x is one of the values returned by the sub-query,
then the result is either true or false, depending on whether there is
1 not-null value or more than one.

Write a good test of tests, write a truth table, and keep changing the
logic until all of the tests pass.

Julian

On Mon, Apr 15, 2019 at 5:15 PM Vineet Garg <vg...@apache.org> wrote:
>
> Hello folks,
>
> I am currently stuck on a problem and I wonder if  the SQL folks here could
> help me make progress.
>
> There is a bug in CALCITE where subqueries with =ANY and <>ANY are
> transformed incorrectly (CALCITE-2986
> <https://issues.apache.org/jira/browse/CALCITE-2986>). I was able to figure
> out the transformation for =ANY but I am unable to figure out how to
> rewrite <>ANY clause.
> I came across some sources online suggesting <>ANY could be rewritten using
> EXISTS or NOT IN but I fail to see how. Using EXISTS will not work because
> EXISTS only produce TRUE/FALSE but ANY requires three valued logic. NOT IN
> is equivalent to column<>value1 AND column<>value2 but <>ANY is equivalent
> to column<>value1 OR column<>value2 so I am not sure how these are
> equivalent either.
>
> There is some discussion on CALCITE-2986 regarding using LEFT OUTER JOIN,
> SEMI JOIN or Nested loop join but I don't believe any of them will produce
> correct semantics.
>
> Does anyone know what is the correct rewrite for <>ANY?
>
> P.S. Can someone review pull request to fix =ANY rewrite? - #1161
> <https://github.com/apache/calcite/pull/1161>
>
> Thanks,
> Vineet Garg