You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Julian Hyde <ju...@hydromatic.net> on 2014/08/18 21:29:54 UTC

NOT IN and NULL values (OPTIQ-373)

I’m trying to figure out how to attack the NOT IN bug, https://issues.apache.org/jira/browse/OPTIQ-373  (I thought I’d addressed it — see JdbcTest.testNotInQueryWithNull — but I wrote the test case wrong, and we do indeed have a problem.)

I’d like to translate NOT IN similarly to how we translate IN, EXISTS, NOT EXISTS and scalar sub-queries (i.e. to some combination of a join, probably an outer join, on top of an aggregate). Then have a rule that recognizes that pattern and translates to SemiJoin.

That seems more roundabout than going straight to semi-join, but (1) it will allow NOT IN queries to run on back-ends that do not have an implementation of SemiJoin, and (2) it should allow us to do decorrelation, (3) it should allow us to represent cases that cannot be represented as semi-join, e.g. ‘(x NOT IN subquery) IS UNKNOWN OR otherCondition'

There are a bunch of ways to translate the query where you first check then whether the sub-query returns NULL then, then run again using 2-valued boolean semantics.

For example, the query

SELECT * FROM d
WHERE d NOT IN (
  SELECT d FROM e)

can be translated to

SELECT * FROM d
WHERE NOT EXISTS (SELECT 1 FROM e WHERE deptno IS NULL)
AND deptno IN (SELECT deptno FROM e WHERE deptno IS NOT NULL)

But in so doing, we have duplicated the sub-query. That places a big burden on the rule that would recognize the two identical sub-queries and convert to some variant of semi-join.

So, is there a translation that only uses one copy of the query?

Julian

Re: NOT IN and NULL values (OPTIQ-373)

Posted by Julian Hyde <ju...@hydromatic.net>.

On Tue, Aug 19, 2014 at 8:22 AM, Vladimir Sitnikov
<si...@gmail.com> wrote:
>
> Do you have real life cases that require this not in semantics?
>
> In my experience, this behaviour is typically unexpected by developers.
> I admit we want a compliant implementation, however I wonder if it is worth
> tuning these kind of queries.

I've never come across a case where a developer wanted the bizarre
null-semantics of NOT IN. But as you say, it's the standard, so we
don't have a choice to change it.

I agree that we should not spend a great deal of effort optimizing
these queries.

There are cases where it is convenient to express a query using NOT IN
rather than say NOT EXISTS. If you are sure that your sub-query will
never return NULL, it's a reasonable thing to do. For example, all
departments with no female employees:

SELECT * FROM dept
WHERE deptno NOT IN (SELECT deptno FROM emp WHERE gender = 'F')

is more concise than

SELECT * FROM dept
WHERE NOT EXISTS (
  SELECT 1 FROM emp
  WHERE emp.deptno = dept.deptno
  AND emp.gender = 'F')

and is "safe" because emp.deptno has a NOT NULL constraint.

>
> >(2) it should allow us to do decorrelation
> Can you please explain what is that?

Decorrelation is the process of converting a query that contains one
or more correlated sub-queries (i.e. sub-queries that reference
columns on a table in an enclosing query) into an equivalent query
that has no correlated sub-queries.

Queries without correlations are easier to optimize and are usually
more efficient at run time.

> > (3) it should allow us to represent cases that cannot be represented as
> > semi-join, e.g. ‘(x NOT IN subquery) IS UNKNOWN OR otherCondition'

> Can you please explain why this cannot be represented as a semi-join?

Semi-joins are a sequence of filters. They either remove a row, or do
not. Effectively, they implement two-value boolean logic, and can only
be combined using AND.

If "otherCondition", above, used a sub-query, say a NOT IN or NOT
EXISTS, the translator would have real trouble getting the right
semantics.

A technique that has been useful for all other kinds of sub-queries,
including correlated sub-queries, is what I call a "lookup" query. It
is a generalization of semi-join to return more than true/false.
Suppose you are translating the above NOT EXISTS query. You can
translate it to

SELECT d.*
FROM dept LEFT JOIN (
  SELECT DISTINCT valgen.deptno, TRUE AS indicator
  FROM (SELECT DISTINCT deptno FROM dept) AS valgen
    CROSS JOIN emp
  WHERE emp.deptno = valgen.deptno
  AND emp.gender = 'F') AS lookup
ON dept.deptno = lookup.deptno
WHERE NOT lookup.indicator

Note that "EXISTS ..." has been converted to "lookup.indicator". This
approach can handle multiple sub-queries, 3-valued boolean logic, and
(for scalar sub-queries) other types besides boolean.

The "LEFT JOIN" combined with the "DISTINCT" ensures that joining to
the sub-query does not increase or decrease the number of rows. Any
filtering happens under the control of the WHERE clause.

I recently added [1] a rule to convert this generalized semi-join into
a regular semi-join if (a) the LEFT JOIN can be converted to INNER,
(b) the filter condition is "lookup.indicator IS TRUE", (c) the rows
from the indicator query are unique on the join columns.

I'd like to extend that rule to deal with "lookup.indicator IS NOT
TRUE", so we can have anti-semi-joins.

Julian

[1] https://issues.apache.org/jira/browse/OPTIQ-368

Re: NOT IN and NULL values (OPTIQ-373)

Posted by Vladimir Sitnikov <si...@gmail.com>.

>(3) it should allow us to represent cases that cannot be represented as
semi-join, e.g. ‘(x NOT IN subquery) IS UNKNOWN OR otherCondition'

Can you please explain why this cannot be represented as a semi-join?

Vladimir

Re: NOT IN and NULL values (OPTIQ-373)

Posted by Vladimir Sitnikov <si...@gmail.com>.

Do you have real life cases that require this not in semantics?

In my experience, this behaviour is typically unexpected by developers.
I admit we want a compliant implementation, however I wonder if it is worth
tuning these kind of queries.

>(2) it should allow us to do decorrelation
Can you please explain what is that?


Vladimir