You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2022/02/17 15:17:32 UTC

Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Hi all,

Till today, I believed that pushing a filter below an inner join is not
strictly related to the operators (equals, plus, minus, cast, division)
present in the filter.

However, the query below will return some results if the filter is executed
after the join or it will fail with an exception if the filter is pushed
below the join.

EMP [EMPNO, NAME, DEPTNO]
0   | Alex | 0
10 | Bob | 1

DEP [DEPTNO]
1

SELECT e.name
FROM emp e
INNER JOIN dep d ON e.deptno = d.deptno
WHERE (10 / e.empno) = 1

It seems that SQL standard (Section 7.4 general rule 1) mandates that WHERE
should be applied to the result of FROM so in the case above pushing
filters below the join seems to violate the standard.

Currently, in Calcite we are going to push the division, cast, and
basically any kind of operator below the join. Many well-known databases
e.g., Postgres, do that as well.

Should we log it as a bug and try to do something about it or let it be as
it is?

Best,
Stamatis

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Julian Hyde <jh...@gmail.com>.

Thanks for logging this. I agree that it’s not high priority. I added some commentary about related bugs.

> On Oct 7, 2022, at 9:07 AM, Stamatis Zampetakis <za...@gmail.com> wrote:
> 
> I just logged https://issues.apache.org/jira/browse/CALCITE-5315 but I am
> not planning to work on this sometime soon.
> 
> Best,
> Stamatis
> 
> On Fri, Feb 18, 2022 at 10:56 PM Stamatis Zampetakis <za...@gmail.com>
> wrote:
> 
>> Thank you all for the feedback.
>> 
>> I will do a small research about the proper term to use for functions that
>> might throw / never throw and log a JIRA to move the discussion further.
>> 
>> @Viliam : Note that the problem you mention, although similar, is not
>> exactly the same as the one I brought up now.
>> 
>> The reordering of predicates in the WHERE clause has been brought up quite
>> a few times in the dev list and our stance [1] is that since the standard
>> leaves this decision to the implementor people should not rely on this (or
>> put the appropriate guards).
>> 
>> The case here is a bit different, at least the way I read the standard,
>> cause it defines the following:
>> 
>> "If all optional clauses are omitted, then the result of the <table
>> expression> is the same as the result of the
>> <from clause>. Otherwise, each specified clause is applied to the result
>> of the previously specified clause
>> and the result of the <table expression> is the result of the application
>> of the last specified clause."
>> 
>> and one of the optional clauses mentioned in the previous paragraph is the
>> <where clause>. There seems to be a clearly defined order between the <from
>> clause>, which includes inner joins, and the <where clause>.
>> 
>> Best,
>> Stamatis
>> 
>> [1] https://lists.apache.org/thread/mq44cnrohz19hh10btms126vbcoxl50w
>> 
>> On Fri, Feb 18, 2022 at 9:58 AM Viliam Durina <vi...@hazelcast.com>
>> wrote:
>> 
>>> I have observed this issue years ago in well-known databases. My case was
>>> much simpler:
>>> 
>>> data
>>> recordType:int  value:text
>>> -----           ------
>>> 0               1
>>> 1               a
>>> 
>>> SELECT *
>>> FROM data
>>> WHERE recordType='1' AND CAST(value AS INT)<10
>>> 
>>> 
>>> SQL is declarative, and unlike procedural languages, it doesn't prescribe
>>> short-circuit evaluation of the WHERE clause, or any specific evaluation
>>> order. If it was prescribed, the query would be perfectly safe. But
>>> prescribing the evaluation order would rule out many optimizations, or
>>> make
>>> them much harder, such as this half-null-half-error value.
>>> 
>>> For example, reordering additions might or might not lead to overflow:
>>> TINYINTs `100 + 100 - 90`, evaluated in this order, overflow, but `100 -
>>> 90
>>> + 100` don't - imagine each value comes from a different table and we
>>> reorder the joins. Perhaps result of TINYINT addition can be SMALLINT, but
>>> what if they are BIGINTs?
>>> 
>>> My understanding was that any expression must not fail in any plausible
>>> execution plan. Therefore if I parse the column `value`, it must succeed
>>> for every possible scanned row. In my specific case I ended up
>>> implementing
>>> a custom TO_NUMBER function that returns null on parsing error, and that
>>> null never appeared in the result because of the other condition.
>>> 
>>> Viliam
>>> 
>>> On Thu, 17 Feb 2022 at 20:08, Julian Hyde <jh...@gmail.com> wrote:
>>> 
>>>> You should definitely log a bug, even if there are no plans to fix it.
>>> It
>>>> is a violation of the standard, and therefore it is a bug, and
>>> therefore we
>>>> should document that it exists.
>>>> 
>>>> Can you do some research on the right terminology. You use the term
>>>> ‘unsafe’. I think the PL community uses the term ’strict’ for
>>> expressions
>>>> that must be evaluated (and therefore if they throw, the query is sure
>>> to
>>>> abort). We have other related concepts floating around in Calcite -
>>> dynamic
>>>> functions, deterministic operators - and standardizing terminology is
>>> key
>>>> if we are to keep the menagerie in order.
>>>> 
>>>> There might be a way to have our cake and eat it too. We could push down
>>>> the division, and if there is a division-by-zero we return a special
>>> value
>>>> such as NULL. If the special value is not eliminated by the join then
>>> it is
>>>> promoted to a throw. The “cake” here is the performance benefit of
>>> pushing
>>>> down a filter (that may never or rarely throw); the “eating it” is
>>> safety
>>>> on the occasion that the filter does throw.
>>>> 
>>>> Even if that strategy doesn’t work, maybe we could have a flag that says
>>>> whether to push down conditions that might throw.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>>> On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sd...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Is it feasible to prevent the filter push down in cases like this (
>>>> detect
>>>>> potential division by zero) ? What other exceptions that could happen?
>>>>> 
>>>>> Should it only push down some filters and preserve the complex filter
>>>> above
>>>>> the join?
>>>>> 
>>>>> Regarding a Jira ticket, generating an exception when it should
>>> produce
>>>>> valid results is gotta be a Jira case IMHO.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> Till today, I believed that pushing a filter below an inner join is
>>> not
>>>>>> strictly related to the operators (equals, plus, minus, cast,
>>> division)
>>>>>> present in the filter.
>>>>>> 
>>>>>> However, the query below will return some results if the filter is
>>>> executed
>>>>>> after the join or it will fail with an exception if the filter is
>>> pushed
>>>>>> below the join.
>>>>>> 
>>>>>> EMP [EMPNO, NAME, DEPTNO]
>>>>>> 0   | Alex | 0
>>>>>> 10 | Bob | 1
>>>>>> 
>>>>>> DEP [DEPTNO]
>>>>>> 1
>>>>>> 
>>>>>> SELECT e.name
>>>>>> FROM emp e
>>>>>> INNER JOIN dep d ON e.deptno = d.deptno
>>>>>> WHERE (10 / e.empno) = 1
>>>>>> 
>>>>>> It seems that SQL standard (Section 7.4 general rule 1) mandates that
>>>> WHERE
>>>>>> should be applied to the result of FROM so in the case above pushing
>>>>>> filters below the join seems to violate the standard.
>>>>>> 
>>>>>> Currently, in Calcite we are going to push the division, cast, and
>>>>>> basically any kind of operator below the join. Many well-known
>>> databases
>>>>>> e.g., Postgres, do that as well.
>>>>>> 
>>>>>> Should we log it as a bug and try to do something about it or let it
>>> be
>>>> as
>>>>>> it is?
>>>>>> 
>>>>>> Best,
>>>>>> Stamatis
>>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> This message contains confidential information and is intended only for
>>> the
>>> individuals named. If you are not the named addressee you should not
>>> disseminate, distribute or copy this e-mail. Please notify the sender
>>> immediately by e-mail if you have received this e-mail by mistake and
>>> delete this e-mail from your system. E-mail transmission cannot be
>>> guaranteed to be secure or error-free as information could be
>>> intercepted,
>>> corrupted, lost, destroyed, arrive late or incomplete, or contain
>>> viruses.
>>> The sender therefore does not accept liability for any errors or
>>> omissions
>>> in the contents of this message, which arise as a result of e-mail
>>> transmission. If verification is required, please request a hard-copy
>>> version. -Hazelcast
>>> 
>>

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Zoltan Haindrich <ki...@rxd.hu>.

I think there might be an another approach; by making it possible to define safetiness of an operator - going back to the first example and considering:

SELECT e.name
FROM emp e
INNER JOIN dep d ON e.deptno = d.deptno
WHERE (10 / e.empno) = 1

and the pushed down version:

SELECT e.name
FROM (select * from emp WHERE (10 / empno)=1) e
INNER JOIN dep d ON e.deptno = d.deptno

There are systems which handle divison_by_zero as a shortcut to NULL; for those the above two queries are conveniently equivalent.

Limiting optimizations by stopping because of unsafety might hurt some system's performance - I think an alternate approach could be to:
* mark operators regarding their safety
* optionally provide a way to declare the safety conversion process;
   for DIVISION this could be: (case e.empno when 0 then NULL else e.empno end)
* during push down the condition with unsafe elements should also be kept at the top level to ensure that we don't loose correctness

so the pushed down version would be:
---
SELECT e.name
FROM (select * from emp WHERE  (10 / (case empno when 0 then NULL else empno end) ) = 1 ) e
INNER JOIN dep d ON e.deptno = d.deptno
WHERE (10 / e.empno) = 1

This approch could possibly also benefit from optimizations based constraints/etc more easily:
* say if empno is known to be >0 at compilation time
* the case could be removed
* get back to the original expression
* finally remove the identical conditional from the top level

cheers,
Zoltan




On 10/7/22 18:07, Stamatis Zampetakis wrote:
> I just logged https://issues.apache.org/jira/browse/CALCITE-5315 but I am
> not planning to work on this sometime soon.
> 
> Best,
> Stamatis
> 
> On Fri, Feb 18, 2022 at 10:56 PM Stamatis Zampetakis <za...@gmail.com>
> wrote:
> 
>> Thank you all for the feedback.
>>
>> I will do a small research about the proper term to use for functions that
>> might throw / never throw and log a JIRA to move the discussion further.
>>
>> @Viliam : Note that the problem you mention, although similar, is not
>> exactly the same as the one I brought up now.
>>
>> The reordering of predicates in the WHERE clause has been brought up quite
>> a few times in the dev list and our stance [1] is that since the standard
>> leaves this decision to the implementor people should not rely on this (or
>> put the appropriate guards).
>>
>> The case here is a bit different, at least the way I read the standard,
>> cause it defines the following:
>>
>> "If all optional clauses are omitted, then the result of the <table
>> expression> is the same as the result of the
>> <from clause>. Otherwise, each specified clause is applied to the result
>> of the previously specified clause
>> and the result of the <table expression> is the result of the application
>> of the last specified clause."
>>
>> and one of the optional clauses mentioned in the previous paragraph is the
>> <where clause>. There seems to be a clearly defined order between the <from
>> clause>, which includes inner joins, and the <where clause>.
>>
>> Best,
>> Stamatis
>>
>> [1] https://lists.apache.org/thread/mq44cnrohz19hh10btms126vbcoxl50w
>>
>> On Fri, Feb 18, 2022 at 9:58 AM Viliam Durina <vi...@hazelcast.com>
>> wrote:
>>
>>> I have observed this issue years ago in well-known databases. My case was
>>> much simpler:
>>>
>>> data
>>> recordType:int  value:text
>>> -----           ------
>>> 0               1
>>> 1               a
>>>
>>> SELECT *
>>> FROM data
>>> WHERE recordType='1' AND CAST(value AS INT)<10
>>>
>>>
>>> SQL is declarative, and unlike procedural languages, it doesn't prescribe
>>> short-circuit evaluation of the WHERE clause, or any specific evaluation
>>> order. If it was prescribed, the query would be perfectly safe. But
>>> prescribing the evaluation order would rule out many optimizations, or
>>> make
>>> them much harder, such as this half-null-half-error value.
>>>
>>> For example, reordering additions might or might not lead to overflow:
>>> TINYINTs `100 + 100 - 90`, evaluated in this order, overflow, but `100 -
>>> 90
>>> + 100` don't - imagine each value comes from a different table and we
>>> reorder the joins. Perhaps result of TINYINT addition can be SMALLINT, but
>>> what if they are BIGINTs?
>>>
>>> My understanding was that any expression must not fail in any plausible
>>> execution plan. Therefore if I parse the column `value`, it must succeed
>>> for every possible scanned row. In my specific case I ended up
>>> implementing
>>> a custom TO_NUMBER function that returns null on parsing error, and that
>>> null never appeared in the result because of the other condition.
>>>
>>> Viliam
>>>
>>> On Thu, 17 Feb 2022 at 20:08, Julian Hyde <jh...@gmail.com> wrote:
>>>
>>>> You should definitely log a bug, even if there are no plans to fix it.
>>> It
>>>> is a violation of the standard, and therefore it is a bug, and
>>> therefore we
>>>> should document that it exists.
>>>>
>>>> Can you do some research on the right terminology. You use the term
>>>> ‘unsafe’. I think the PL community uses the term ’strict’ for
>>> expressions
>>>> that must be evaluated (and therefore if they throw, the query is sure
>>> to
>>>> abort). We have other related concepts floating around in Calcite -
>>> dynamic
>>>> functions, deterministic operators - and standardizing terminology is
>>> key
>>>> if we are to keep the menagerie in order.
>>>>
>>>> There might be a way to have our cake and eat it too. We could push down
>>>> the division, and if there is a division-by-zero we return a special
>>> value
>>>> such as NULL. If the special value is not eliminated by the join then
>>> it is
>>>> promoted to a throw. The “cake” here is the performance benefit of
>>> pushing
>>>> down a filter (that may never or rarely throw); the “eating it” is
>>> safety
>>>> on the occasion that the filter does throw.
>>>>
>>>> Even if that strategy doesn’t work, maybe we could have a flag that says
>>>> whether to push down conditions that might throw.
>>>>
>>>> Julian
>>>>
>>>>
>>>>> On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sd...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Is it feasible to prevent the filter push down in cases like this (
>>>> detect
>>>>> potential division by zero) ? What other exceptions that could happen?
>>>>>
>>>>> Should it only push down some filters and preserve the complex filter
>>>> above
>>>>> the join?
>>>>>
>>>>> Regarding a Jira ticket, generating an exception when it should
>>> produce
>>>>> valid results is gotta be a Jira case IMHO.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Till today, I believed that pushing a filter below an inner join is
>>> not
>>>>>> strictly related to the operators (equals, plus, minus, cast,
>>> division)
>>>>>> present in the filter.
>>>>>>
>>>>>> However, the query below will return some results if the filter is
>>>> executed
>>>>>> after the join or it will fail with an exception if the filter is
>>> pushed
>>>>>> below the join.
>>>>>>
>>>>>> EMP [EMPNO, NAME, DEPTNO]
>>>>>> 0   | Alex | 0
>>>>>> 10 | Bob | 1
>>>>>>
>>>>>> DEP [DEPTNO]
>>>>>> 1
>>>>>>
>>>>>> SELECT e.name
>>>>>> FROM emp e
>>>>>> INNER JOIN dep d ON e.deptno = d.deptno
>>>>>> WHERE (10 / e.empno) = 1
>>>>>>
>>>>>> It seems that SQL standard (Section 7.4 general rule 1) mandates that
>>>> WHERE
>>>>>> should be applied to the result of FROM so in the case above pushing
>>>>>> filters below the join seems to violate the standard.
>>>>>>
>>>>>> Currently, in Calcite we are going to push the division, cast, and
>>>>>> basically any kind of operator below the join. Many well-known
>>> databases
>>>>>> e.g., Postgres, do that as well.
>>>>>>
>>>>>> Should we log it as a bug and try to do something about it or let it
>>> be
>>>> as
>>>>>> it is?
>>>>>>
>>>>>> Best,
>>>>>> Stamatis
>>>>>>
>>>>
>>>>
>>>
>>> --
>>> This message contains confidential information and is intended only for
>>> the
>>> individuals named. If you are not the named addressee you should not
>>> disseminate, distribute or copy this e-mail. Please notify the sender
>>> immediately by e-mail if you have received this e-mail by mistake and
>>> delete this e-mail from your system. E-mail transmission cannot be
>>> guaranteed to be secure or error-free as information could be
>>> intercepted,
>>> corrupted, lost, destroyed, arrive late or incomplete, or contain
>>> viruses.
>>> The sender therefore does not accept liability for any errors or
>>> omissions
>>> in the contents of this message, which arise as a result of e-mail
>>> transmission. If verification is required, please request a hard-copy
>>> version. -Hazelcast
>>>
>>
>

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Stamatis Zampetakis <za...@gmail.com>.

I just logged https://issues.apache.org/jira/browse/CALCITE-5315 but I am
not planning to work on this sometime soon.

Best,
Stamatis

On Fri, Feb 18, 2022 at 10:56 PM Stamatis Zampetakis <za...@gmail.com>
wrote:

> Thank you all for the feedback.
>
> I will do a small research about the proper term to use for functions that
> might throw / never throw and log a JIRA to move the discussion further.
>
> @Viliam : Note that the problem you mention, although similar, is not
> exactly the same as the one I brought up now.
>
> The reordering of predicates in the WHERE clause has been brought up quite
> a few times in the dev list and our stance [1] is that since the standard
> leaves this decision to the implementor people should not rely on this (or
> put the appropriate guards).
>
> The case here is a bit different, at least the way I read the standard,
> cause it defines the following:
>
> "If all optional clauses are omitted, then the result of the <table
> expression> is the same as the result of the
> <from clause>. Otherwise, each specified clause is applied to the result
> of the previously specified clause
> and the result of the <table expression> is the result of the application
> of the last specified clause."
>
> and one of the optional clauses mentioned in the previous paragraph is the
> <where clause>. There seems to be a clearly defined order between the <from
> clause>, which includes inner joins, and the <where clause>.
>
> Best,
> Stamatis
>
> [1] https://lists.apache.org/thread/mq44cnrohz19hh10btms126vbcoxl50w
>
> On Fri, Feb 18, 2022 at 9:58 AM Viliam Durina <vi...@hazelcast.com>
> wrote:
>
>> I have observed this issue years ago in well-known databases. My case was
>> much simpler:
>>
>> data
>> recordType:int  value:text
>> -----           ------
>> 0               1
>> 1               a
>>
>> SELECT *
>> FROM data
>> WHERE recordType='1' AND CAST(value AS INT)<10
>>
>>
>> SQL is declarative, and unlike procedural languages, it doesn't prescribe
>> short-circuit evaluation of the WHERE clause, or any specific evaluation
>> order. If it was prescribed, the query would be perfectly safe. But
>> prescribing the evaluation order would rule out many optimizations, or
>> make
>> them much harder, such as this half-null-half-error value.
>>
>> For example, reordering additions might or might not lead to overflow:
>> TINYINTs `100 + 100 - 90`, evaluated in this order, overflow, but `100 -
>> 90
>> + 100` don't - imagine each value comes from a different table and we
>> reorder the joins. Perhaps result of TINYINT addition can be SMALLINT, but
>> what if they are BIGINTs?
>>
>> My understanding was that any expression must not fail in any plausible
>> execution plan. Therefore if I parse the column `value`, it must succeed
>> for every possible scanned row. In my specific case I ended up
>> implementing
>> a custom TO_NUMBER function that returns null on parsing error, and that
>> null never appeared in the result because of the other condition.
>>
>> Viliam
>>
>> On Thu, 17 Feb 2022 at 20:08, Julian Hyde <jh...@gmail.com> wrote:
>>
>> > You should definitely log a bug, even if there are no plans to fix it.
>> It
>> > is a violation of the standard, and therefore it is a bug, and
>> therefore we
>> > should document that it exists.
>> >
>> > Can you do some research on the right terminology. You use the term
>> > ‘unsafe’. I think the PL community uses the term ’strict’ for
>> expressions
>> > that must be evaluated (and therefore if they throw, the query is sure
>> to
>> > abort). We have other related concepts floating around in Calcite -
>> dynamic
>> > functions, deterministic operators - and standardizing terminology is
>> key
>> > if we are to keep the menagerie in order.
>> >
>> > There might be a way to have our cake and eat it too. We could push down
>> > the division, and if there is a division-by-zero we return a special
>> value
>> > such as NULL. If the special value is not eliminated by the join then
>> it is
>> > promoted to a throw. The “cake” here is the performance benefit of
>> pushing
>> > down a filter (that may never or rarely throw); the “eating it” is
>> safety
>> > on the occasion that the filter does throw.
>> >
>> > Even if that strategy doesn’t work, maybe we could have a flag that says
>> > whether to push down conditions that might throw.
>> >
>> > Julian
>> >
>> >
>> > > On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sd...@gmail.com>
>> > wrote:
>> > >
>> > > Is it feasible to prevent the filter push down in cases like this (
>> > detect
>> > > potential division by zero) ? What other exceptions that could happen?
>> > >
>> > > Should it only push down some filters and preserve the complex filter
>> > above
>> > > the join?
>> > >
>> > > Regarding a Jira ticket, generating an exception when it should
>> produce
>> > > valid results is gotta be a Jira case IMHO.
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com>
>> > wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> Till today, I believed that pushing a filter below an inner join is
>> not
>> > >> strictly related to the operators (equals, plus, minus, cast,
>> division)
>> > >> present in the filter.
>> > >>
>> > >> However, the query below will return some results if the filter is
>> > executed
>> > >> after the join or it will fail with an exception if the filter is
>> pushed
>> > >> below the join.
>> > >>
>> > >> EMP [EMPNO, NAME, DEPTNO]
>> > >> 0   | Alex | 0
>> > >> 10 | Bob | 1
>> > >>
>> > >> DEP [DEPTNO]
>> > >> 1
>> > >>
>> > >> SELECT e.name
>> > >> FROM emp e
>> > >> INNER JOIN dep d ON e.deptno = d.deptno
>> > >> WHERE (10 / e.empno) = 1
>> > >>
>> > >> It seems that SQL standard (Section 7.4 general rule 1) mandates that
>> > WHERE
>> > >> should be applied to the result of FROM so in the case above pushing
>> > >> filters below the join seems to violate the standard.
>> > >>
>> > >> Currently, in Calcite we are going to push the division, cast, and
>> > >> basically any kind of operator below the join. Many well-known
>> databases
>> > >> e.g., Postgres, do that as well.
>> > >>
>> > >> Should we log it as a bug and try to do something about it or let it
>> be
>> > as
>> > >> it is?
>> > >>
>> > >> Best,
>> > >> Stamatis
>> > >>
>> >
>> >
>>
>> --
>> This message contains confidential information and is intended only for
>> the
>> individuals named. If you are not the named addressee you should not
>> disseminate, distribute or copy this e-mail. Please notify the sender
>> immediately by e-mail if you have received this e-mail by mistake and
>> delete this e-mail from your system. E-mail transmission cannot be
>> guaranteed to be secure or error-free as information could be
>> intercepted,
>> corrupted, lost, destroyed, arrive late or incomplete, or contain
>> viruses.
>> The sender therefore does not accept liability for any errors or
>> omissions
>> in the contents of this message, which arise as a result of e-mail
>> transmission. If verification is required, please request a hard-copy
>> version. -Hazelcast
>>
>

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Stamatis Zampetakis <za...@gmail.com>.

Thank you all for the feedback.

I will do a small research about the proper term to use for functions that
might throw / never throw and log a JIRA to move the discussion further.

@Viliam : Note that the problem you mention, although similar, is not
exactly the same as the one I brought up now.

The reordering of predicates in the WHERE clause has been brought up quite
a few times in the dev list and our stance [1] is that since the standard
leaves this decision to the implementor people should not rely on this (or
put the appropriate guards).

The case here is a bit different, at least the way I read the standard,
cause it defines the following:

"If all optional clauses are omitted, then the result of the <table
expression> is the same as the result of the
<from clause>. Otherwise, each specified clause is applied to the result of
the previously specified clause
and the result of the <table expression> is the result of the application
of the last specified clause."

and one of the optional clauses mentioned in the previous paragraph is the
<where clause>. There seems to be a clearly defined order between the <from
clause>, which includes inner joins, and the <where clause>.

Best,
Stamatis

[1] https://lists.apache.org/thread/mq44cnrohz19hh10btms126vbcoxl50w

On Fri, Feb 18, 2022 at 9:58 AM Viliam Durina <vi...@hazelcast.com> wrote:

> I have observed this issue years ago in well-known databases. My case was
> much simpler:
>
> data
> recordType:int  value:text
> -----           ------
> 0               1
> 1               a
>
> SELECT *
> FROM data
> WHERE recordType='1' AND CAST(value AS INT)<10
>
>
> SQL is declarative, and unlike procedural languages, it doesn't prescribe
> short-circuit evaluation of the WHERE clause, or any specific evaluation
> order. If it was prescribed, the query would be perfectly safe. But
> prescribing the evaluation order would rule out many optimizations, or make
> them much harder, such as this half-null-half-error value.
>
> For example, reordering additions might or might not lead to overflow:
> TINYINTs `100 + 100 - 90`, evaluated in this order, overflow, but `100 - 90
> + 100` don't - imagine each value comes from a different table and we
> reorder the joins. Perhaps result of TINYINT addition can be SMALLINT, but
> what if they are BIGINTs?
>
> My understanding was that any expression must not fail in any plausible
> execution plan. Therefore if I parse the column `value`, it must succeed
> for every possible scanned row. In my specific case I ended up implementing
> a custom TO_NUMBER function that returns null on parsing error, and that
> null never appeared in the result because of the other condition.
>
> Viliam
>
> On Thu, 17 Feb 2022 at 20:08, Julian Hyde <jh...@gmail.com> wrote:
>
> > You should definitely log a bug, even if there are no plans to fix it. It
> > is a violation of the standard, and therefore it is a bug, and therefore
> we
> > should document that it exists.
> >
> > Can you do some research on the right terminology. You use the term
> > ‘unsafe’. I think the PL community uses the term ’strict’ for expressions
> > that must be evaluated (and therefore if they throw, the query is sure to
> > abort). We have other related concepts floating around in Calcite -
> dynamic
> > functions, deterministic operators - and standardizing terminology is key
> > if we are to keep the menagerie in order.
> >
> > There might be a way to have our cake and eat it too. We could push down
> > the division, and if there is a division-by-zero we return a special
> value
> > such as NULL. If the special value is not eliminated by the join then it
> is
> > promoted to a throw. The “cake” here is the performance benefit of
> pushing
> > down a filter (that may never or rarely throw); the “eating it” is safety
> > on the occasion that the filter does throw.
> >
> > Even if that strategy doesn’t work, maybe we could have a flag that says
> > whether to push down conditions that might throw.
> >
> > Julian
> >
> >
> > > On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sd...@gmail.com>
> > wrote:
> > >
> > > Is it feasible to prevent the filter push down in cases like this (
> > detect
> > > potential division by zero) ? What other exceptions that could happen?
> > >
> > > Should it only push down some filters and preserve the complex filter
> > above
> > > the join?
> > >
> > > Regarding a Jira ticket, generating an exception when it should produce
> > > valid results is gotta be a Jira case IMHO.
> > >
> > >
> > >
> > >
> > > On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com>
> > wrote:
> > >
> > >> Hi all,
> > >>
> > >> Till today, I believed that pushing a filter below an inner join is
> not
> > >> strictly related to the operators (equals, plus, minus, cast,
> division)
> > >> present in the filter.
> > >>
> > >> However, the query below will return some results if the filter is
> > executed
> > >> after the join or it will fail with an exception if the filter is
> pushed
> > >> below the join.
> > >>
> > >> EMP [EMPNO, NAME, DEPTNO]
> > >> 0   | Alex | 0
> > >> 10 | Bob | 1
> > >>
> > >> DEP [DEPTNO]
> > >> 1
> > >>
> > >> SELECT e.name
> > >> FROM emp e
> > >> INNER JOIN dep d ON e.deptno = d.deptno
> > >> WHERE (10 / e.empno) = 1
> > >>
> > >> It seems that SQL standard (Section 7.4 general rule 1) mandates that
> > WHERE
> > >> should be applied to the result of FROM so in the case above pushing
> > >> filters below the join seems to violate the standard.
> > >>
> > >> Currently, in Calcite we are going to push the division, cast, and
> > >> basically any kind of operator below the join. Many well-known
> databases
> > >> e.g., Postgres, do that as well.
> > >>
> > >> Should we log it as a bug and try to do something about it or let it
> be
> > as
> > >> it is?
> > >>
> > >> Best,
> > >> Stamatis
> > >>
> >
> >
>
> --
> This message contains confidential information and is intended only for
> the
> individuals named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. E-mail transmission cannot be
> guaranteed to be secure or error-free as information could be intercepted,
> corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> The sender therefore does not accept liability for any errors or omissions
> in the contents of this message, which arise as a result of e-mail
> transmission. If verification is required, please request a hard-copy
> version. -Hazelcast
>

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Viliam Durina <vi...@hazelcast.com>.

I have observed this issue years ago in well-known databases. My case was
much simpler:

data
recordType:int  value:text
-----           ------
0               1
1               a

SELECT *
FROM data
WHERE recordType='1' AND CAST(value AS INT)<10


SQL is declarative, and unlike procedural languages, it doesn't prescribe
short-circuit evaluation of the WHERE clause, or any specific evaluation
order. If it was prescribed, the query would be perfectly safe. But
prescribing the evaluation order would rule out many optimizations, or make
them much harder, such as this half-null-half-error value.

For example, reordering additions might or might not lead to overflow:
TINYINTs `100 + 100 - 90`, evaluated in this order, overflow, but `100 - 90
+ 100` don't - imagine each value comes from a different table and we
reorder the joins. Perhaps result of TINYINT addition can be SMALLINT, but
what if they are BIGINTs?

My understanding was that any expression must not fail in any plausible
execution plan. Therefore if I parse the column `value`, it must succeed
for every possible scanned row. In my specific case I ended up implementing
a custom TO_NUMBER function that returns null on parsing error, and that
null never appeared in the result because of the other condition.

Viliam

On Thu, 17 Feb 2022 at 20:08, Julian Hyde <jh...@gmail.com> wrote:

> You should definitely log a bug, even if there are no plans to fix it. It
> is a violation of the standard, and therefore it is a bug, and therefore we
> should document that it exists.
>
> Can you do some research on the right terminology. You use the term
> ‘unsafe’. I think the PL community uses the term ’strict’ for expressions
> that must be evaluated (and therefore if they throw, the query is sure to
> abort). We have other related concepts floating around in Calcite - dynamic
> functions, deterministic operators - and standardizing terminology is key
> if we are to keep the menagerie in order.
>
> There might be a way to have our cake and eat it too. We could push down
> the division, and if there is a division-by-zero we return a special value
> such as NULL. If the special value is not eliminated by the join then it is
> promoted to a throw. The “cake” here is the performance benefit of pushing
> down a filter (that may never or rarely throw); the “eating it” is safety
> on the occasion that the filter does throw.
>
> Even if that strategy doesn’t work, maybe we could have a flag that says
> whether to push down conditions that might throw.
>
> Julian
>
>
> > On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sd...@gmail.com>
> wrote:
> >
> > Is it feasible to prevent the filter push down in cases like this (
> detect
> > potential division by zero) ? What other exceptions that could happen?
> >
> > Should it only push down some filters and preserve the complex filter
> above
> > the join?
> >
> > Regarding a Jira ticket, generating an exception when it should produce
> > valid results is gotta be a Jira case IMHO.
> >
> >
> >
> >
> > On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com>
> wrote:
> >
> >> Hi all,
> >>
> >> Till today, I believed that pushing a filter below an inner join is not
> >> strictly related to the operators (equals, plus, minus, cast, division)
> >> present in the filter.
> >>
> >> However, the query below will return some results if the filter is
> executed
> >> after the join or it will fail with an exception if the filter is pushed
> >> below the join.
> >>
> >> EMP [EMPNO, NAME, DEPTNO]
> >> 0   | Alex | 0
> >> 10 | Bob | 1
> >>
> >> DEP [DEPTNO]
> >> 1
> >>
> >> SELECT e.name
> >> FROM emp e
> >> INNER JOIN dep d ON e.deptno = d.deptno
> >> WHERE (10 / e.empno) = 1
> >>
> >> It seems that SQL standard (Section 7.4 general rule 1) mandates that
> WHERE
> >> should be applied to the result of FROM so in the case above pushing
> >> filters below the join seems to violate the standard.
> >>
> >> Currently, in Calcite we are going to push the division, cast, and
> >> basically any kind of operator below the join. Many well-known databases
> >> e.g., Postgres, do that as well.
> >>
> >> Should we log it as a bug and try to do something about it or let it be
> as
> >> it is?
> >>
> >> Best,
> >> Stamatis
> >>
>
>

-- 
This message contains confidential information and is intended only for the 
individuals named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. 
The sender therefore does not accept liability for any errors or omissions 
in the contents of this message, which arise as a result of e-mail 
transmission. If verification is required, please request a hard-copy 
version. -Hazelcast

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Julian Hyde <jh...@gmail.com>.

You should definitely log a bug, even if there are no plans to fix it. It is a violation of the standard, and therefore it is a bug, and therefore we should document that it exists.

Can you do some research on the right terminology. You use the term ‘unsafe’. I think the PL community uses the term ’strict’ for expressions that must be evaluated (and therefore if they throw, the query is sure to abort). We have other related concepts floating around in Calcite - dynamic functions, deterministic operators - and standardizing terminology is key if we are to keep the menagerie in order.

There might be a way to have our cake and eat it too. We could push down the division, and if there is a division-by-zero we return a special value such as NULL. If the special value is not eliminated by the join then it is promoted to a throw. The “cake” here is the performance benefit of pushing down a filter (that may never or rarely throw); the “eating it” is safety on the occasion that the filter does throw.

Even if that strategy doesn’t work, maybe we could have a flag that says whether to push down conditions that might throw.

Julian

> On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sd...@gmail.com> wrote:
> 
> Is it feasible to prevent the filter push down in cases like this ( detect
> potential division by zero) ? What other exceptions that could happen?
> 
> Should it only push down some filters and preserve the complex filter above
> the join?
> 
> Regarding a Jira ticket, generating an exception when it should produce
> valid results is gotta be a Jira case IMHO.
> 
> 
> 
> 
> On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> Till today, I believed that pushing a filter below an inner join is not
>> strictly related to the operators (equals, plus, minus, cast, division)
>> present in the filter.
>> 
>> However, the query below will return some results if the filter is executed
>> after the join or it will fail with an exception if the filter is pushed
>> below the join.
>> 
>> EMP [EMPNO, NAME, DEPTNO]
>> 0   | Alex | 0
>> 10 | Bob | 1
>> 
>> DEP [DEPTNO]
>> 1
>> 
>> SELECT e.name
>> FROM emp e
>> INNER JOIN dep d ON e.deptno = d.deptno
>> WHERE (10 / e.empno) = 1
>> 
>> It seems that SQL standard (Section 7.4 general rule 1) mandates that WHERE
>> should be applied to the result of FROM so in the case above pushing
>> filters below the join seems to violate the standard.
>> 
>> Currently, in Calcite we are going to push the division, cast, and
>> basically any kind of operator below the join. Many well-known databases
>> e.g., Postgres, do that as well.
>> 
>> Should we log it as a bug and try to do something about it or let it be as
>> it is?
>> 
>> Best,
>> Stamatis
>>

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Posted by Scott Reynolds <sd...@gmail.com>.

 Is it feasible to prevent the filter push down in cases like this ( detect
potential division by zero) ? What other exceptions that could happen?

Should it only push down some filters and preserve the complex filter above
the join?

Regarding a Jira ticket, generating an exception when it should produce
valid results is gotta be a Jira case IMHO.




On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <za...@gmail.com> wrote:

> Hi all,
>
> Till today, I believed that pushing a filter below an inner join is not
> strictly related to the operators (equals, plus, minus, cast, division)
> present in the filter.
>
> However, the query below will return some results if the filter is executed
> after the join or it will fail with an exception if the filter is pushed
> below the join.
>
> EMP [EMPNO, NAME, DEPTNO]
> 0   | Alex | 0
> 10 | Bob | 1
>
> DEP [DEPTNO]
> 1
>
> SELECT e.name
> FROM emp e
> INNER JOIN dep d ON e.deptno = d.deptno
> WHERE (10 / e.empno) = 1
>
> It seems that SQL standard (Section 7.4 general rule 1) mandates that WHERE
> should be applied to the result of FROM so in the case above pushing
> filters below the join seems to violate the standard.
>
> Currently, in Calcite we are going to push the division, cast, and
> basically any kind of operator below the join. Many well-known databases
> e.g., Postgres, do that as well.
>
> Should we log it as a bug and try to do something about it or let it be as
> it is?
>
> Best,
> Stamatis
>