You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Julian Hyde <jh...@gmail.com> on 2020/08/09 19:26:55 UTC

[DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

We have had several discussions over the years about how to represent IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.

I have generally taken the position that we should expand to ORs (e.g. “x = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow IN in RexCall.

I have given this some further thought, as part of a couple of RexSimplify bugs [1] [2] and I now think we should replace IN with something more powerful, namely sargs. A sarg (or search argument [3]) is an ordered set of intervals, and can be represented as a Guava ImmutableRangeSet, such as "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants, but also ranges.

Today you would write

  RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))

With sargs, you would instead write

  RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1], [3..3], [5..5]]”)))

There is a new operator IN_SARG, and a new node type RexSarg (it could almost be a RexLiteral).

Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if we invest in sarg support in RexSimplify and RelOptPredicateList, that investment is likely to pay dividends in better plans.

Guava RangeSets, and hence sargs, have support for discrete domains, so they can easily optimize ">2 AND <4" to "3”.

Sargs would be the preferred (therefore canonical) form for any AND or OR list that has more than one comparison on the same operand (e.g. "x > 3 AND x < 17 AND x <> 10”).
 
This proposal would subsume IN and therefore we would stop supporting IN in RexCall.

Julian

[1] https://issues.apache.org/jira/browse/CALCITE-4155 <https://issues.apache.org/jira/browse/CALCITE-4155>

[2] https://github.com/julianhyde/calcite/tree/4159-simplify <https://github.com/julianhyde/calcite/tree/4159-simplify> 

[3] https://en.wikipedia.org/wiki/Sargable <https://en.wikipedia.org/wiki/Sargable>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

> 1. It has to support range operations, not every type supports, especially for User Defined Types that are used in IN constant lists, they might only support equal operation.

If types are not ordered you can still create lists of points.

We’d just create a Comparable wrapper that uses the natural order of the UDT.

As you know, sargs very closely related to b-tree indexes; a sarg is basically the definition of a b-tree index scan. Do databases prevent you from building indexes on non-ordered types? Of course not.

> 2. The range optimization of ">2 AND <4" to "3” is only valid for integer-like types, but I don't think it will bring much gains. Optimizing  col in (1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very limited number of disjoint ranges, but this is a very limited corner case, most likely we may end up with many disjoint ranges, especially when there are 10k values. I don't think it is worth doing SARG for IN for this sake.

What is the cost that you are worried about? I believe that a Guava ImmutableRangeSet containing 10k point ranges will use less memory than a RexCall(IN, RexInputRef, RexLiteral, …, RexLiteral).

It can be quickly converted back to another format (say RexCall(OR, ...)) when you need it.

> 3. For non-integer data types, like double or string, we will end up with ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring selectivity from histogram may take much longer time depends on the implementation.

Histograms are usually based on ranges, not points. So I’d expect that sargs would seem to be a better representation.

> 4. It is not extensible. It can only be used for iN or NOT IN. What about customized operators like geospatial intersect? e.g. col intersect ANY(area1, area2, area3)

Sure, the approach has its limits. But it goes quite a lot further than IN, which is just a list of points, at very low cost.

Areas that have not-totally-ordered data types tend to have their own operators already. For your example I’d write ST_Intersects(col, ST_Collect(area1, area2, area3)), and that would be a perfectly suitable representation in RexNode-land.

Julian

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

There is already an issue requesting RangeSet taken out of beta: https://github.com/google/guava/issues/3376 <https://github.com/google/guava/issues/3376> 

> On Sep 29, 2020, at 1:41 PM, Stamatis Zampetakis <za...@gmail.com> wrote:
> 
> If it is a big concern can't we mark our own classes/fields relying on Beta
> APIs as experimental and subject to change?
> 
> In [1] they also say the following about Beta APIs:
> 
> "All this said, @Beta features tend to remain relatively stable. If we
> decide to delete a @Beta feature, we will typically deprecate it for one
> release before deleting it.
> 
> On the other hand, if you want something taken out of @Beta, file an issue.
> We generally promote features out of @Beta only when it's specifically
> requested, so if you don't ask, it won't happen."
> 
> So in the meantime let's also request them to promote the respective APIs
> from beta.
> 
> Best,
> Stamatis
> 
> [1] https://github.com/google/guava/wiki/PhilosophyExplained
> 
> On Tue, Sep 29, 2020 at 9:31 PM Julian Hyde <jh...@gmail.com> wrote:
> 
>> For the record, the Druid adapter has used RangeSet for a long while, and
>> it made sense, because Druid was doing tricky computations on date ranges.
>> Introducing Sargs brought that style to other parts of Calcite.
>> 
>> If someone was to build an adapter similar to the Druid adapter, based on
>> 1.26, externally to Calcite, they probably would not have to depend on
>> RangeSet because Calcite’s Sarg class would do the rewrites that they need.
>> 
>> Julian
>> 
>> 
>>> On Sep 29, 2020, at 12:24 PM, Vladimir Sitnikov <
>> sitnikov.vladimir@gmail.com> wrote:
>>> 
>>> Julian>The vast majority of clients who use Sarg (or expressions that
>>> contain Sarg) will not reference RangeSet
>>> Julian> directly and therefore would not be impacted. So I think it’s an
>>> acceptable risk.
>>> 
>>> Well, it is hard to tell, however, I know Druid is using Sargs, and, I
>>> guess, Druid is one among the very least tested adapters.
>>> See
>>> 
>> https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699
>>> 
>>> In other words, Druid adapter proves Calcite forces clients to use
>>> Guava's @Beta API :(
>>> 
>>> Vladimir
>> 
>>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Stamatis Zampetakis <za...@gmail.com>.

If it is a big concern can't we mark our own classes/fields relying on Beta
APIs as experimental and subject to change?

In [1] they also say the following about Beta APIs:

"All this said, @Beta features tend to remain relatively stable. If we
decide to delete a @Beta feature, we will typically deprecate it for one
release before deleting it.

On the other hand, if you want something taken out of @Beta, file an issue.
We generally promote features out of @Beta only when it's specifically
requested, so if you don't ask, it won't happen."

So in the meantime let's also request them to promote the respective APIs
from beta.

Best,
Stamatis

[1] https://github.com/google/guava/wiki/PhilosophyExplained

On Tue, Sep 29, 2020 at 9:31 PM Julian Hyde <jh...@gmail.com> wrote:

> For the record, the Druid adapter has used RangeSet for a long while, and
> it made sense, because Druid was doing tricky computations on date ranges.
> Introducing Sargs brought that style to other parts of Calcite.
>
> If someone was to build an adapter similar to the Druid adapter, based on
> 1.26, externally to Calcite, they probably would not have to depend on
> RangeSet because Calcite’s Sarg class would do the rewrites that they need.
>
> Julian
>
>
> > On Sep 29, 2020, at 12:24 PM, Vladimir Sitnikov <
> sitnikov.vladimir@gmail.com> wrote:
> >
> > Julian>The vast majority of clients who use Sarg (or expressions that
> > contain Sarg) will not reference RangeSet
> > Julian> directly and therefore would not be impacted. So I think it’s an
> > acceptable risk.
> >
> > Well, it is hard to tell, however, I know Druid is using Sargs, and, I
> > guess, Druid is one among the very least tested adapters.
> > See
> >
> https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699
> >
> > In other words, Druid adapter proves Calcite forces clients to use
> > Guava's @Beta API :(
> >
> > Vladimir
>
>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

For the record, the Druid adapter has used RangeSet for a long while, and it made sense, because Druid was doing tricky computations on date ranges. Introducing Sargs brought that style to other parts of Calcite.

If someone was to build an adapter similar to the Druid adapter, based on 1.26, externally to Calcite, they probably would not have to depend on RangeSet because Calcite’s Sarg class would do the rewrites that they need.

Julian

> On Sep 29, 2020, at 12:24 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Julian>The vast majority of clients who use Sarg (or expressions that
> contain Sarg) will not reference RangeSet
> Julian> directly and therefore would not be impacted. So I think it’s an
> acceptable risk.
> 
> Well, it is hard to tell, however, I know Druid is using Sargs, and, I
> guess, Druid is one among the very least tested adapters.
> See
> https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699
> 
> In other words, Druid adapter proves Calcite forces clients to use
> Guava's @Beta API :(
> 
> Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Vladimir Sitnikov <si...@gmail.com>.

Julian>The vast majority of clients who use Sarg (or expressions that
contain Sarg) will not reference RangeSet
Julian> directly and therefore would not be impacted. So I think it’s an
acceptable risk.

Well, it is hard to tell, however, I know Druid is using Sargs, and, I
guess, Druid is one among the very least tested adapters.
See
https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699

In other words, Druid adapter proves Calcite forces clients to use
Guava's @Beta API :(

Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

> On Sep 29, 2020, at 11:40 AM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Julian>Suppose - worst case scenario - that Guava 30 removes RangeSet. The
> solution would be for us to copy RangeSet and enough dependent classes into
> Calcite to serve our needs
> 
> Then all the clients would have to adjust packages because we can't copy
> Guava code and keep Guava's package names.

That is a valid concern. RangeSet used in a public field and a public method in class Sarg.

The vast majority of clients who use Sarg (or expressions that contain Sarg) will not reference RangeSet directly and therefore would not be impacted. So I think it’s an acceptable risk.

Julian

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Vladimir Sitnikov <si...@gmail.com>.

Julian>Suppose - worst case scenario - that Guava 30 removes RangeSet. The
solution would be for us to copy RangeSet and enough dependent classes into
Calcite to serve our needs

Then all the clients would have to adjust packages because we can't copy
Guava code and keep Guava's package names.

Julian>I knew the risks going in, and I’m not very concerned

I see Guava adjusted RangeSet#toString(). The change like that triggers
lots of plan representation changes :-(

Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

I knew the risks going in, and I’m not very concerned. Suppose - worst case scenario - that Guava 30 removes RangeSet. The solution would be for us to copy RangeSet and enough dependent classes into Calcite to serve our needs. Our support for Guava 30 would be delayed by a month or two.

Sarg is a powerful abstraction for planning queries, and RangeSet is a good library to implement it on. If it hadn’t been in Guava, we would have had to implement something like it ourselves. (In fact, I recall one Calcite PR that basically had its own implementation of RangeSet.)

Julian

> On Sep 29, 2020, at 7:44 AM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Apparently, RangeSet is a @Beta API in Guava, and they stress that @Beta
> APIs should not be used in libraries.
> 
> I suggest we do something with that, otherwise, it results in a case where
> Calcite enforces all the consumers to depend on @Beta API which
> might disappear or drift.
> 
> See https://github.com/google/guava/wiki/PhilosophyExplained#beta-apis
> 
> It is literally the very first "important warning":
> 
> https://github.com/google/guava#important-warnings
> 
> 1, APIs marked with the @Beta annotation at the class or method level are
> subject to change.
> They can be modified in any way, or even removed, at any time. If your
> code is a library itself
> (i.e., it is used on the CLASSPATH of users outside your own control),
> you should not use beta APIs unless you repackage them. If your code is a
> library,
> we strongly recommend using the Guava Beta Checker to ensure that you do
> not use any @Beta APIs!
> 
> Should we revert RangeSets?
> Should we ask Guava to promote it from @Beta?
> 
> Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Vladimir Sitnikov <si...@gmail.com>.

Apparently, RangeSet is a @Beta API in Guava, and they stress that @Beta
APIs should not be used in libraries.

I suggest we do something with that, otherwise, it results in a case where
Calcite enforces all the consumers to depend on @Beta API which
might disappear or drift.

See https://github.com/google/guava/wiki/PhilosophyExplained#beta-apis

It is literally the very first "important warning":

https://github.com/google/guava#important-warnings

1, APIs marked with the @Beta annotation at the class or method level are
subject to change.
 They can be modified in any way, or even removed, at any time. If your
code is a library itself
(i.e., it is used on the CLASSPATH of users outside your own control),
you should not use beta APIs unless you repackage them. If your code is a
library,
we strongly recommend using the Guava Beta Checker to ensure that you do
not use any @Beta APIs!

Should we revert RangeSets?
Should we ask Guava to promote it from @Beta?

Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Chunwei Lei <ch...@gmail.com>.

Hi, Julian.

I would like to review it.


Best,
Chunwei


On Sat, Aug 29, 2020 at 5:19 AM Julian Hyde <jh...@gmail.com> wrote:

> I now have a PR for this change. Can some people review?
>
> https://github.com/apache/calcite/pull/2124 <
> https://github.com/apache/calcite/pull/2124>
>
> There are legitimate concerns that we are adding a new RexCall operator
> (SEARCH) and there is a considerable effort to support it everywhere. But
> we are removing two RexCall operators (IN-list and BETWEEN), because SEARCH
> subsumes them.
>
> The benefit is that we can do ‘range algebra’ simplifications without
> resorting to error-prone Boolean logic simplifications as often as we do
> today.
>
> I have added utilities to introduce SEARCH into SEARCH-less RexNode
> expressions, and to remove SEARCH from RexNode expressions, and to convert
> RexNode expressions that contain SEARCH directly into SqlNode expressions.
>
> In several places, I just expanded SEARCH, but it would be better to
> handle SEARCH explicitly, as if it were a comparison operator (along with
> =, <, <= etc.) It might be worth revisiting the Mongo, Geode and Druid
> adapters. Geode has an ‘IS SET’ construct, very similar to SQL ‘IN-list’,
> that would easier to recognize on SEARCH than on OR.
>
> Julian
>
>
>
>
> > On Aug 13, 2020, at 12:48 PM, Julian Hyde <jh...@gmail.com>
> wrote:
> >
> > I have logged https://issues.apache.org/jira/browse/CALCITE-4173 <
> https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a
> prototype.
> >
> > See comments inline, below.
> >
> >> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <hyuan@apache.org <mailto:
> hyuan@apache.org>> wrote:
> >>
> >>> I am proposing to use sargs only where we would today use RexCall(IN).
> The data structures have about the same size. Sargs are sorted. I cannot
> see any cases that would become more expensive.
> >>
> >> IIRC, RexCall(IN) is not supported yet.
> >
> > It’s not officially supported. But it is there - added in
> https://issues.apache.org/jira/browse/CALCITE-2444 <
> https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql
> part of the process, and people seem to be using it elsewhere, e.g.
> https://issues.apache.org/jira/browse/CALCITE-1413 <
> https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.
> >
> > I have not used RexCall.IN personally. I think there’s too much cost (in
> making all places aware of the IN operator) and not enough benefit (because
> it cannot handle ranges or <>).
> >
> > One part of the cost is null-semantics. If the values are not literals,
> one of them might evaluate to NULL. So we have to be very conservative,
> especially when optimizing “NOT IN (list)”. There are no tests for that
> simplification, so we’re probably doing it wrong.
> >
> > So, I propose to remove IN.
> >
> >> With sargs, you have to sort it, no matter explicitly or implicitly, in
> case 20k values, it will take some time.
> >
> > I am not too concerned about the cost of sorting. If you have parsed and
> validated an IN list with 20k items, the query has already burned a few CPU
> cycles.
> >
> > The Sarg data structure is immutable and one instance should suffice for
> the whole planning process, so the cost of sorting on creation is more than
> repaid by the faster access later on.
> >
> >> I am not saying the data structure itself is expensive, but how it is
> used, it all depends on how downstream projects derive stats from it. I
> have seen Greenplum optimizer experienced performance issue for large IN
> list when deriving all the stats info from the thousands of disjoint range
> intervals. Some downstream project may not even use histogram. Some may
> just use the number of IN constants to derive the number of distinct
> values, I would imagine people have to construct the Sarg back to constant
> list, this is an extra burden.
> >
> > If there are performance issues, we can add extra operations to the Sarg
> data structure without changing its size/cost. For example, record whether
> all the ranges are points. And if so, efficiently iterate over all the
> values.
> >
> >> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24,
> 24+2.....) 10k values, now the range sets will be {[1,2], [4,4],
> ....[10,10], [12,13], [15,15]....[21,22], [24,24]....}, in the execution
> engine, what if I want to do the look up for this expression? Find out the
> points, extract ranges and disjoint them, or infer points from integer
> ranges?
> >
> > You should transcribe the Sarg into a data structure that makes sense
> for your execution engine. Say a bloom filter, a compressed bitmap, or a
> perfect hash.
> >
> >> IMHO, the interval constraint is more like a kind of logical property
> and can be strategy for simplification, people can derive the interval
> constraints from the IN list, do simplification on it (but I doubt the
> value in practice, which usually brings more trouble than benefits), but we
> don't have to represent it with sarg in RexNode world.
> >
> > I agree that it should be a logical property. It should be part of
> RelOptPredicateList. But to apply those predicates efficiently, we need one
> predicate that searches in a range rather than 20k predicates.
> >
> >> I am not asking sargs to support geospatial operations. And geospatial
> intersect is just an example of potential customized operation, which are
> not limited to geospatials. It can be some other binary operations. We
> can't require all the customized operation to have something like
> ST_Collect.
> >
> > Sure, there are other techniques. One of which is generating rasters
> (e.g. if you are doing a spatial join of states to national parks,
> pre-generate for each national park a bitmap of the all 100x100km squares,
> and put a 1 in the square if it overlaps the park). Rasters are kind of
> similar to sargs.
> >
> > I have been working on space-filling curves as indexes for
> point-to-shape joins, as part of
> https://issues.apache.org/jira/browse/CALCITE-1861 <
> https://issues.apache.org/jira/browse/CALCITE-1861>. In
> https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966
> <
> https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966>
> see how we transform a call to ST_DWithin to the predicate
> >
> > LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)),
> AND(>=($4, 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)),
> AND(>=($4, 33236), <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)),
> AND(>=($4, 33112), <=($4, 33156)), AND(>=($4, 33092), <=($4, 33100)),
> AND(>=($4, 33055), <=($4, 33080)), AND(>=($4, 33050), <=($4, 33053)),
> AND(>=($4, 33033), <=($4, 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY,
> ST_POINT($1, $2), 6))])
> >
> > The first part of the condition is precisely a Sarg - a set of integer
> ranges. If represented an a set of points, it would be much larger.
> >
> >> Oracle support this:
> >> select * from foo where a > ANY(1, 2, 3, ....., 999);
> >> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
> >> Although in reality that kind of query might be rare, how would Calcite
> represent in Oracle dialect? And would it be good to support operations
> other than equal, like <, >=, or even customized operation? Especially when
> downstream projects may use calcite to build their own in-house system,
> some are not sql standard.
> >
> > I think of ANY and ALL - especially on constant lists, as opposed to
> queries - as syntactic sugar.
> >
> > The first one can be represented as a Sarg:
> >
> >   SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf))
> >
> > And the Sarg would optimize itself, on construction, to a single range,
> SARG((1, inf)). This illustrates the power of Sargs - they are simple
> mathematical objects, they are in a canonical form, and there are only a
> few operations. It would take thought to optimize
> >
> >   a > ANY (1, 2, 3, …, 9)
> >
> > to
> >
> >   a > 1
> >
> > but Sarg does it automatically.
> >
> > The second query cannot be represented as a Sarg, because it is not a
> constant.
> >
> >> BTW, what is your concern about the proposal of RexListCmp [1]?
> >
> > RexListCmp is certainly moving in the same direction as Sarg - a
> generalized IN-list. But it is less elegant - it is not clear to me what
> are the fundamental operations on a RexListCmp.
> >
> > The ability to choose your operator adds complexity but little power.
> The “generalized IN lists” that I see in plans are better modeled by a
> mixture of operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than
> having one operator throughout the list.
> >
> > I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax
> and I don’t think it’s very common. It can be expanded at compile time to
> an OR, and that OR can be converted to a Sarg if list consists only of
> constant expressions. I think that is a good solution to the common case.
> >
> > Julian
> >
>
>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

I now have a PR for this change. Can some people review?

https://github.com/apache/calcite/pull/2124 <https://github.com/apache/calcite/pull/2124>

There are legitimate concerns that we are adding a new RexCall operator (SEARCH) and there is a considerable effort to support it everywhere. But we are removing two RexCall operators (IN-list and BETWEEN), because SEARCH subsumes them.

The benefit is that we can do ‘range algebra’ simplifications without resorting to error-prone Boolean logic simplifications as often as we do today.

I have added utilities to introduce SEARCH into SEARCH-less RexNode expressions, and to remove SEARCH from RexNode expressions, and to convert RexNode expressions that contain SEARCH directly into SqlNode expressions.

In several places, I just expanded SEARCH, but it would be better to handle SEARCH explicitly, as if it were a comparison operator (along with =, <, <= etc.) It might be worth revisiting the Mongo, Geode and Druid adapters. Geode has an ‘IS SET’ construct, very similar to SQL ‘IN-list’, that would easier to recognize on SEARCH than on OR.

Julian




> On Aug 13, 2020, at 12:48 PM, Julian Hyde <jh...@gmail.com> wrote:
> 
> I have logged https://issues.apache.org/jira/browse/CALCITE-4173 <https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a prototype.
> 
> See comments inline, below.
> 
>> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <hyuan@apache.org <ma...@apache.org>> wrote:
>> 
>>> I am proposing to use sargs only where we would today use RexCall(IN). The data structures have about the same size. Sargs are sorted. I cannot see any cases that would become more expensive.
>> 
>> IIRC, RexCall(IN) is not supported yet.
> 
> It’s not officially supported. But it is there - added in https://issues.apache.org/jira/browse/CALCITE-2444 <https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql part of the process, and people seem to be using it elsewhere, e.g. https://issues.apache.org/jira/browse/CALCITE-1413 <https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.
> 
> I have not used RexCall.IN personally. I think there’s too much cost (in making all places aware of the IN operator) and not enough benefit (because it cannot handle ranges or <>).
> 
> One part of the cost is null-semantics. If the values are not literals, one of them might evaluate to NULL. So we have to be very conservative, especially when optimizing “NOT IN (list)”. There are no tests for that simplification, so we’re probably doing it wrong.
> 
> So, I propose to remove IN.
> 
>> With sargs, you have to sort it, no matter explicitly or implicitly, in case 20k values, it will take some time.
> 
> I am not too concerned about the cost of sorting. If you have parsed and validated an IN list with 20k items, the query has already burned a few CPU cycles.
> 
> The Sarg data structure is immutable and one instance should suffice for the whole planning process, so the cost of sorting on creation is more than repaid by the faster access later on.
> 
>> I am not saying the data structure itself is expensive, but how it is used, it all depends on how downstream projects derive stats from it. I have seen Greenplum optimizer experienced performance issue for large IN list when deriving all the stats info from the thousands of disjoint range intervals. Some downstream project may not even use histogram. Some may just use the number of IN constants to derive the number of distinct values, I would imagine people have to construct the Sarg back to constant list, this is an extra burden. 
> 
> If there are performance issues, we can add extra operations to the Sarg data structure without changing its size/cost. For example, record whether all the ranges are points. And if so, efficiently iterate over all the values.
> 
>> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.....) 10k values, now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], [15,15]....[21,22], [24,24]....}, in the execution engine, what if I want to do the look up for this expression? Find out the points, extract ranges and disjoint them, or infer points from integer ranges?
> 
> You should transcribe the Sarg into a data structure that makes sense for your execution engine. Say a bloom filter, a compressed bitmap, or a perfect hash.
> 
>> IMHO, the interval constraint is more like a kind of logical property and can be strategy for simplification, people can derive the interval constraints from the IN list, do simplification on it (but I doubt the value in practice, which usually brings more trouble than benefits), but we don't have to represent it with sarg in RexNode world.
> 
> I agree that it should be a logical property. It should be part of RelOptPredicateList. But to apply those predicates efficiently, we need one predicate that searches in a range rather than 20k predicates.
> 
>> I am not asking sargs to support geospatial operations. And geospatial intersect is just an example of potential customized operation, which are not limited to geospatials. It can be some other binary operations. We can't require all the customized operation to have something like ST_Collect. 
> 
> Sure, there are other techniques. One of which is generating rasters (e.g. if you are doing a spatial join of states to national parks, pre-generate for each national park a bitmap of the all 100x100km squares, and put a 1 in the square if it overlaps the park). Rasters are kind of similar to sargs.
> 
> I have been working on space-filling curves as indexes for point-to-shape joins, as part of https://issues.apache.org/jira/browse/CALCITE-1861 <https://issues.apache.org/jira/browse/CALCITE-1861>. In https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966 <https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966> see how we transform a call to ST_DWithin to the predicate 
> 
> LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)), AND(>=($4, 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)), AND(>=($4, 33236), <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)), AND(>=($4, 33112), <=($4, 33156)), AND(>=($4, 33092), <=($4, 33100)), AND(>=($4, 33055), <=($4, 33080)), AND(>=($4, 33050), <=($4, 33053)), AND(>=($4, 33033), <=($4, 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY, ST_POINT($1, $2), 6))])
> 
> The first part of the condition is precisely a Sarg - a set of integer ranges. If represented an a set of points, it would be much larger.
> 
>> Oracle support this:
>> select * from foo where a > ANY(1, 2, 3, ....., 999);
>> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
>> Although in reality that kind of query might be rare, how would Calcite represent in Oracle dialect? And would it be good to support operations other than equal, like <, >=, or even customized operation? Especially when downstream projects may use calcite to build their own in-house system, some are not sql standard.
> 
> I think of ANY and ALL - especially on constant lists, as opposed to queries - as syntactic sugar.
> 
> The first one can be represented as a Sarg:
> 
>   SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf))
> 
> And the Sarg would optimize itself, on construction, to a single range, SARG((1, inf)). This illustrates the power of Sargs - they are simple mathematical objects, they are in a canonical form, and there are only a few operations. It would take thought to optimize
> 
>   a > ANY (1, 2, 3, …, 9)
> 
> to
> 
>   a > 1
> 
> but Sarg does it automatically.
> 
> The second query cannot be represented as a Sarg, because it is not a constant.
> 
>> BTW, what is your concern about the proposal of RexListCmp [1]?
> 
> RexListCmp is certainly moving in the same direction as Sarg - a generalized IN-list. But it is less elegant - it is not clear to me what are the fundamental operations on a RexListCmp.
> 
> The ability to choose your operator adds complexity but little power. The “generalized IN lists” that I see in plans are better modeled by a mixture of operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than having one operator throughout the list.
> 
> I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax and I don’t think it’s very common. It can be expanded at compile time to an OR, and that OR can be converted to a Sarg if list consists only of constant expressions. I think that is a good solution to the common case.
> 
> Julian
>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

I have logged https://issues.apache.org/jira/browse/CALCITE-4173 <https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a prototype.

See comments inline, below.

> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <hy...@apache.org> wrote:
> 
>> I am proposing to use sargs only where we would today use RexCall(IN). The data structures have about the same size. Sargs are sorted. I cannot see any cases that would become more expensive.
> 
> IIRC, RexCall(IN) is not supported yet.

It’s not officially supported. But it is there - added in https://issues.apache.org/jira/browse/CALCITE-2444 <https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql part of the process, and people seem to be using it elsewhere, e.g. https://issues.apache.org/jira/browse/CALCITE-1413 <https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.

I have not used RexCall.IN personally. I think there’s too much cost (in making all places aware of the IN operator) and not enough benefit (because it cannot handle ranges or <>).

One part of the cost is null-semantics. If the values are not literals, one of them might evaluate to NULL. So we have to be very conservative, especially when optimizing “NOT IN (list)”. There are no tests for that simplification, so we’re probably doing it wrong.

So, I propose to remove IN.

> With sargs, you have to sort it, no matter explicitly or implicitly, in case 20k values, it will take some time.

I am not too concerned about the cost of sorting. If you have parsed and validated an IN list with 20k items, the query has already burned a few CPU cycles.

The Sarg data structure is immutable and one instance should suffice for the whole planning process, so the cost of sorting on creation is more than repaid by the faster access later on.

> I am not saying the data structure itself is expensive, but how it is used, it all depends on how downstream projects derive stats from it. I have seen Greenplum optimizer experienced performance issue for large IN list when deriving all the stats info from the thousands of disjoint range intervals. Some downstream project may not even use histogram. Some may just use the number of IN constants to derive the number of distinct values, I would imagine people have to construct the Sarg back to constant list, this is an extra burden. 

If there are performance issues, we can add extra operations to the Sarg data structure without changing its size/cost. For example, record whether all the ranges are points. And if so, efficiently iterate over all the values.

> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.....) 10k values, now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], [15,15]....[21,22], [24,24]....}, in the execution engine, what if I want to do the look up for this expression? Find out the points, extract ranges and disjoint them, or infer points from integer ranges?

You should transcribe the Sarg into a data structure that makes sense for your execution engine. Say a bloom filter, a compressed bitmap, or a perfect hash.

> IMHO, the interval constraint is more like a kind of logical property and can be strategy for simplification, people can derive the interval constraints from the IN list, do simplification on it (but I doubt the value in practice, which usually brings more trouble than benefits), but we don't have to represent it with sarg in RexNode world.

I agree that it should be a logical property. It should be part of RelOptPredicateList. But to apply those predicates efficiently, we need one predicate that searches in a range rather than 20k predicates.

> I am not asking sargs to support geospatial operations. And geospatial intersect is just an example of potential customized operation, which are not limited to geospatials. It can be some other binary operations. We can't require all the customized operation to have something like ST_Collect. 

Sure, there are other techniques. One of which is generating rasters (e.g. if you are doing a spatial join of states to national parks, pre-generate for each national park a bitmap of the all 100x100km squares, and put a 1 in the square if it overlaps the park). Rasters are kind of similar to sargs.

I have been working on space-filling curves as indexes for point-to-shape joins, as part of https://issues.apache.org/jira/browse/CALCITE-1861 <https://issues.apache.org/jira/browse/CALCITE-1861>. In https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966 <https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966> see how we transform a call to ST_DWithin to the predicate 

LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)), AND(>=($4, 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)), AND(>=($4, 33236), <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)), AND(>=($4, 33112), <=($4, 33156)), AND(>=($4, 33092), <=($4, 33100)), AND(>=($4, 33055), <=($4, 33080)), AND(>=($4, 33050), <=($4, 33053)), AND(>=($4, 33033), <=($4, 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY, ST_POINT($1, $2), 6))])

The first part of the condition is precisely a Sarg - a set of integer ranges. If represented an a set of points, it would be much larger.

> Oracle support this:
> select * from foo where a > ANY(1, 2, 3, ....., 999);
> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
> Although in reality that kind of query might be rare, how would Calcite represent in Oracle dialect? And would it be good to support operations other than equal, like <, >=, or even customized operation? Especially when downstream projects may use calcite to build their own in-house system, some are not sql standard.

I think of ANY and ALL - especially on constant lists, as opposed to queries - as syntactic sugar.

The first one can be represented as a Sarg:

  SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf))

And the Sarg would optimize itself, on construction, to a single range, SARG((1, inf)). This illustrates the power of Sargs - they are simple mathematical objects, they are in a canonical form, and there are only a few operations. It would take thought to optimize

  a > ANY (1, 2, 3, …, 9)

to

  a > 1

but Sarg does it automatically.

The second query cannot be represented as a Sarg, because it is not a constant.

> BTW, what is your concern about the proposal of RexListCmp [1]?

RexListCmp is certainly moving in the same direction as Sarg - a generalized IN-list. But it is less elegant - it is not clear to me what are the fundamental operations on a RexListCmp.

The ability to choose your operator adds complexity but little power. The “generalized IN lists” that I see in plans are better modeled by a mixture of operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than having one operator throughout the list.

I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax and I don’t think it’s very common. It can be expanded at compile time to an OR, and that OR can be converted to a Sarg if list consists only of constant expressions. I think that is a good solution to the common case.

Julian

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Haisheng Yuan <hy...@apache.org>.

> I am proposing to use sargs only where we would today use RexCall(IN). The data structures have about the same size. Sargs are sorted. I cannot see any cases that would become more expensive.

IIRC, RexCall(IN) is not supported yet.

With sargs, you have to sort it, no matter explicitly or implicitly, in case 20k values, it will take some time. I am not saying the data structure itself is expensive, but how it is used, it all depends on how downstream projects derive stats from it. I have seen Greenplum optimizer experienced performance issue for large IN list when deriving all the stats info from the thousands of disjoint range intervals. Some downstream project may not even use histogram. Some may just use the number of IN constants to derive the number of distinct values, I would imagine people have to construct the Sarg back to constant list, this is an extra burden. 

Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.....) 10k values, now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], [15,15]....[21,22], [24,24]....}, in the execution engine, what if I want to do the look up for this expression? Find out the points, extract ranges and disjoint them, or infer points from integer ranges?

IMHO, the interval constraint is more like a kind of logical property and can be strategy for simplification, people can derive the interval constraints from the IN list, do simplification on it (but I doubt the value in practice, which usually brings more trouble than benefits), but we don't have to represent it with sarg in RexNode world.

> Requiring sargs to support geospatial operations seems an unreasonably high bar. Many techniques that we use (sorting, hashing, sketches) do not support geospatial.
> Areas that have not-totally-ordered data types tend to have their own operators already. For your example I’d write ST_Intersects(col, ST_Collect(area1, area2, area3)), and that would be a perfectly suitable representation in RexNode-land.

I am not asking sargs to support geospatial operations. And geospatial intersect is just an example of potential customized operation, which are not limited to geospatials. It can be some other binary operations. We can't require all the customized operation to have something like ST_Collect. 

Oracle support this:
select * from foo where a > ANY(1, 2, 3, ....., 999);
select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
Although in reality that kind of query might be rare, how would Calcite represent in Oracle dialect? And would it be good to support operations other than equal, like <, >=, or even customized operation? Especially when downstream projects may use calcite to build their own in-house system, some are not sql standard. 

BTW, what is your concern about the proposal of RexListCmp [1]?

Haisheng

[1] https://lists.apache.org/thread.html/ra36f256e3f86f78ee126455b3ba67dbcb7bd1862214886c65ffba432%40%3Cdev.calcite.apache.org%3E

On 2020/08/11 16:44:56, Julian Hyde <jh...@gmail.com> wrote: 
> Let’s bring this discussion back to the original question: for these kinds of predicates, is Sarg a better data structure than IN?
> 
> It absolutely is. There are algorithms in RexSimplify that, for each term in an OR list, simplify using all available predicates, and when they have simplified that term, add it to the list of predicates. It is therefore cartesian in the size of the OR list.
> 
> The sarg data structure converts many kinds of large OR-lists and predicate sets into a single term. Because that term is immutable and based on sorted ranges (or points) it can be handled efficiently (e.g. two sargs can be intersected using a merge, rather than nested loops). That will make our simplification process more efficient.
> 
> Whether we then add new classes of optimization is a discussion for another day.
> 
> Julian
> 
> 
> > On Aug 10, 2020, at 1:13 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> > 
> > Julian>I cannot see any cases that would become more expensive.
> > 
> > I mean the optimization passes might be complicated, not the storage of
> > sargs themselves.
> > For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
> > 5
> > Is it a significant improvement?
> > Is between representation always better?
> > Does the case appear very often in practice?
> > 
> > However, the optimization does take time, so it looks like extra logic with
> > doubtful gains.
> > Of course, it is unlikely sargs would be a major time consumer, however, if
> > we keep adding tiny simplifications we might
> > end up with a condition where the planning is slow and we have no way to
> > fix it.
> > 
> > I'm ok with people updating RexSimplify (and 4155 looks like an innocent
> > feature), however, I think the current design is not really scalable
> > (e.g. it might process the same expression multiple times).
> > 
> > Vladimir
> 
>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

Let’s bring this discussion back to the original question: for these kinds of predicates, is Sarg a better data structure than IN?

It absolutely is. There are algorithms in RexSimplify that, for each term in an OR list, simplify using all available predicates, and when they have simplified that term, add it to the list of predicates. It is therefore cartesian in the size of the OR list.

The sarg data structure converts many kinds of large OR-lists and predicate sets into a single term. Because that term is immutable and based on sorted ranges (or points) it can be handled efficiently (e.g. two sargs can be intersected using a merge, rather than nested loops). That will make our simplification process more efficient.

Whether we then add new classes of optimization is a discussion for another day.

Julian

> On Aug 10, 2020, at 1:13 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> Julian>I cannot see any cases that would become more expensive.
> 
> I mean the optimization passes might be complicated, not the storage of
> sargs themselves.
> For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
> 5
> Is it a significant improvement?
> Is between representation always better?
> Does the case appear very often in practice?
> 
> However, the optimization does take time, so it looks like extra logic with
> doubtful gains.
> Of course, it is unlikely sargs would be a major time consumer, however, if
> we keep adding tiny simplifications we might
> end up with a condition where the planning is slow and we have no way to
> fix it.
> 
> I'm ok with people updating RexSimplify (and 4155 looks like an innocent
> feature), however, I think the current design is not really scalable
> (e.g. it might process the same expression multiple times).
> 
> Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Vladimir Sitnikov <si...@gmail.com>.

Julian>I cannot see any cases that would become more expensive.

I mean the optimization passes might be complicated, not the storage of
sargs themselves.
For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
5
Is it a significant improvement?
Is between representation always better?
Does the case appear very often in practice?

However, the optimization does take time, so it looks like extra logic with
doubtful gains.
Of course, it is unlikely sargs would be a major time consumer, however, if
we keep adding tiny simplifications we might
end up with a condition where the planning is slow and we have no way to
fix it.

I'm ok with people updating RexSimplify (and 4155 looks like an innocent
feature), however, I think the current design is not really scalable
(e.g. it might process the same expression multiple times).

Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

> Vladimir wrote:
> 
> 5. The optimization of disjoint conditions might take significant
> optimization time with insignificant gains.

Can you give an example?

I am proposing to use sargs only where we would today use RexCall(IN). The data structures have about the same size. Sargs are sorted. I cannot see any cases that would become more expensive.

> Sargs might be helpful as databases often need to implement "integer/string
> index range scan", however, sargs do not seem to suit well for `col
> intersect any(...)` and similar cases.

Requiring sargs to support geospatial operations seems an unreasonably high bar. Many techniques that we use (sorting, hashing, sketches) do not support geospatial.

> I remember Calcite sources did include sargs package, however, it was
> removed for some reason.

That is correct. We once had b-tree indexes and column store; when we removed these, sargs were no longer used. Guava RangeSet is superior to the sargs classes we threw away.

Julian

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Vladimir Sitnikov <si...@gmail.com>.

+1 to Haisheng.

5. The optimization of disjoint conditions might take significant
optimization time with insignificant gains.

Sargs might be helpful as databases often need to implement "integer/string
index range scan", however, sargs do not seem to suit well for `col
intersect any(...)` and similar cases.

I remember Calcite sources did include sargs package, however, it was
removed for some reason.

Vladimir

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Haisheng Yuan <hy...@apache.org>.

I am against the proposal, for the following reasons:

1. It has to support range operations, not every type supports, especially for User Defined Types that are used in IN constant lists, they might only support equal operation.

2. The range optimization of ">2 AND <4" to "3” is only valid for integer-like types, but I don't think it will bring much gains. Optimizing  col in (1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very limited number of disjoint ranges, but this is a very limited corner case, most likely we may end up with many disjoint ranges, especially when there are 10k values. I don't think it is worth doing SARG for IN for this sake.

3. For non-integer data types, like double or string, we will end up with ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring selectivity from histogram may take much longer time depends on the implementation. And when passing to executor, we have to transform it back to form of col = ANY(a,b,c) or col IN (a,b,c) to be able to utilize hash look up.

4. It is not extensible. It can only be used for iN or NOT IN. What about customized operators like geospatial intersect? e.g. col intersect ANY(area1, area2, area3)

Haisheng

On 2020/08/10 16:15:56, Michael Mior <mm...@apache.org> wrote: 
> The simplifications you present would also of course assume that the
> column has integer type. In any case, overall the proposal seems solid
> to me.
> 
> --
> Michael Mior
> mmior@apache.org
> 
> Le lun. 10 août 2020 à 04:21, Fan Liya <li...@gmail.com> a écrit :
> >
> > Hi Julian,
> >
> > Thanks for opening the discussion.
> >
> > In general, I am convinced of the value and importance of the proposal.
> >
> > I want to discuss more about the algebra involved, as the simplified range
> > sets may not have the minimum computational costs.
> > Some examples:
> >
> > 1. Suppose we have expression x in (1, 2, 4, 5, 6).
> >
> > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> > 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
> > Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
> > which represents 5 comparisons and 4 logical operations.
> > It's hard to say which one has a smaller computational cost.
> >
> > 2. Suppose we have expression x in (1, 2, 4, 5)
> > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> > 4 and x <= 5).
> > Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
> > I am not sure if the range set implementation supports this type of
> > simplification.
> >
> > Therefore, to address above problems,
> > 1. We may need a model to estimate the cost, and the simplification process
> > should be guided by the cost model.
> > 2. Maybe we need to extend the range set implementation so it produces
> > expressions with minimum computational cost.
> >
> > Best,
> > Liya Fan
> >
> >
> >
> >
> >
> > On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <jh...@gmail.com> wrote:
> >
> > > We have had several discussions over the years about how to represent
> > > IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
> > >
> > > I have generally taken the position that we should expand to ORs (e.g. “x
> > > = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> > > IN in RexCall.
> > >
> > > I have given this some further thought, as part of a couple of RexSimplify
> > > bugs [1] [2] and I now think we should replace IN with something more
> > > powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> > > of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> > > "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> > > but also ranges.
> > >
> > > Today you would write
> > >
> > >   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
> > >
> > > With sargs, you would instead write
> > >
> > >   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> > > [3..3], [5..5]]”)))
> > >
> > > There is a new operator IN_SARG, and a new node type RexSarg (it could
> > > almost be a RexLiteral).
> > >
> > > Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> > > we invest in sarg support in RexSimplify and RelOptPredicateList, that
> > > investment is likely to pay dividends in better plans.
> > >
> > > Guava RangeSets, and hence sargs, have support for discrete domains, so
> > > they can easily optimize ">2 AND <4" to "3”.
> > >
> > > Sargs would be the preferred (therefore canonical) form for any AND or OR
> > > list that has more than one comparison on the same operand (e.g. "x > 3 AND
> > > x < 17 AND x <> 10”).
> > >
> > > This proposal would subsume IN and therefore we would stop supporting IN
> > > in RexCall.
> > >
> > > Julian
> > >
> > > [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> > > https://issues.apache.org/jira/browse/CALCITE-4155>
> > >
> > > [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
> > > https://github.com/julianhyde/calcite/tree/4159-simplify>
> > >
> > > [3] https://en.wikipedia.org/wiki/Sargable <
> > > https://en.wikipedia.org/wiki/Sargable>
>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Julian Hyde <jh...@gmail.com>.

Integer is of course the main example but when I say I like “algebraic” approach of Guava range sets I mean that it operates based on the properties of data types, not any hard-coded list.

Sargs (and Guava range sets) merely require the values to be comparable (i.e. totally ordered). So they would apply to string and float types, for instance. Or lexically ordered tuples, (‘a’, 1), (‘b’, 2).

Null does not fit neatly into a total order, so we’d have to find a way to finesse expressions like ‘x between 3 and 5 or x is null’.

Sargs (and Guava range sets) are able to do additional optimizations if the domain is discrete. The optimization example I gave,  ">2 AND <4" to “3”, uses that discrete property. Integers are the main example of discrete domains, but “INTERVAL DAY” and “BOOLEAN” are other examples.

Bounded-length strings (e.g. VARCHAR(10)) are also a discrete domain, but the Guava documentation recommends against it. Maybe there are just too many values. I do think it would be worth treating constrained types, e.g.

  paymentType VARCHAR(6) CHECK (paymentType IN (‘CASH’,’CREDIT’))

as discrete domains.

Julian


> On Aug 10, 2020, at 9:15 AM, Michael Mior <mm...@apache.org> wrote:
> 
> The simplifications you present would also of course assume that the
> column has integer type. In any case, overall the proposal seems solid
> to me.
> 
> --
> Michael Mior
> mmior@apache.org
> 
> Le lun. 10 août 2020 à 04:21, Fan Liya <li...@gmail.com> a écrit :
>> 
>> Hi Julian,
>> 
>> Thanks for opening the discussion.
>> 
>> In general, I am convinced of the value and importance of the proposal.
>> 
>> I want to discuss more about the algebra involved, as the simplified range
>> sets may not have the minimum computational costs.
>> Some examples:
>> 
>> 1. Suppose we have expression x in (1, 2, 4, 5, 6).
>> 
>> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
>> 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
>> Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
>> which represents 5 comparisons and 4 logical operations.
>> It's hard to say which one has a smaller computational cost.
>> 
>> 2. Suppose we have expression x in (1, 2, 4, 5)
>> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
>> 4 and x <= 5).
>> Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
>> I am not sure if the range set implementation supports this type of
>> simplification.
>> 
>> Therefore, to address above problems,
>> 1. We may need a model to estimate the cost, and the simplification process
>> should be guided by the cost model.
>> 2. Maybe we need to extend the range set implementation so it produces
>> expressions with minimum computational cost.
>> 
>> Best,
>> Liya Fan
>> 
>> 
>> 
>> 
>> 
>> On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <jh...@gmail.com> wrote:
>> 
>>> We have had several discussions over the years about how to represent
>>> IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
>>> 
>>> I have generally taken the position that we should expand to ORs (e.g. “x
>>> = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
>>> IN in RexCall.
>>> 
>>> I have given this some further thought, as part of a couple of RexSimplify
>>> bugs [1] [2] and I now think we should replace IN with something more
>>> powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
>>> of intervals, and can be represented as a Guava ImmutableRangeSet, such as
>>> "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
>>> but also ranges.
>>> 
>>> Today you would write
>>> 
>>>  RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
>>> 
>>> With sargs, you would instead write
>>> 
>>>  RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
>>> [3..3], [5..5]]”)))
>>> 
>>> There is a new operator IN_SARG, and a new node type RexSarg (it could
>>> almost be a RexLiteral).
>>> 
>>> Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
>>> we invest in sarg support in RexSimplify and RelOptPredicateList, that
>>> investment is likely to pay dividends in better plans.
>>> 
>>> Guava RangeSets, and hence sargs, have support for discrete domains, so
>>> they can easily optimize ">2 AND <4" to "3”.
>>> 
>>> Sargs would be the preferred (therefore canonical) form for any AND or OR
>>> list that has more than one comparison on the same operand (e.g. "x > 3 AND
>>> x < 17 AND x <> 10”).
>>> 
>>> This proposal would subsume IN and therefore we would stop supporting IN
>>> in RexCall.
>>> 
>>> Julian
>>> 
>>> [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
>>> https://issues.apache.org/jira/browse/CALCITE-4155>
>>> 
>>> [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
>>> https://github.com/julianhyde/calcite/tree/4159-simplify>
>>> 
>>> [3] https://en.wikipedia.org/wiki/Sargable <
>>> https://en.wikipedia.org/wiki/Sargable>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Michael Mior <mm...@apache.org>.

The simplifications you present would also of course assume that the
column has integer type. In any case, overall the proposal seems solid
to me.

--
Michael Mior
mmior@apache.org

Le lun. 10 août 2020 à 04:21, Fan Liya <li...@gmail.com> a écrit :
>
> Hi Julian,
>
> Thanks for opening the discussion.
>
> In general, I am convinced of the value and importance of the proposal.
>
> I want to discuss more about the algebra involved, as the simplified range
> sets may not have the minimum computational costs.
> Some examples:
>
> 1. Suppose we have expression x in (1, 2, 4, 5, 6).
>
> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
> Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
> which represents 5 comparisons and 4 logical operations.
> It's hard to say which one has a smaller computational cost.
>
> 2. Suppose we have expression x in (1, 2, 4, 5)
> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> 4 and x <= 5).
> Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
> I am not sure if the range set implementation supports this type of
> simplification.
>
> Therefore, to address above problems,
> 1. We may need a model to estimate the cost, and the simplification process
> should be guided by the cost model.
> 2. Maybe we need to extend the range set implementation so it produces
> expressions with minimum computational cost.
>
> Best,
> Liya Fan
>
>
>
>
>
> On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <jh...@gmail.com> wrote:
>
> > We have had several discussions over the years about how to represent
> > IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
> >
> > I have generally taken the position that we should expand to ORs (e.g. “x
> > = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> > IN in RexCall.
> >
> > I have given this some further thought, as part of a couple of RexSimplify
> > bugs [1] [2] and I now think we should replace IN with something more
> > powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> > of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> > "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> > but also ranges.
> >
> > Today you would write
> >
> >   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
> >
> > With sargs, you would instead write
> >
> >   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> > [3..3], [5..5]]”)))
> >
> > There is a new operator IN_SARG, and a new node type RexSarg (it could
> > almost be a RexLiteral).
> >
> > Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> > we invest in sarg support in RexSimplify and RelOptPredicateList, that
> > investment is likely to pay dividends in better plans.
> >
> > Guava RangeSets, and hence sargs, have support for discrete domains, so
> > they can easily optimize ">2 AND <4" to "3”.
> >
> > Sargs would be the preferred (therefore canonical) form for any AND or OR
> > list that has more than one comparison on the same operand (e.g. "x > 3 AND
> > x < 17 AND x <> 10”).
> >
> > This proposal would subsume IN and therefore we would stop supporting IN
> > in RexCall.
> >
> > Julian
> >
> > [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> > https://issues.apache.org/jira/browse/CALCITE-4155>
> >
> > [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
> > https://github.com/julianhyde/calcite/tree/4159-simplify>
> >
> > [3] https://en.wikipedia.org/wiki/Sargable <
> > https://en.wikipedia.org/wiki/Sargable>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Posted by Fan Liya <li...@gmail.com>.

Hi Julian,

Thanks for opening the discussion.

In general, I am convinced of the value and importance of the proposal.

I want to discuss more about the algebra involved, as the simplified range
sets may not have the minimum computational costs.
Some examples:

1. Suppose we have expression x in (1, 2, 4, 5, 6).

The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
4 and x <= 6), which represents 4 comparisons and 3 logical operations.
Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
which represents 5 comparisons and 4 logical operations.
It's hard to say which one has a smaller computational cost.

2. Suppose we have expression x in (1, 2, 4, 5)
The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
4 and x <= 5).
Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
I am not sure if the range set implementation supports this type of
simplification.

Therefore, to address above problems,
1. We may need a model to estimate the cost, and the simplification process
should be guided by the cost model.
2. Maybe we need to extend the range set implementation so it produces
expressions with minimum computational cost.

Best,
Liya Fan





On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <jh...@gmail.com> wrote:

> We have had several discussions over the years about how to represent
> IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
>
> I have generally taken the position that we should expand to ORs (e.g. “x
> = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> IN in RexCall.
>
> I have given this some further thought, as part of a couple of RexSimplify
> bugs [1] [2] and I now think we should replace IN with something more
> powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> but also ranges.
>
> Today you would write
>
>   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
>
> With sargs, you would instead write
>
>   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> [3..3], [5..5]]”)))
>
> There is a new operator IN_SARG, and a new node type RexSarg (it could
> almost be a RexLiteral).
>
> Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> we invest in sarg support in RexSimplify and RelOptPredicateList, that
> investment is likely to pay dividends in better plans.
>
> Guava RangeSets, and hence sargs, have support for discrete domains, so
> they can easily optimize ">2 AND <4" to "3”.
>
> Sargs would be the preferred (therefore canonical) form for any AND or OR
> list that has more than one comparison on the same operand (e.g. "x > 3 AND
> x < 17 AND x <> 10”).
>
> This proposal would subsume IN and therefore we would stop supporting IN
> in RexCall.
>
> Julian
>
> [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> https://issues.apache.org/jira/browse/CALCITE-4155>
>
> [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
> https://github.com/julianhyde/calcite/tree/4159-simplify>
>
> [3] https://en.wikipedia.org/wiki/Sargable <
> https://en.wikipedia.org/wiki/Sargable>