Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2014/11/09 16:03:58 UTC

Calcite fullscan vs indexscan

Hi,

I am having trouble implementing indexed access via Calcite.
Can you please guide me?

Here's the problem statement:
1) I have "table full scans" working.
2) I want Calcite to transform joins into nested loops with a "lookup by
id" inner loop.

Here's a sample query: https://github.com/vlsi/mat-calcite-plugin#join-sample
explain plan for
 select u."@ID", s."@RETAINED"
   from "java.lang.String" s
   join "java.net.URL" u
     on (s."@ID" = get_id(u.path))

The "@ID" column is a primary key, so I want Calcite to generate the
following plan: Filter(NestedLoops(Scan("java.net.URL" u),
FetchObjectBy(get_id(u.path))), get_class(s)=="java.lang.String")

The current plan is just a join of two "full scans" :(

My "storage engine" is a Java library (Eclipse Memory Analyzer, in
fact), so the ideal generated code would look like this:
for(IObject url: snapshot.getObjectsByClass("java.net.URL")){
  IObject path = (IObject) url.resolveValue("path");
  pipe row(url.getObjectId(), path.getRetainedHeapSize()); // return results
}
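
To make the difference between the two plans concrete, here is a minimal, self-contained sketch of a join of two full scans versus nested loops with a lookup by id. It is a toy model under stated assumptions: HeapObject, sampleHeap, scan, joinByLookup and joinByFullScans are hypothetical names, the heap is a plain in-memory map, and nothing here is Eclipse MAT or Calcite API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a heap; every name here is hypothetical, not MAT or Calcite API. */
class IndexLookupJoinSketch {

    /** A heap object: primary key, class name, retained size, id of its "path" field. */
    static final class HeapObject {
        final int id;
        final String className;
        final long retained;
        final int pathId; // -1 when the object has no "path" reference

        HeapObject(int id, String className, long retained, int pathId) {
            this.id = id;
            this.className = className;
            this.retained = retained;
            this.pathId = pathId;
        }
    }

    /** Builds a tiny heap keyed by object id; the map plays the role of the primary-key index. */
    static Map<Integer, HeapObject> sampleHeap() {
        HeapObject[] objects = {
            new HeapObject(1, "java.lang.String", 40, -1),
            new HeapObject(2, "java.lang.String", 56, -1),
            new HeapObject(3, "java.net.URL", 100, 1),
            new HeapObject(4, "java.net.URL", 100, 2),
            new HeapObject(5, "java.net.URL", 100, 6), // path is not a String
            new HeapObject(6, "byte[]", 24, -1),
        };
        Map<Integer, HeapObject> heap = new LinkedHashMap<>();
        for (HeapObject o : objects) {
            heap.put(o.id, o);
        }
        return heap;
    }

    /** Full scan restricted to one class, like Scan("java.net.URL"). */
    static List<HeapObject> scan(Map<Integer, HeapObject> heap, String className) {
        List<HeapObject> result = new ArrayList<>();
        for (HeapObject o : heap.values()) {
            if (o.className.equals(className)) {
                result.add(o);
            }
        }
        return result;
    }

    /** Desired plan: scan URLs, fetch each path by id, then filter by class. */
    static List<List<Long>> joinByLookup(Map<Integer, HeapObject> heap) {
        List<List<Long>> rows = new ArrayList<>();
        for (HeapObject url : scan(heap, "java.net.URL")) {
            HeapObject s = heap.get(url.pathId); // single lookup by primary key
            if (s != null && s.className.equals("java.lang.String")) {
                rows.add(Arrays.asList((long) url.id, s.retained));
            }
        }
        return rows;
    }

    /** Current plan: join of two full scans, comparing every pair of rows. */
    static List<List<Long>> joinByFullScans(Map<Integer, HeapObject> heap) {
        List<List<Long>> rows = new ArrayList<>();
        for (HeapObject s : scan(heap, "java.lang.String")) {
            for (HeapObject url : scan(heap, "java.net.URL")) {
                if (s.id == url.pathId) {
                    rows.add(Arrays.asList((long) url.id, s.retained));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, HeapObject> heap = sampleHeap();
        System.out.println(joinByLookup(heap));
        System.out.println(joinByFullScans(heap));
    }
}
```

On this sample data both methods return the same rows; the lookup variant does one map access per URL instead of comparing every URL with every String.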

Here's what I did:
1) I found NestedLoopsJoinRule that seems to generate the required
kind of plan. I have no idea why the rule is disabled by default.
2) However, I find no "EnumerableCorrelatorRel", so it looks like I
would get that "cannotplan" exception even if I create my
CorrelatorRel("@ID"=get_id) rule.

3) Another idea of mine is to match JoinRel(MyRel, MyRel) and replace the
second argument with a TableFunction, so the final plan would be
Join(Scan("java.net.URL" u), TableFunction("getObject", get_id(u.path)))
Using the table function machinery to retrieve a single row looks like
overkill.

This leads to the following questions:
1) What is the suggested way to implement this kind of optimization?
2) Why is there no such thing as EnumerableCorrelatorRel?

-- 
Regards,
Vladimir Sitnikov

Re: Calcite fullscan vs indexscan

Posted by Vladimir Sitnikov <si...@gmail.com>.

> sargs
That is an interesting part of the config.
Am I right that SargFactory is there to "figure out which expressions can be
pushed down in terms of simple conditions"?

However, I guess some notion of a "supported set of features" is required.
For instance, Eclipse MAT does not support range scans, just "unique scans".
The current "boolean simpleMode" is too restrictive: it does not allow
enabling ORs while disabling range predicates.
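
As an illustration of what a "supported set of features" could look like in place of a single boolean, here is a minimal sketch. It is not SargFactory or any other Calcite API: the Feature enum, requiredFeatures and canPushDown are hypothetical names, and a predicate is reduced to the list of operators it uses.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical capability model; not part of Calcite. */
class PushDownFeaturesSketch {

    /** Kinds of simple conditions a storage engine might accept. */
    enum Feature { UNIQUE_LOOKUP, RANGE_SCAN, OR_LIST }

    /** Maps the operators used by a predicate to the features needed to push it down. */
    static Set<Feature> requiredFeatures(List<String> operators) {
        Set<Feature> required = EnumSet.noneOf(Feature.class);
        for (String op : operators) {
            switch (op) {
                case "=":
                    required.add(Feature.UNIQUE_LOOKUP);
                    break;
                case "<":
                case "<=":
                case ">":
                case ">=":
                    required.add(Feature.RANGE_SCAN);
                    break;
                case "OR":
                    required.add(Feature.OR_LIST);
                    break;
                default:
                    throw new IllegalArgumentException("unknown operator: " + op);
            }
        }
        return required;
    }

    /** A predicate can be pushed down only if the engine supports every feature it needs. */
    static boolean canPushDown(List<String> operators, Set<Feature> supported) {
        return supported.containsAll(requiredFeatures(operators));
    }

    public static void main(String[] args) {
        // An engine with unique lookups and ORs but no range scans.
        Set<Feature> supported = EnumSet.of(Feature.UNIQUE_LOOKUP, Feature.OR_LIST);
        System.out.println(canPushDown(Arrays.asList("=", "OR", "="), supported)); // true
        System.out.println(canPushDown(Arrays.asList(">"), supported));            // false
    }
}
```

With a set rather than a flag, "ORs enabled, range predicates disabled" is simply one of the possible combinations.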

>That would be a useful exercise. I would be curious what you find works best.

It looks like I have two options to represent my "index" table in the model:
1) Table function
2) ProjectableFilterableTable

When I tried the first approach, I found that this area is not yet explored:
1.1) Left correlation of a table function does not work:
   select * from emp e, table(calc_dept(e.deptno)) t
https://issues.apache.org/jira/browse/CALCITE-463

1.2) If I rewrite it as a lateral expression, it fails at runtime:
cannot translate expression $cor0.c at
net.hydromatic.optiq.rules.java.RexToLixTranslator.translate0
https://issues.apache.org/jira/browse/CALCITE-462

I will try to go with EnumeratorCorrelatorRel first and look into the
left-correlation/lateral issues later (hopefully after the renaming is
done).

Vladimir

Re: Calcite fullscan vs indexscan

Posted by Julian Hyde <ju...@hydromatic.net>.

On Nov 11, 2014, at 12:18 PM, Vladimir Sitnikov <si...@gmail.com> wrote:

>> You could add Programs.VLADIMIR_RULES. :)
> The problem is what to put there.
> 
> Am I right that those rules have no advantage over the rules
> registered in the RelNode.register(RelOptPlanner planner) callback?

No one has curated those rule sets very carefully. You are welcome to re-organize them; you just need to make sure that the tests pass. Or make your own rule set, with just the rules you need, and someone else will re-organize it in the future.

>> I had thought of it — not just for indexes, but also for lateral join into nested collections
> 
> I also thought of lateral join.
> Will see what will be easier for me: EnumeratorCorrelatorRel or table functions.

That would be a useful exercise. I would be curious what you find works best.

Julian

Re: Calcite fullscan vs indexscan

Posted by Vladimir Sitnikov <si...@gmail.com>.

>You could add Programs.VLADIMIR_RULES. :)
The problem is what to put there.

Am I right that those rules have no advantage over the rules
registered in the RelNode.register(RelOptPlanner planner) callback?

> I had thought of it — not just for indexes, but also for lateral join into nested collections

I also thought of lateral join.
Will see what will be easier for me: EnumeratorCorrelatorRel or table functions.

Vladimir

Re: Calcite fullscan vs indexscan

Posted by Julian Hyde <ju...@hydromatic.net>.

On Nov 11, 2014, at 10:51 AM, Vladimir Sitnikov <si...@gmail.com> wrote:

>> How about creating your set of core rules as a data structure in Programs?
> 
> Not sure I get this. What do you mean by Program?

The class org.apache.calcite.tools.Programs contains utilities for building instances of org.apache.calcite.tools.Program, and it has some pre-built programs and rule-sets, such as Programs.CALC_RULES. You could add Programs.VLADIMIR_RULES. :)

> I do not mind having some rules that I enable on-demand. I just do not
> see the road ahead.
> 
>> Also, there isn’t a version of EnumerableScan that takes correlating variables (in sargs) and without that, there’s no point in doing a re-start — you’d get the same results every time
> 
> Is that intentional or just because no one asked that before?

No one asked before. I had thought of it, not just for indexes but also for lateral join into nested collections (e.g. JdbcTest.Department.employees), tables based on parameterized web-service calls, and now potentially tables based on ProjectableFilterableTable, but it never rose to the top of my list. I’m delighted that you are going there...

Julian

Re: Calcite fullscan vs indexscan

Posted by Vladimir Sitnikov <si...@gmail.com>.

>How about creating your set of core rules as a data structure in Programs?

Not sure I get this. What do you mean by Program?

I do not mind having some rules that I enable on-demand. I just do not
see the road ahead.

> Also, there isn’t a version of EnumerableScan that takes correlating variables (in sargs) and without that, there’s no point in doing a re-start — you’d get the same results every time

Is that intentional or just because no one asked that before?

Vladimir

Re: Calcite fullscan vs indexscan

Posted by Julian Hyde <ju...@gmail.com>.

I think you’re going about it the right way. Index lookups are traditionally optimized by treating them as a join between two tables. You transform a lateral join (which is correlated) into a CorrelatorRel.

There isn’t an EnumeratorCorrelatorRel, and we didn’t include NestedLoopsJoinRule, because correlations require re-starts, and re-starts are no good for analytic queries, especially distributed ones. But what you suggest is totally valid. Also, there isn’t a version of EnumerableScan that takes correlating variables (in sargs) and without that, there’s no point in doing a re-start — you’d get the same results every time.
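
The point about re-starts can be illustrated with a minimal sketch. It does not use Calcite classes: correlate, lookupByKey and scanAll are hypothetical names, and the "inner input" is just a function that is invoked again for every outer row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical model of a correlated nested-loops join; not Calcite API. */
class CorrelatedRestartSketch {

    /** Inner table: key -> value. */
    static final Map<Integer, String> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put(1, "a");
        TABLE.put(2, "b");
    }

    /** Inner input that uses the correlating variable: an index lookup. */
    static List<String> lookupByKey(int key) {
        String value = TABLE.get(key);
        return value == null
            ? Collections.<String>emptyList()
            : Collections.singletonList(value);
    }

    /** Inner input that ignores the correlating variable: a plain full scan. */
    static List<String> scanAll(int ignoredKey) {
        return new ArrayList<>(TABLE.values());
    }

    /** Re-starts the inner input once per outer row, passing the correlating value. */
    static List<String> correlate(List<Integer> outer, IntFunction<List<String>> inner) {
        List<String> rows = new ArrayList<>();
        for (int key : outer) {
            for (String match : inner.apply(key)) { // one re-start per outer row
                rows.add(key + ":" + match);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> outer = Arrays.asList(1, 2);
        // Each re-start returns only the matching row.
        System.out.println(correlate(outer, CorrelatedRestartSketch::lookupByKey));
        // Each re-start returns the same rows, so re-starting buys nothing.
        System.out.println(correlate(outer, CorrelatedRestartSketch::scanAll));
    }
}
```

When the inner input ignores the correlating value, every re-start repeats the same scan and the result is just a cross product, which is why a scan that accepts correlating variables is the missing piece.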

What you are doing with Eclipse MAT is a valid and very cool use of Calcite, but it needs different rules. How about creating your set of core rules as a data structure in Programs? Then we can write some tests for them and ensure that they continue to work together.

Julian
