Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2014/11/09 16:03:58 UTC
Calcite fullscan vs indexscan
Hi,
I am having trouble implementing indexed access via Calcite.
Can you please guide me?
Here's the problem statement:
1) I have "table full scans" working.
2) I want Calcite to transform joins into nested-loops with "lookup by
id" inner loop.
Here's a sample query: https://github.com/vlsi/mat-calcite-plugin#join-sample
explain plan for
select u."@ID", s."@RETAINED"
from "java.lang.String" s
join "java.net.URL" u
on (s."@ID" = get_id(u.path))
The "@ID" column is a primary key, so I want Calcite to generate the
following plan: Filter(NestedLoops(Scan("java.net.URL" u),
FetchObjectBy(get_id(u.path))), get_class(s)=="java.lang.String")
The current plan is just a join of two "full scans" :(
My "storage engine" is a java library (Eclipse Memory Analyzer in
fact), thus the perfect generated code would be as follows:
for (IObject url : snapshot.getObjectsByClass("java.net.URL")) {
    IObject path = (IObject) url.resolveValue("path");
    pipe row(url.getObjectId(), path.getRetainedHeapSize()); // return results
}
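
To make the difference between the two plans concrete, here is a minimal plain-Java sketch. It does not use Eclipse MAT or Calcite at all: HeapObject, ToyHeap and the sample data are hypothetical stand-ins, and the map plays the role of the primary-key index on "@ID". It only illustrates that a join of two full scans and a nested loop with a lookup-by-id inner side return the same rows, while the second touches one inner object per outer row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a heap snapshot. All names here are hypothetical;
// nothing below is Eclipse MAT or Calcite API.
final class HeapObject {
    final int id;            // plays the role of "@ID"
    final String className;
    final long retained;     // plays the role of "@RETAINED"
    final int pathId;        // id of the object behind the "path" field, -1 if none

    HeapObject(int id, String className, long retained, int pathId) {
        this.id = id;
        this.className = className;
        this.retained = retained;
        this.pathId = pathId;
    }
}

final class ToyHeap {
    // Primary-key index: object id -> object.
    private final Map<Integer, HeapObject> byId = new LinkedHashMap<>();

    void add(HeapObject o) {
        byId.put(o.id, o);
    }

    // Full scan: visits every object and keeps those of one class.
    List<HeapObject> scanByClass(String className) {
        List<HeapObject> result = new ArrayList<>();
        for (HeapObject o : byId.values()) {
            if (o.className.equals(className)) {
                result.add(o);
            }
        }
        return result;
    }

    // Join of two full scans: every URL is compared with every String.
    List<long[]> joinFullScans() {
        List<long[]> rows = new ArrayList<>();
        List<HeapObject> strings = scanByClass("java.lang.String");
        for (HeapObject u : scanByClass("java.net.URL")) {
            for (HeapObject s : strings) {
                if (s.id == u.pathId) {
                    rows.add(new long[] {u.id, s.retained});
                }
            }
        }
        return rows;
    }

    // Nested loops with a lookup-by-id inner side, then the class filter.
    List<long[]> joinByLookup() {
        List<long[]> rows = new ArrayList<>();
        for (HeapObject u : scanByClass("java.net.URL")) {
            HeapObject s = byId.get(u.pathId); // unique scan on the primary key
            if (s != null && s.className.equals("java.lang.String")) {
                rows.add(new long[] {u.id, s.retained});
            }
        }
        return rows;
    }

    // Hypothetical sample data: two Strings, three URLs, one byte[].
    static ToyHeap sample() {
        ToyHeap heap = new ToyHeap();
        heap.add(new HeapObject(1, "java.lang.String", 40, -1));
        heap.add(new HeapObject(2, "java.lang.String", 56, -1));
        heap.add(new HeapObject(3, "java.net.URL", 100, 1));
        heap.add(new HeapObject(4, "java.net.URL", 100, 2));
        heap.add(new HeapObject(5, "java.net.URL", 100, 6)); // path is not a String
        heap.add(new HeapObject(6, "byte[]", 16, -1));
        return heap;
    }

    public static void main(String[] args) {
        ToyHeap heap = sample();
        System.out.println(heap.joinFullScans().size() + " rows via full scans");
        System.out.println(heap.joinByLookup().size() + " rows via lookup by id");
    }
}
```

The class filter in joinByLookup corresponds to the outer Filter(..., get_class(s)=="java.lang.String") in the desired plan: the lookup itself knows nothing about classes, so the check has to happen after the fetch.
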
Here's what I did:
1) I found NestedLoopsJoinRule that seems to generate the required
kind of plan. I have no idea why the rule is disabled by default.
2) However, I find no "EnumerableCorrelatorRel", so it looks like I
would get that "cannotplan" exception even if I create my own
CorrelatorRel("@ID"=get_id) rule.
3) Another idea of mine is to match JoinRel(MyRel, MyRel) and replace the
second argument with a TableFunction, so the final plan would be
Join(Scan("java.net.URL" u), TableFunction("getObject", get_id(u.path)))
Using table function machinery to retrieve a single row looks like
overkill.
This ends up in the following questions:
1) What is the suggested way to implement this kind of optimization?
2) Why is there no such thing as EnumerableCorrelatorRel?
--
Regards,
Vladimir Sitnikov
Re: Calcite fullscan vs indexscan
Posted by Vladimir Sitnikov <si...@gmail.com>.
> sargs
That is an interesting part of a config.
Am I right that SargFactory is meant to "figure out which expressions can be
pushed down in terms of simple conditions"?
However, I guess some notion of a "supported set of features" is required.
For instance, Eclipse MAT does not support range scans, just "unique scans".
The current "boolean simpleMode" is too restrictive: it does not allow
enabling ORs while disabling range predicates.
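
One way to picture a "supported set of features" in place of a single boolean is a capability set that each predicate is checked against. The sketch below is plain Java with hypothetical names (Feature, Pred, IndexCapabilities); it is not SargFactory's API and makes no claim about how Calcite would implement it.

```java
import java.util.EnumSet;

// Hypothetical names throughout; this is not Calcite's SargFactory API.
enum Feature { POINT_LOOKUP, RANGE, OR }

// Minimal predicate over one indexed column. Only the features a predicate
// needs are modelled; the literal values are irrelevant to the check.
interface Pred {
    EnumSet<Feature> required();

    static Pred eq(long value) {
        return () -> EnumSet.of(Feature.POINT_LOOKUP);
    }

    static Pred between(long low, long high) {
        return () -> EnumSet.of(Feature.RANGE);
    }

    static Pred or(Pred left, Pred right) {
        return () -> {
            EnumSet<Feature> needed = EnumSet.of(Feature.OR);
            needed.addAll(left.required());
            needed.addAll(right.required());
            return needed;
        };
    }
}

// What a particular storage engine declares it can evaluate natively.
final class IndexCapabilities {
    private final EnumSet<Feature> supported;

    IndexCapabilities(EnumSet<Feature> supported) {
        this.supported = EnumSet.copyOf(supported);
    }

    // A predicate can be pushed down only if every feature it needs is supported.
    boolean canPushDown(Pred predicate) {
        return supported.containsAll(predicate.required());
    }
}
```

An engine like the one described here would declare EnumSet.of(Feature.POINT_LOOKUP, Feature.OR): "id = 1 OR id = 2" could then be pushed down, while a range predicate would stay in a filter above the scan.
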
>That would be a useful exercise. I would be curious what you find works best.
It looks like I have two options to represent my "index" table in the model:
1) Table function
2) ProjectableFilterableTable
When I tried the first approach, I found this area is not yet explored:
1.1) Left correlation of a table function does not work:
select * from emp e, table(calc_dept(e.deptno)) t
https://issues.apache.org/jira/browse/CALCITE-463
1.2) If I rewrite it as a lateral expression, then it fails at runtime:
cannot translate expression $cor0.c at
net.hydromatic.optiq.rules.java.RexToLixTranslator.translate0
https://issues.apache.org/jira/browse/CALCITE-462
I will try to go with EnumerableCorrelatorRel first and look into the
left-correlation/lateral issues later (hopefully after the renaming is
done).
Vladimir
Re: Calcite fullscan vs indexscan
Posted by Julian Hyde <ju...@hydromatic.net>.
On Nov 11, 2014, at 12:18 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
>> You could add Programs.VLADIMIR_RULES. :)
> The problem is what to put there.
>
> Am I right that those rules have no advantage over the rules
> registered in the RelNode.register(RelOptPlanner planner) callback?
No one has curated those rule sets very carefully. You are welcome to re-organize them - you just need to make sure that the tests pass. Or make your own rule set, with just the rules you need, and someone else will re-organize in future.
>> I had thought of it — not just for indexes, but also for lateral join into nested collections
>
> I also thought of lateral join.
> Will see what will be easier for me: EnumerableCorrelatorRel or table functions.
That would be a useful exercise. I would be curious what you find works best.
Julian
Re: Calcite fullscan vs indexscan
Posted by Vladimir Sitnikov <si...@gmail.com>.
>You could add Programs.VLADIMIR_RULES. :)
The problem is what to put there.
Am I right that those rules have no advantage over the rules
registered in the RelNode.register(RelOptPlanner planner) callback?
> I had thought of it — not just for indexes, but also for lateral join into nested collections
I also thought of lateral join.
Will see what will be easier for me: EnumerableCorrelatorRel or table functions.
Vladimir
Re: Calcite fullscan vs indexscan
Posted by Julian Hyde <ju...@hydromatic.net>.
On Nov 11, 2014, at 10:51 AM, Vladimir Sitnikov <si...@gmail.com> wrote:
>> How about creating your set of core rules as a data structure in Programs?
>
> Not sure I get this. What do you mean by Program?
The class org.apache.calcite.tools.Programs contains utilities for building instances of org.apache.calcite.tools.Program, and it has some pre-built programs and rule-sets, such as Programs.CALC_RULES. You could add Programs.VLADIMIR_RULES. :)
> I do not mind having some rules that I enable on-demand. I just do not
> see the road ahead.
>
>> Also, there isn’t a version of EnumerableScan that takes correlating variables (in sargs) and without that, there’s no point in doing a re-start — you’d get the same results every time
>
> Is that intentional or just because no one asked that before?
No one asked before. I had thought of it, not just for indexes but also for lateral join into nested collections (e.g. JdbcTest.Department.employees), tables based on parameterized web-service calls, and now potentially tables based on ProjectableFilterableTable, but it never rose to the top of my list. I’m delighted that you are going there...
Julian
Re: Calcite fullscan vs indexscan
Posted by Vladimir Sitnikov <si...@gmail.com>.
>How about creating your set of core rules as a data structure in Programs?
Not sure I get this. What do you mean by Program?
I do not mind having some rules that I enable on-demand. I just do not
see the road ahead.
> Also, there isn’t a version of EnumerableScan that takes correlating variables (in sargs) and without that, there’s no point in doing a re-start — you’d get the same results every time
Is that intentional or just because no one asked that before?
Vladimir
Re: Calcite fullscan vs indexscan
Posted by Julian Hyde <ju...@gmail.com>.
I think you’re going about it the right way. Index lookups are traditionally optimized by treating them as a join between two tables. You transform a lateral join (which is correlated) into a CorrelatorRel.
There isn’t an EnumerableCorrelatorRel, and we didn’t include NestedLoopsJoinRule, because correlations require re-starts, and re-starts are no good for analytic queries, especially distributed ones. But what you suggest is totally valid. Also, there isn’t a version of EnumerableScan that takes correlating variables (in sargs), and without that there’s no point in doing a re-start: you’d get the same results every time.
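
The point about re-starts can be illustrated with a small plain-Java sketch. It is not Calcite code: CorrelatedLoop, InnerScans and the sample ids are hypothetical. The inner side is re-started once per outer row; a scan that ignores the correlating value returns the same rows on every re-start, while a scan that accepts it can return just the matching row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical inner-side implementations over a fixed set of ids.
final class InnerScans {
    static final List<Integer> IDS = Arrays.asList(10, 20, 30);

    // Ignores the correlating value: every re-start yields the same rows.
    static List<Integer> fullScan(int correlatedId) {
        return IDS;
    }

    // Uses the correlating value: at most one row per re-start.
    static List<Integer> lookupById(int correlatedId) {
        if (IDS.contains(correlatedId)) {
            return Collections.singletonList(correlatedId);
        }
        return Collections.emptyList();
    }
}

// Correlated nested loop: the inner scan is re-started for each outer key.
final class CorrelatedLoop {
    final List<String> joined = new ArrayList<>();
    int innerRowsRead;

    void run(List<Integer> outerKeys, IntFunction<List<Integer>> innerScan) {
        for (int key : outerKeys) {
            List<Integer> innerRows = innerScan.apply(key); // one re-start per outer row
            innerRowsRead += innerRows.size();
            for (int id : innerRows) {
                if (id == key) {                            // join condition
                    joined.add(key + "=" + id);
                }
            }
        }
    }
}
```

Running the loop with InnerScans::fullScan and with InnerScans::lookupById over the same outer keys produces the same joined rows, but innerRowsRead grows with outer size times inner size in the first case and with at most the outer size in the second.
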
What you are doing with Eclipse MAT is a valid and very cool use of Calcite, but it needs different rules. How about creating your set of core rules as a data structure in Programs? Then we can write some tests for them and ensure that they continue to work together.
Julian