Posted to dev@calcite.apache.org by Junwei Li <ju...@gmail.com> on 2017/09/01 04:13:18 UTC

Two newbie questions regarding Calcite Cassandra Adapter

1. I happened to notice that the value of LIMIT can change the query plan.
For example, if I do explain plan for "select \"tweet_id\" from
\"userline\" limit 1", the plan is:
PLAN=EnumerableCalc(expr#0..2=[{inputs}], tweet_id=[$t2])
  EnumerableLimit(fetch=[1])
    CassandraToEnumerableConverter
      CassandraTableScan(table=[[twissandra, userline]])

But if the limit value is no smaller than 8, such as in "select \"tweet_id\" from
\"userline\" limit 8", the plan is:
PLAN=CassandraToEnumerableConverter
  CassandraLimit(fetch=[8])
    CassandraProject(tweet_id=[$2])
      CassandraTableScan(table=[[twissandra, userline]])

Since computeSelfCost() depends on the row count, I guess that is why the
query plans are different. But I want the query plan to stay the same as
the second one even when the limit value is smaller than 8. How should I
tweak computeSelfCost()?
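To make the cost question concrete, here is a toy sketch of how discounting the pushed-down nodes' self-cost keeps the plan stable. All class names, numbers, and the 0.1 discount are illustrative assumptions, not Calcite's actual cost classes:

```java
// Toy cost sketch: why a small LIMIT can flip the winning plan, and why
// discounting the pushed-down nodes' self-cost fixes it.  Everything here
// is an illustrative assumption, not Calcite's real API.
public class LimitCostSketch {
    static final double TABLE_ROWS = 100.0;

    // Plan A: full CassandraTableScan shipped up, EnumerableLimit on top.
    static double enumerablePlanCost(int fetch) {
        return TABLE_ROWS + fetch;
    }

    // Plan B: CassandraLimit pushed below the converter, with its
    // computeSelfCost() discounted by a multiplier (a common adapter trick
    // to make pushdown win regardless of the row-count estimate).
    static double pushedPlanCost(int fetch, double discount) {
        return (TABLE_ROWS + fetch) * discount;
    }

    public static void main(String[] args) {
        // With a discount such as 0.1, the pushed plan is cheaper for any
        // fetch value, so the plan no longer changes between limit 1 and 8.
        assert pushedPlanCost(1, 0.1) < enumerablePlanCost(1);
        assert pushedPlanCost(8, 0.1) < enumerablePlanCost(8);
        System.out.println("pushed plan wins for fetch 1 and 8");
    }
}
```

The point of the multiplier is that it scales with the row-count estimate, so the pushed plan stays cheaper no matter what value the optimizer guesses.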



2. If I do an inner join query that spans Cassandra and another
database, such as MongoDB, here is an example for illustration purposes:
                "with\n" +
                "tweet as (\n" +
                "    select \"tweet_id\" as id from \"twissandra\".\"userline\"\n" +
                "    where \"username\"='!PUBLIC!' order by \"time\"),\n" +
                "metrics as (\n" +
                "    select \"id\" as id, \"metricName\", \"value\"\n" +
                "    from \"mongodb\".METRIC\n" +
                "    where \"metricName\" = 'cpu.usage.average'\n" +
                ")\n" +
                "select tweet.id, metrics.\"value\" from tweet\n" +
                "inner join metrics using(id)\n" +
                "where metrics.\"value\" > 70"

According to CassandraProjectRule's constructor, super(LogicalProject.class,
"CassandraProjectRule"), it seems that as long as there is a
LogicalProject, CassandraProjectRule will get triggered. Does
this mean a LogicalProject that belongs to MongoDB will also get matched by
CassandraProjectRule? But when I do explain plan, the generated plan does
separate the projections correctly between Cassandra and MongoDB. How can
Calcite tell which projection belongs to which schema? Thanks.


Sincerely,
Junwei Li

Re: Two newbie questions regarding Calcite Cassandra Adapter

Posted by Michael Mior <mm...@uwaterloo.ca>.
I wasn't actually using the JDBC adapter as a reference when implementing
the Cassandra adapter. It should be possible to have both CassandraLimit
and CassandraSort in the same plan. That said, the adapter is still pretty
immature and it's been mostly just me working on it to have it do what
suits my needs. Improvements are definitely welcome.

As far as why CassandraLimit is sometimes not used, I would consider that a
bug. It may also be necessary to add a rule that matches Sort, as
EnumerableLimitRule does. A fix for that would be welcome, although
I'm unlikely to have the opportunity to do so myself any time soon.

--
Michael Mior
mmior@apache.org

2017-09-04 14:14 GMT-04:00 Luis Fernando Kauer <lfkauer@yahoo.com.br.invalid>:

Re: Two newbie questions regarding Calcite Cassandra Adapter

Posted by Luis Fernando Kauer <lf...@yahoo.com.br.INVALID>.
Hi,

I'll try to answer your questions since nobody else did.
But I'm still fairly new to Calcite and I'm still learning it.

I'll answer your questions in reverse order because the answer to the second question is important to answer the first question.

2. ConverterRules: how Calcite knows which LogicalProject conversion rule to apply
Calcite has the concept of "Convention", which determines the calling convention.  Relational expressions start with Convention.NONE and each adapter sets its own convention when implementing TranslatableTable.toRel.
CassandraRel implements a Convention for Cassandra.
MongoRel implements a Convention for MongoDB.
CassandraTable.toRel returns a CassandraTableScan with a traitSet of CassandraRel.CONVENTION.
If you check the rules, there are basically two types:
- Normal rule (extends RelOptRule): defined by "operands"
- Converter rule (extends ConverterRule): defined by a class and a traitSet (convention)
A converter rule uses the convention of its input to match.
CassandraProjectRule is a ConverterRule, so it matches only when the input has CassandraRel.CONVENTION. So it won't match a LogicalProject with MongoRel.CONVENTION.
The adapters also have a rule to convert between its convention and the 
EnumerableConvention.INSTANCE (CassandraToEnumerableConverter).
So the convention is the key to know which conversion rule to match and apply.
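The matching logic can be sketched with a toy model. All names below are hypothetical stand-ins; Calcite's RelTraitSet/ConverterRule machinery is richer, but the idea is the same: a converter rule only fires when a node's convention matches the rule's input convention.

```java
import java.util.Optional;

// Toy sketch of convention-based rule matching (hypothetical names,
// not Calcite's actual API).
public class ConventionSketch {
    enum Convention { NONE, CASSANDRA, MONGO, ENUMERABLE }

    // A relational node carries a convention in its trait set.
    record Node(Convention convention, String description) {}

    // A converter rule is keyed on an input and output convention and
    // fires only when the node's convention matches the input.
    record ToyConverterRule(Convention in, Convention out) {
        Optional<Node> apply(Node n) {
            return n.convention() == in
                    ? Optional.of(new Node(out, n.description()))
                    : Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Analogue of CassandraProjectRule: only fires for Cassandra inputs.
        ToyConverterRule rule =
                new ToyConverterRule(Convention.CASSANDRA, Convention.ENUMERABLE);

        Node cassandraProject = new Node(Convention.CASSANDRA, "userline project");
        Node mongoProject = new Node(Convention.MONGO, "METRIC project");

        assert rule.apply(cassandraProject).isPresent();  // matches
        assert rule.apply(mongoProject).isEmpty();        // ignored
        System.out.println("rule fired only for the Cassandra node");
    }
}
```

This is why the projections in the cross-database join end up in the right adapters: each adapter's rules simply never match nodes carrying the other adapter's convention.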

1. Different plans with a different "Limit" value

Looking into the calcite-cassandra adapter, I noticed that some of its rules are not conversion rules and that they apply to nodes already converted to Enumerable.
Take a look at CassandraLimitRule:
  super(operand(EnumerableLimit.class, operand(CassandraToEnumerableConverter.class, any())), "CassandraLimitRule");
The problem with this approach (IMHO) is that it depends heavily on the overall costs for CassandraLimit to be matched and chosen, but this kind of node should always be pushed down.
That's what is happening when the limit is 1 in your query. 
LogicalSort is used for sorting but also for limit (fetch) and offset.
The Cassandra adapter has a conversion rule for Sort, but it does not match when offset or limit is set.
I don't know the details of why this approach was chosen; maybe it was because the JDBC adapter was used as a reference, and it does not implement limit and offset correctly.
But I think CassandraSortRule could use the limit (fetch) and offset information and set it in the implement method, with no need for CassandraLimit.
Trying to solve this by only changing the costs is hard, because the planner generates many different plans and you'll have to make sure the costs of the pushed-down plan are always smaller than those of the plans that keep the enumerable nodes.
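The intuition that the limit "should always be pushed" can be made concrete with a toy row-transfer model (hypothetical numbers; this is not how Calcite costs plans, just why pushdown dominates):

```java
// Toy row-transfer model: why pushing a limit down to the server is never
// worse than applying it client-side.  Numbers are illustrative assumptions.
public class PushdownSketch {
    // Client-side limit: the server ships the whole result set.
    static long rowsShippedClientLimit(long tableRows, long fetch) {
        return tableRows;
    }

    // Pushed-down limit: the server stops after `fetch` rows.
    static long rowsShippedPushedLimit(long tableRows, long fetch) {
        return Math.min(tableRows, fetch);
    }

    public static void main(String[] args) {
        long tableRows = 1_000_000;
        // Pushing the limit is never worse, whatever the fetch value.
        for (long fetch : new long[] {1, 8, 1_000}) {
            assert rowsShippedPushedLimit(tableRows, fetch)
                    <= rowsShippedClientLimit(tableRows, fetch);
        }
        System.out.println(rowsShippedPushedLimit(tableRows, 8)); // prints 8
    }
}
```

Since the pushed plan dominates for every fetch value, it makes more sense to always produce it via a rule than to hope the cost model prefers it.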

Hope it helps.

Best regards,

Luis Fernando
