You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2020/01/03 20:21:09 UTC

[DISCUSS] CALCITE-3661, CALCITE-3665, MaterializationTest vs HR schema statistics

Hi,

It looks like MaterializationTest heavily relies on inaccurate statistics
for hr.emps and hr.depts tables.

I was trying to improve statistic estimation for better join planning (see
https://github.com/apache/calcite/pull/1712 ),
and it looks like better estimates open the eyes of the optimizer, and now
it realizes it does not really need to use materialized view
for 4-row long table.

In other words, the cost of the table access is more-or-less the same as
the cost of the materialized view access.

It looks like the way to go here is to add hr_with_extra_rows scheme so it
contains the same emps and depts tables, but it should
have bigger tables.
Adding rows to the existing emps table is not an option because it would
invalidate lots of tests.

Does anybody have better ideas?

Vladimir

Re: [DISCUSS] CALCITE-3661, CALCITE-3665, MaterializationTest vs HR schema statistics

Posted by Vladimir Sitnikov <si...@gmail.com>.
Jin>In ReflectiveSchema, Statistics of FieldTable is given as UNKNOWN[1][2].

Please check    [CALCITE-3661] Derive rowCount statistics for tables in
ReflectiveSchema that are based on arrays/collections
and [CALCITE-3680] Add ability to express unique constraints in
ReflectiveSchema
commits in https://github.com/apache/calcite/pull/1702/commits

The commits enable the optimizer to see the proper row count for tables in
ReflectiveSchema.

Vladimir

Re: [DISCUSS] CALCITE-3661, CALCITE-3665, MaterializationTest vs HR schema statistics

Posted by XING JIN <ji...@gmail.com>.
Hi, Vladimir ~

In ReflectiveSchema, Statistics of FieldTable is given as UNKNOWN[1][2].
When reading a table's row count, if no statistics given, a default value
of 100 will be returned [3] -- this is relatively a bigger value compared
with the fields defined in HRFKUKSchema.
When a materialized view gets matched, view-sql is executed and the values
are wrapped in an ArrayTable and accurate row count is given [4].
So I'm not sure when a materialized view containing JOIN gets matched but
cannot help reduce cost of the plan.

HRFKUKSchema is only used in MaterializationTest. There's no existing test
checking content of the query result. Most of them checks whether same
results are returned no matter if materialized view is used or not. If we
add rows to existing emps table, how can tests be invalidated ?

Best,
Jin

[1]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/adapter/java/ReflectiveSchema.java#L369
[2]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/schema/Statistics.java#L40
[3]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/prepare/RelOptTableImpl.java#L239
[4]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/adapter/clone/ArrayTable.java#L82

Vladimir Sitnikov <si...@gmail.com> 于2020年1月4日周六 上午4:21写道:

> Hi,
>
> It looks like MaterializationTest heavily relies on inaccurate statistics
> for hr.emps and hr.depts tables.
>
> I was trying to improve statistic estimation for better join planning (see
> https://github.com/apache/calcite/pull/1712 ),
> and it looks like better estimates open the eyes of the optimizer, and now
> it realizes it does not really need to use materialized view
> for 4-row long table.
>
> In other words, the cost of the table access is more-or-less the same as
> the cost of the materialized view access.
>
> It looks like the way to go here is to add hr_with_extra_rows scheme so it
> contains the same emps and depts tables, but it should
> have bigger tables.
> Adding rows to the existing emps table is not an option because it would
> invalidate lots of tests.
>
> Does anybody have better ideas?
>
> Vladimir
>