You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Vladimir Sitnikov <si...@gmail.com> on 2020/01/03 20:21:09 UTC
[DISCUSS] CALCITE-3661, CALCITE-3665, MaterializationTest vs HR
schema statistics
Hi,
It looks like MaterializationTest heavily relies on inaccurate statistics
for hr.emps and hr.depts tables.
I was trying to improve statistic estimation for better join planning (see
https://github.com/apache/calcite/pull/1712 ),
and it looks like better estimates open the eyes of the optimizer, and now
it realizes it does not really need to use materialized view
for 4-row long table.
In other words, the cost of the table access is more-or-less the same as
the cost of the materialized view access.
It looks like the way to go here is to add hr_with_extra_rows scheme so it
contains the same emps and depts tables, but it should
have bigger tables.
Adding rows to the existing emps table is not an option because it would
invalidate lots of tests.
Does anybody have better ideas?
Vladimir
Re: [DISCUSS] CALCITE-3661, CALCITE-3665, MaterializationTest vs HR
schema statistics
Posted by Vladimir Sitnikov <si...@gmail.com>.
Jin>In ReflectiveSchema, Statistics of FieldTable is given as UNKNOWN[1][2].
Please check [CALCITE-3661] Derive rowCount statistics for tables in
ReflectiveSchema that are based on arrays/collections
and [CALCITE-3680] Add ability to express unique constraints in
ReflectiveSchema
commits in https://github.com/apache/calcite/pull/1702/commits
The commits enable the optimizer to see the proper row count for tables in
ReflectiveSchema.
Vladimir
Re: [DISCUSS] CALCITE-3661, CALCITE-3665, MaterializationTest vs HR
schema statistics
Posted by XING JIN <ji...@gmail.com>.
Hi, Vladimir ~
In ReflectiveSchema, Statistics of FieldTable is given as UNKNOWN[1][2].
When reading a table's row count, if no statistics given, a default value
of 100 will be returned [3] -- this is relatively a bigger value compared
with the fields defined in HRFKUKSchema.
When a materialized view gets matched, view-sql is executed and the values
are wrapped in an ArrayTable and accurate row count is given [4].
So I'm not sure when a materialized view containing JOIN gets matched but
cannot help reduce cost of the plan.
HRFKUKSchema is only used in MaterializationTest. There's no existing test
checking content of the query result. Most of them checks whether same
results are returned no matter if materialized view is used or not. If we
add rows to existing emps table, how can tests be invalidated ?
Best,
Jin
[1]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/adapter/java/ReflectiveSchema.java#L369
[2]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/schema/Statistics.java#L40
[3]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/prepare/RelOptTableImpl.java#L239
[4]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/adapter/clone/ArrayTable.java#L82
Vladimir Sitnikov <si...@gmail.com> 于2020年1月4日周六 上午4:21写道:
> Hi,
>
> It looks like MaterializationTest heavily relies on inaccurate statistics
> for hr.emps and hr.depts tables.
>
> I was trying to improve statistic estimation for better join planning (see
> https://github.com/apache/calcite/pull/1712 ),
> and it looks like better estimates open the eyes of the optimizer, and now
> it realizes it does not really need to use materialized view
> for 4-row long table.
>
> In other words, the cost of the table access is more-or-less the same as
> the cost of the materialized view access.
>
> It looks like the way to go here is to add hr_with_extra_rows scheme so it
> contains the same emps and depts tables, but it should
> have bigger tables.
> Adding rows to the existing emps table is not an option because it would
> invalidate lots of tests.
>
> Does anybody have better ideas?
>
> Vladimir
>