Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/09/14 09:39:00 UTC
[jira] [Commented] (FLINK-6516) using real row count instead of
dummy row count when optimizing plan
[ https://issues.apache.org/jira/browse/FLINK-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166006#comment-16166006 ]
ASF GitHub Bot commented on FLINK-6516:
---------------------------------------
Github user godfreyhe commented on a diff in the pull request:
https://github.com/apache/flink/pull/3860#discussion_r138843966
--- Diff: flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/PhysicalTableSourceScan.scala ---
@@ -70,4 +73,18 @@ abstract class PhysicalTableSourceScan(
def copy(traitSet: RelTraitSet, tableSource: TableSource[_]): PhysicalTableSourceScan
+ override def estimateRowCount(mq: RelMetadataQuery): Double = {
+ val tableSourceTable = getTable.unwrap(classOf[TableSourceTable[_]])
+
+ if (tableSourceTable.getStatistic != FlinkStatistic.UNKNOWN) {
--- End diff --
Hi @fhueske, sorry for the late response. For a while my work was not focused on the Table API & SQL; I will now refocus on it.
There are mainly 4 cases:
1. `DataSetTable` or `DataStreamTable`: default statistics (row count = 1000).
2. `TableSourceTable` is registered without a catalog: there are no statistics at the moment. (We can get statistics from the `TableSource`.)
3. `TableSourceTable` is in a catalog, and the catalog contains statistics.
3.1. If the `TableSource` is a filterable (or partitionable) `TableSource`, we may no longer be able to use the catalog's statistics and should use the `TableSource` statistics instead (for example, a Parquet table source).
3.2. If the `TableSource` is a non-filterable (or non-partitionable) `TableSource`, we prefer the catalog's statistics because they are more efficient to access.
4. `TableSourceTable` is in a catalog, but the catalog does not have statistics. (Get statistics from the `TableSource`.)
Furthermore, the statistics in the catalog may be wrong, and some users may want the statistics from the `TableSource` instead.
So it is too difficult to let the framework choose the statistics source. I would prefer to let the user choose it. There are two approaches:
1. A simple way: add a config option (`var preferCatalogStats: Boolean`) to `TableConfig`, so the user can choose the preferred statistics source through the table config. If `preferCatalogStats` is true, the framework uses the catalog statistics first; if those statistics are null (or unknown), the framework uses the `TableSource` statistics. If `preferCatalogStats` is false, the access order is reversed.
2. A complex way: let the user decide the statistics source when registering the table. We can change the `registerTable` and `registerTableSource` methods in `TableEnvironment`:
```
// Register a table with a statistic; the framework will always use the given statistic.
def registerTable(name: String, table: Table, statistic: FlinkStatistic = FlinkStatistic.of(TableStats(1000L))): Unit
// Register a table source with the user's preferred statistics source.
def registerTableSource(name: String, tableSource: TableSource[_], preferCatalogStats: Boolean = false): Unit
// Register a table source with a statistic; the framework will always use the given statistic.
def registerTableSource(name: String, tableSource: TableSource[_], statistic: FlinkStatistic): Unit
```
This way, users can choose the statistics source and supply more accurate statistics for each table.
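For the simple option (the `preferCatalogStats` flag), the lookup order could be sketched roughly as follows. This is only an illustration: `Stats`, `StatsResolver`, and the `Option`-typed parameters are hypothetical stand-ins, not Flink's actual classes, and falling back to a row count of 1000 when neither source has statistics is an assumption borrowed from the default in case 1.

```scala
// Hypothetical stand-in for table statistics; not Flink's TableStats/FlinkStatistic.
final case class Stats(rowCount: Long)

object StatsResolver {
  // Assumed default row count when no statistics are available (see case 1).
  val DefaultRowCount: Double = 1000.0

  /** Tries the preferred statistics source first and falls back to the
    * other source if the preferred one has no statistics. */
  def resolve(
      catalogStats: Option[Stats],
      sourceStats: Option[Stats],
      preferCatalogStats: Boolean): Option[Stats] =
    if (preferCatalogStats) catalogStats.orElse(sourceStats)
    else sourceStats.orElse(catalogStats)

  /** Row count from the resolved statistics, or the assumed default
    * if neither source provides any. */
  def estimateRowCount(
      catalogStats: Option[Stats],
      sourceStats: Option[Stats],
      preferCatalogStats: Boolean): Double =
    resolve(catalogStats, sourceStats, preferCatalogStats)
      .map(_.rowCount.toDouble)
      .getOrElse(DefaultRowCount)
}
```

With this shape, cases 2 and 4 (no catalog statistics) fall through to the `TableSource` statistics regardless of the flag, and the flag only matters in case 3, where both sources have statistics.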
Looking forward to your advice, many thanks!
> using real row count instead of dummy row count when optimizing plan
> --------------------------------------------------------------------
>
> Key: FLINK-6516
> URL: https://issues.apache.org/jira/browse/FLINK-6516
> Project: Flink
> Issue Type: Improvement
> Components: Table API & SQL
> Reporter: godfrey he
> Assignee: godfrey he
>
> Currently, the statistic of {{TableSourceTable}} is mostly {{UNKNOWN}}, and the statistic from {{ExternalCatalog}} may also be null. Actually, only each {{TableSource}} knows its statistics exactly, especially for {{FilterableTableSource}} and {{PartitionableTableSource}}. So we can add a {{getTableStats}} method to {{TableSource}} and use it in TableSourceScan's estimateRowCount method to get the real row count.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)