Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/01/04 09:11:58 UTC

[jira] [Commented] (FLINK-5394) the estimateRowCount method of DataSetCalc didn't work

    [ https://issues.apache.org/jira/browse/FLINK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15797712#comment-15797712 ] 

ASF GitHub Bot commented on FLINK-5394:
---------------------------------------

GitHub user beyond1920 opened a pull request:

    https://github.com/apache/flink/pull/3058

    [FLINK-5394] [Table API & SQL]the estimateRowCount method of DataSetCalc didn't work

    This PR fixes the bug reported in https://issues.apache.org/jira/browse/FLINK-5394.
    The main changes are:
    1. Add FlinkRelMdRowCount and FlinkDefaultRelMetadataProvider to override getRowCount for some Flink RelNodes.
    2. Add a getRowCount method in DataSetSort to provide a more accurate estimate.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/alibaba/flink flink-5394

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3058.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3058
    
----
commit 8099920fb8759ed1068e7b8153816a7b63089e45
Author: beyond1920 <be...@126.com>
Date:   2016-12-29T07:52:17Z

    the estimateRowCount method of DataSetCalc didn't work now, fix it

----


> the estimateRowCount method of DataSetCalc didn't work
> ------------------------------------------------------
>
>                 Key: FLINK-5394
>                 URL: https://issues.apache.org/jira/browse/FLINK-5394
>             Project: Flink
>          Issue Type: Bug
>          Components: Table API & SQL
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> The estimateRowCount method of DataSetCalc currently has no effect.
> If I run the following code,
> `
> Table table = tableEnv
> 				.fromDataSet(data, "a, b, c")
> 				.groupBy("a")
> 				.select("a, a.avg, b.sum, c.count")
> 				.where("a == 1");
> `
> the cost of every node in the optimized node tree is:
> `
> DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 cpu, 28000.0 io}
>   DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
>       DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
> `
> We expect the input rowcount of DataSetAggregate to be less than 1000, but the actual input rowcount is still 1000 because the estimateRowCount method of DataSetCalc is never called.
> There are two reasons for this:
> 1. No custom metadata provider is registered yet, so when DataSetAggregate calls RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input rowcount, the call is dispatched to RelMdRowCount.
> 2. DataSetCalc is a subclass of SingleRel, so that call matches getRowCount(SingleRel rel, RelMetadataQuery mq), which never uses DataSetCalc.estimateRowCount.
> The same problem affects all Flink RelNodes that are subclasses of SingleRel.
> I plan to resolve this problem by adding a FlinkRelMdRowCount that contains specific getRowCount methods for Flink RelNodes.
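
The dispatch problem described in the issue can be illustrated with a small self-contained sketch. It does not use the real Calcite or Flink classes: RelNode, SingleRel, Scan, Calc, DefaultHandler, and NodeAwareHandler are hypothetical stand-ins, and the selectivity of 0.5 is an arbitrary value chosen for illustration. The default handler treats every single-input node as a pass-through, so the node's own estimate is never consulted, while the node-aware handler adds a more specific case that delegates to it.

```java
// Self-contained sketch; every class here is a hypothetical stand-in,
// not the real Calcite or Flink API.
class RowCountDispatchSketch {

    // Assumed selectivity of a filter condition, chosen only for illustration.
    static final double ASSUMED_SELECTIVITY = 0.5;

    /** Stand-in for a relational node that can estimate its own row count. */
    static abstract class RelNode {
        abstract double estimateRowCount(DefaultHandler mq);
    }

    /** Stand-in for a node with exactly one input. */
    static abstract class SingleRel extends RelNode {
        final RelNode input;
        SingleRel(RelNode input) { this.input = input; }
    }

    /** Leaf node: a scan with a fixed row count of 1000. */
    static class Scan extends RelNode {
        @Override double estimateRowCount(DefaultHandler mq) { return 1000.0; }
    }

    /** Filtering node whose own estimate scales the input row count. */
    static class Calc extends SingleRel {
        Calc(RelNode input) { super(input); }
        @Override double estimateRowCount(DefaultHandler mq) {
            return mq.getRowCount(input) * ASSUMED_SELECTIVITY;
        }
    }

    /** Generic handler: any single-input node just forwards its input's count. */
    static class DefaultHandler {
        double getRowCount(RelNode rel) {
            if (rel instanceof SingleRel) {
                // The node's own estimateRowCount is skipped on this path.
                return getRowCount(((SingleRel) rel).input);
            }
            return rel.estimateRowCount(this);
        }
    }

    /** Handler with a more specific case that delegates to the node itself. */
    static class NodeAwareHandler extends DefaultHandler {
        @Override double getRowCount(RelNode rel) {
            if (rel instanceof Calc) {
                return rel.estimateRowCount(this);
            }
            return super.getRowCount(rel);
        }
    }

    /** Estimates the row count of a Calc over a Scan with the given handler. */
    static double estimate(DefaultHandler handler) {
        RelNode plan = new Calc(new Scan());
        return handler.getRowCount(plan);
    }

    public static void main(String[] args) {
        System.out.println(estimate(new DefaultHandler()));   // prints 1000.0
        System.out.println(estimate(new NodeAwareHandler())); // prints 500.0
    }
}
```

With the generic handler the filter node reports the same 1000 rows as its input, which mirrors the plan output quoted in the issue. Only the handler that knows about the specific node type returns the reduced estimate.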



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)