You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by godfrey he <go...@gmail.com> on 2022/05/13 10:56:20 UTC

[DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Hi all,

I would like to open a discussion on FLIP-231:  Introduce SupportStatisticReport
to support reporting statistics from source connectors.

Statistics are one of the most important inputs to the optimizer.
Accurate and complete statistics allows the optimizer to be more powerful.
Currently, the statistics of Flink SQL come from Catalog only,
while many Connectors have the ability to provide statistics, e.g. FileSystem.
In production, we find many tables in Catalog do not have any statistics.
As a result, the optimizer can't generate better execution plans,
especially for Batch jobs.

There are two approaches to enhance statistics for the planner,
one is to introduce the "ANALYZE TABLE" syntax which will write
the analyzed result to the catalog, another is to introduce a new
connector interface
which allows the connector itself to report statistics directly to the planner.
The second one is a supplement to the catalog statistics.

Here, we will discuss the second approach. Compared to the first one,
the second one is to get statistics in real time, no need to run an
analysis job for each table. This could help improve the user
experience.
(We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)

You can find more details in FLIP-231 document[1]. Looking forward to
your feedback.

[1] https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
[2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231


Best,
Godfrey

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by godfrey he <go...@gmail.com>.
Hi, Jark

Thanks for the feedback.

> 1) All the ability interfaces begin with "Supports" instead of "Support".
+1

> The "connect" word should be "collect"?
Yes, it's a typo.

> CatalogStatistics
Yes, we should use TableStats.
I forgot that TableStats and ColumnStats have been ported to the API module.

> What's the difference between them?
table.optimizer.source.collect-statistics-enabled is used for all collectors,
while source.statistics-type is file base connectors.
It may take a long time to get the detailed statistics,
but may be the file size (will be introduced later) is enough.

> IMO, we should also support Hive source as well in this FLIP.
+1

Best,
Godfrey

Jark Wu <im...@gmail.com> 于2022年5月20日周五 12:04写道:
>
> Hi Godfrey,
>
> I just left some comments here:
>
> 1) SupportStatisticReport => SupportsStatisticReport
> All the ability interfaces begin with "Supports" instead of "Support".
>
> 2) table.optimizer.source.connect-statistics-enabled
> The "connect" word should be "collect"?
>
> 3) CatalogStatistics
> I was a little confused when I first saw the name. I thought it reports
> stats for a catalog...
> Why not use "TableStats" which already wraps "ColumnStats" in it and is a
> public API as well?
>
> 4) source.statistics-type
> vs table.optimizer.source.collect-statistics-enabled
> What's the difference between them? It seems that they are both used to
> enable or disable reporting stats.
>
> 5) "Which connectors and formats will be supported by default?"
> IMO, we should also support Hive source as well in this FLIP.
> Hive source is more widely used than Filesystem connector.
>
> Best,
> Jark
>
>
>
>
> On Tue, 17 May 2022 at 10:52, Jingsong Li <ji...@gmail.com> wrote:
>
> > Hi Godfrey,
> >
> > Thanks for your reply.
> >
> > Sounds good to me.
> >
> > > I think we should also introduce a config option
> >
> > We can add this option to the FLIP. I prefer a option for
> > FileSystemConnector, maybe a enum.
> >
> > Best,
> > Jingsong
> >
> > On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com> wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for the feedback.
> > >
> > >
> > > >One concern I have is that we read the footer for each file, and this
> > may
> > > >be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way
> > > yes, if there are thousands of orc/parquet files, it may take a long
> > time.
> > > So we can introduce a config option to let the user choose the
> > > granularity of the statistics.
> > > But the SIZE will not be introduced, because the planner does not use
> > > the file size statistics now.
> > > We can introduce once file size statistics is introduce in the future.
> > > I think we should also introduce a config option to enable/disable
> > > SupportStatisticReport,
> > > because it's a heavy operation for some connectors in some cases.
> > >
> > > > is the filter pushdown already happening at
> > > > this time?
> > > That's a good point. Currently, the filter push down is after partition
> > > pruning
> > > to prevent the filter push down rule from consuming the partition
> > > predicates.
> > > The statistics will be set to unknown if filter is pushed down now.
> > > To combine them all, we can create an optimization program after filter
> > > push
> > > down program to collect the statistics. This could avoid collecting
> > > statistics multiple times.
> > >
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > >
> > > > Thank Godfrey for driving.
> > > >
> > > > Looks very good~ This will undoubtedly greatly enhance the various
> > batch
> > > > mode connectors.
> > > >
> > > > I left some comments:
> > > >
> > > > ## FileBasedStatisticsReportableDecodingFormat
> > > >
> > > > One concern I have is that we read the footer for each file, and this
> > may
> > > > be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way, e.g.
> > > > - No statistics are collected for files by default.
> > > > - SIZE: Generate statistics based on file Size, get the size of the
> > file
> > > > only with access to the master of the FileSystem.
> > > > - DETAILED: Get the complete statistics by format, possibly by
> > accessing
> > > > the footer of the file.
> > > >
> > > > ## When use the statistics reported by connector
> > > >
> > > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule,
> > the
> > > > statistics should also be updated.
> > > >
> > > > I understand that we definitely need to use reporter after the
> > partition
> > > > prune, but another question: is the filter pushdown already happening
> > at
> > > > this time?
> > > > Can we make sure that in the following three cases, both the filter
> > > > pushdown and the partition prune happen before the stats reporting.
> > > > - only partition prune happens
> > > > - only filter pushdown happens
> > > > - both filter pushdown and partition prune happen
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > SupportStatisticReport
> > > > > to support reporting statistics from source connectors.
> > > > >
> > > > > Statistics are one of the most important inputs to the optimizer.
> > > > > Accurate and complete statistics allows the optimizer to be more
> > > powerful.
> > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > while many Connectors have the ability to provide statistics, e.g.
> > > > > FileSystem.
> > > > > In production, we find many tables in Catalog do not have any
> > > statistics.
> > > > > As a result, the optimizer can't generate better execution plans,
> > > > > especially for Batch jobs.
> > > > >
> > > > > There are two approaches to enhance statistics for the planner,
> > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > the analyzed result to the catalog, another is to introduce a new
> > > > > connector interface
> > > > > which allows the connector itself to report statistics directly to
> > the
> > > > > planner.
> > > > > The second one is a supplement to the catalog statistics.
> > > > >
> > > > > Here, we will discuss the second approach. Compared to the first one,
> > > > > the second one is to get statistics in real time, no need to run an
> > > > > analysis job for each table. This could help improve the user
> > > > > experience.
> > > > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> > > > >
> > > > > You can find more details in FLIP-231 document[1]. Looking forward to
> > > > > your feedback.
> > > > >
> > > > > [1]
> > > > >
> > >
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > >
> > > > >
> > > > > Best,
> > > > > Godfrey
> > > > >
> > >
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Jark Wu <im...@gmail.com>.
Hi Godfrey,

I just left some comments here:

1) SupportStatisticReport => SupportsStatisticReport
All the ability interfaces begin with "Supports" instead of "Support".

2) table.optimizer.source.connect-statistics-enabled
The "connect" word should be "collect"?

3) CatalogStatistics
I was a little confused when I first saw the name. I thought it reports
stats for a catalog...
Why not use "TableStats" which already wraps "ColumnStats" in it and is a
public API as well?

4) source.statistics-type
vs table.optimizer.source.collect-statistics-enabled
What's the difference between them? It seems that they are both used to
enable or disable reporting stats.

5) "Which connectors and formats will be supported by default?"
IMO, we should also support Hive source as well in this FLIP.
Hive source is more widely used than Filesystem connector.

Best,
Jark




On Tue, 17 May 2022 at 10:52, Jingsong Li <ji...@gmail.com> wrote:

> Hi Godfrey,
>
> Thanks for your reply.
>
> Sounds good to me.
>
> > I think we should also introduce a config option
>
> We can add this option to the FLIP. I prefer a option for
> FileSystemConnector, maybe a enum.
>
> Best,
> Jingsong
>
> On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com> wrote:
>
> > Hi Jingsong,
> >
> > Thanks for the feedback.
> >
> >
> > >One concern I have is that we read the footer for each file, and this
> may
> > >be a bit costly in some cases. Is it possible for us to have some
> > > hierarchical way
> > yes, if there are thousands of orc/parquet files, it may take a long
> time.
> > So we can introduce a config option to let the user choose the
> > granularity of the statistics.
> > But the SIZE will not be introduced, because the planner does not use
> > the file size statistics now.
> > We can introduce once file size statistics is introduce in the future.
> > I think we should also introduce a config option to enable/disable
> > SupportStatisticReport,
> > because it's a heavy operation for some connectors in some cases.
> >
> > > is the filter pushdown already happening at
> > > this time?
> > That's a good point. Currently, the filter push down is after partition
> > pruning
> > to prevent the filter push down rule from consuming the partition
> > predicates.
> > The statistics will be set to unknown if filter is pushed down now.
> > To combine them all, we can create an optimization program after filter
> > push
> > down program to collect the statistics. This could avoid collecting
> > statistics multiple times.
> >
> >
> > Best,
> > Godfrey
> >
> > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > >
> > > Thank Godfrey for driving.
> > >
> > > Looks very good~ This will undoubtedly greatly enhance the various
> batch
> > > mode connectors.
> > >
> > > I left some comments:
> > >
> > > ## FileBasedStatisticsReportableDecodingFormat
> > >
> > > One concern I have is that we read the footer for each file, and this
> may
> > > be a bit costly in some cases. Is it possible for us to have some
> > > hierarchical way, e.g.
> > > - No statistics are collected for files by default.
> > > - SIZE: Generate statistics based on file Size, get the size of the
> file
> > > only with access to the master of the FileSystem.
> > > - DETAILED: Get the complete statistics by format, possibly by
> accessing
> > > the footer of the file.
> > >
> > > ## When use the statistics reported by connector
> > >
> > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule,
> the
> > > statistics should also be updated.
> > >
> > > I understand that we definitely need to use reporter after the
> partition
> > > prune, but another question: is the filter pushdown already happening
> at
> > > this time?
> > > Can we make sure that in the following three cases, both the filter
> > > pushdown and the partition prune happen before the stats reporting.
> > > - only partition prune happens
> > > - only filter pushdown happens
> > > - both filter pushdown and partition prune happen
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > SupportStatisticReport
> > > > to support reporting statistics from source connectors.
> > > >
> > > > Statistics are one of the most important inputs to the optimizer.
> > > > Accurate and complete statistics allows the optimizer to be more
> > powerful.
> > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > while many Connectors have the ability to provide statistics, e.g.
> > > > FileSystem.
> > > > In production, we find many tables in Catalog do not have any
> > statistics.
> > > > As a result, the optimizer can't generate better execution plans,
> > > > especially for Batch jobs.
> > > >
> > > > There are two approaches to enhance statistics for the planner,
> > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > the analyzed result to the catalog, another is to introduce a new
> > > > connector interface
> > > > which allows the connector itself to report statistics directly to
> the
> > > > planner.
> > > > The second one is a supplement to the catalog statistics.
> > > >
> > > > Here, we will discuss the second approach. Compared to the first one,
> > > > the second one is to get statistics in real time, no need to run an
> > > > analysis job for each table. This could help improve the user
> > > > experience.
> > > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> > > >
> > > > You can find more details in FLIP-231 document[1]. Looking forward to
> > > > your feedback.
> > > >
> > > > [1]
> > > >
> >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > >
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> >
>

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by godfrey he <go...@gmail.com>.
Hi, everyone.

Thanks for all the inputs.
Since there is no feedback any more, I will start the vote tomorrow.

Best,
Godfrey

godfrey he <go...@gmail.com> 于2022年6月1日周三 13:30写道:
>
> Hi, Jing.
>
> Thanks for the suggestion, I have updated the doc and
> will continue to optimize the code in subsequent PR.
>
> Best,
> Godfrey
>
> Jing Ge <ji...@ververica.com> 于2022年6月1日周三 04:41写道:
> >
> > Hi Godfrey,
> >
> > Thanks for clarifying it. I personally prefer the new change you suggested.
> >
> > Would you please help to understand one more thing? The else if
> > (filterPushDownSpec != null) branch is the only branch that doesn't have to
> > check if the newTableStat has been calculated previously. The reason might
> > be, after the filter has been pushed down to the table source,
> > ((SupportStatisticReport) tableSource).reportStatistics() will return a new
> > TableStats, which turns out the newTableStat has to be re-computed for each
> > filter push down. In this case, it might be good to add it into the FLIP
> > description. Otherwise, We could also add optimization for this branch to
> > avoid re-computing the table statistics.
> >
> > NIT: There are many conditions in the nested if-else statements. In order
> > to improve the readability (and maintainability in the future), we could
> > consider moving up some common checks like collectStatEnabled or
> > tableSource instanceof SupportStatisticReport, e.g.:
> >
> >  private LogicalTableScan collectStatistics(LogicalTableScan scan) {
> >       ......
> >      FlinkStatistic newStatistic = FlinkStatistic.builder()
> > .statistic(table.getStatistic()) .tableStats(refreshTableStat(...))
> > .build();
> >      return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(),
> > scan.getHints(), table.copy(newStatistic));
> > }
> >
> > private TableStats refreshTableStat(boolean collectStatEnabled,
> > TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec,
> > FilterPushDownSpec filterPushDownSpec) {
> >
> >      if (!collectStatEnabled)   return null;
> >
> >      if (!(table.tableSource() instanceof SupportStatisticReport)) return
> > null;
> >
> >      SupportStatisticReport tableSource =
> > (SupportStatisticReport)table.tableSource();
> >
> >       if (filterPushDownSpec != null) {
> >              // filter push down, no matter  partition push down or not
> >              return  tableSource.reportStatistics();
> >      } else {
> >          if (partitionPushDownSpec != null) {
> >         // partition push down, while filter not
> >          } else {
> >          // no partition and filter push down
> >              return table.getStatistic().getTableStats() ==
> > TableStats.UNKNOWN ? tableSource.reportStatistics() :
> > table.getStatistic().getTableStats();
> >         }
> >      }
> > }
> >
> > This just improved a little bit without introducing some kind of
> > Action/Performer interface with many subclasses and factory class to get
> > rid of some if-else statements, which could optionally be the next step
> > provement in the future.
> >
> > Best regards,
> > Jing
> >
> > On Tue, May 31, 2022 at 3:42 PM godfrey he <go...@gmail.com> wrote:
> >
> > > Hi Jark and Jing,
> > >
> > > +1 to use "report" instead of "collect".
> > >
> > > >  // only filter push down (*the description means
> > > > partitionPushDownSpec == null but misses the case of
> > > > partitionPushDownSpec != null*)
> > >
> > > `if (partitionPushDownSpec != null && filterPushDownSpec == null)`
> > > this branch is only consider that the partition is partition is pushed
> > > down,
> > > but no filter is push down. The planner will collect the statistics
> > > from catalog first,
> > > and then try to collect the statistics from collectors if the catalog
> > > statistics is unknown.
> > >
> > > `else if (filterPushDownSpec != null)` this branch means  whether
> > > the partitionPushDownSpec is null or not, the planner will collect
> > > statistics from
> > > collectors only, because the catalog do not support get statistics with
> > > filters.
> > >
> > > `else if (collectStatEnabled
> > >                 && (table.getStatistic().getTableStats() ==
> > > TableStats.UNKNOWN)
> > >                 && tableSource instanceof SupportStatisticReport)`
> > > this branch means no partition and no filter are pushed down.
> > >
> > > or we can change the pseudocode to:
> > >  if (filterPushDownSpec != null) {
> > >    // filter push down, no mater  partition push down or not
> > > } else {
> > >     if (partitionPushDownSpec != null) {
> > >         // partition push down, while filter not
> > >      } else {
> > >          // no partition and filter push down
> > >      }
> > > }
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jing Ge <ji...@ververica.com> 于2022年5月29日周日 08:17写道:
> > > >
> > > > Hi Godfrey,
> > > >
> > > > Thanks for driving this FLIP.  It looks really good! Looking forward to
> > > it!
> > > >
> > > > If I am not mistaken, partition pruning could also happen in the
> > > following
> > > > pseudocode condition block:
> > > >
> > > > else if (filterPushDownSpec != null) {
> > > >             // only filter push down (*the description means
> > > > partitionPushDownSpec == null but misses the case of
> > > > partitionPushDownSpec != null*)
> > > >
> > > >             // the catalog do not support get statistics with filters,
> > > >             // so only call reportStatistics method if needed
> > > >             if (collectStatEnabled && tableSource instanceof
> > > > SupportStatisticReport) {
> > > >                 newTableStat = ((SupportStatisticReport)
> > > > tableSource).reportStatistics();
> > > >             }
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Jing
> > > >
> > > >
> > > > On Sat, May 28, 2022 at 5:09 PM Jark Wu <im...@gmail.com> wrote:
> > > >
> > > > > Hi Godfrey,
> > > > >
> > > > > It seems that the "SupportStatisticReport" interface name and
> > > > >  "table.optimizer.source.connect-statistics-enabled" option name is not
> > > > > updated in the FLIP.
> > > > >
> > > > > Besides, in the terms of the option name, the meaning of
> > > > > "source.statistics-type"
> > > > > is not very straightforward and clean to me. Maybe
> > > > > "source.report-statistics" = "none/all/file-size"
> > > > > would be better.
> > > > >
> > > > > We can also change "table.optimizer.source.connect-statistics-enabled"
> > > to
> > > > > "table.optimizer.source.report-statistics-enabled" for alignment and
> > > > >  it's clear that one for fine-grained and one for coarse-grained.
> > > > >
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Fri, 27 May 2022 at 22:58, godfrey he <go...@gmail.com> wrote:
> > > > >
> > > > > > Hi, everyone.
> > > > > >
> > > > > > Thanks for all the inputs.
> > > > > > If there is no more feedback, I think we can start the vote next
> > > monday.
> > > > > >
> > > > > > Best,
> > > > > > Godfrey
> > > > > >
> > > > > > Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
> > > > > > >
> > > > > > > Hi Godfrey,
> > > > > > >
> > > > > > > Thanks for creating the FLIP. I have no comments.
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Martijn
> > > > > > >
> > > > > > >
> > > > > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Godfrey,
> > > > > > > >
> > > > > > > > Thanks for your reply.
> > > > > > > >
> > > > > > > > Sounds good to me.
> > > > > > > >
> > > > > > > > > I think we should also introduce a config option
> > > > > > > >
> > > > > > > > We can add this option to the FLIP. I prefer a option for
> > > > > > > > FileSystemConnector, maybe a enum.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jingsong
> > > > > > > >
> > > > > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfreyhe@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jingsong,
> > > > > > > > >
> > > > > > > > > Thanks for the feedback.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >One concern I have is that we read the footer for each file,
> > > and
> > > > > > this
> > > > > > > > may
> > > > > > > > > >be a bit costly in some cases. Is it possible for us to have
> > > some
> > > > > > > > > > hierarchical way
> > > > > > > > > yes, if there are thousands of orc/parquet files, it may take a
> > > > > long
> > > > > > > > time.
> > > > > > > > > So we can introduce a config option to let the user choose the
> > > > > > > > > granularity of the statistics.
> > > > > > > > > But the SIZE will not be introduced, because the planner does
> > > not
> > > > > use
> > > > > > > > > the file size statistics now.
> > > > > > > > > We can introduce once file size statistics is introduce in the
> > > > > > future.
> > > > > > > > > I think we should also introduce a config option to
> > > enable/disable
> > > > > > > > > SupportStatisticReport,
> > > > > > > > > because it's a heavy operation for some connectors in some
> > > cases.
> > > > > > > > >
> > > > > > > > > > is the filter pushdown already happening at
> > > > > > > > > > this time?
> > > > > > > > > That's a good point. Currently, the filter push down is after
> > > > > > partition
> > > > > > > > > pruning
> > > > > > > > > to prevent the filter push down rule from consuming the
> > > partition
> > > > > > > > > predicates.
> > > > > > > > > The statistics will be set to unknown if filter is pushed down
> > > now.
> > > > > > > > > To combine them all, we can create an optimization program
> > > after
> > > > > > filter
> > > > > > > > > push
> > > > > > > > > down program to collect the statistics. This could avoid
> > > collecting
> > > > > > > > > statistics multiple times.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Godfrey
> > > > > > > > >
> > > > > > > > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > > > > > > >
> > > > > > > > > > Thank Godfrey for driving.
> > > > > > > > > >
> > > > > > > > > > Looks very good~ This will undoubtedly greatly enhance the
> > > > > various
> > > > > > > > batch
> > > > > > > > > > mode connectors.
> > > > > > > > > >
> > > > > > > > > > I left some comments:
> > > > > > > > > >
> > > > > > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > > > > > >
> > > > > > > > > > One concern I have is that we read the footer for each file,
> > > and
> > > > > > this
> > > > > > > > may
> > > > > > > > > > be a bit costly in some cases. Is it possible for us to have
> > > some
> > > > > > > > > > hierarchical way, e.g.
> > > > > > > > > > - No statistics are collected for files by default.
> > > > > > > > > > - SIZE: Generate statistics based on file Size, get the size
> > > of
> > > > > the
> > > > > > > > file
> > > > > > > > > > only with access to the master of the FileSystem.
> > > > > > > > > > - DETAILED: Get the complete statistics by format, possibly
> > > by
> > > > > > > > accessing
> > > > > > > > > > the footer of the file.
> > > > > > > > > >
> > > > > > > > > > ## When use the statistics reported by connector
> > > > > > > > > >
> > > > > > > > > > > When partitions are pruned by
> > > > > > PushPartitionIntoTableSourceScanRule,
> > > > > > > > the
> > > > > > > > > > statistics should also be updated.
> > > > > > > > > >
> > > > > > > > > > I understand that we definitely need to use reporter after
> > > the
> > > > > > > > partition
> > > > > > > > > > prune, but another question: is the filter pushdown already
> > > > > > happening
> > > > > > > > at
> > > > > > > > > > this time?
> > > > > > > > > > Can we make sure that in the following three cases, both the
> > > > > filter
> > > > > > > > > > pushdown and the partition prune happen before the stats
> > > > > reporting.
> > > > > > > > > > - only partition prune happens
> > > > > > > > > > - only filter pushdown happens
> > > > > > > > > > - both filter pushdown and partition prune happen
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jingsong
> > > > > > > > > >
> > > > > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <
> > > godfreyhe@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > > > > > SupportStatisticReport
> > > > > > > > > > > to support reporting statistics from source connectors.
> > > > > > > > > > >
> > > > > > > > > > > Statistics are one of the most important inputs to the
> > > > > optimizer.
> > > > > > > > > > > Accurate and complete statistics allows the optimizer to be
> > > > > more
> > > > > > > > > powerful.
> > > > > > > > > > > Currently, the statistics of Flink SQL come from Catalog
> > > only,
> > > > > > > > > > > while many Connectors have the ability to provide
> > > statistics,
> > > > > > e.g.
> > > > > > > > > > > FileSystem.
> > > > > > > > > > > In production, we find many tables in Catalog do not have
> > > any
> > > > > > > > > statistics.
> > > > > > > > > > > As a result, the optimizer can't generate better execution
> > > > > plans,
> > > > > > > > > > > especially for Batch jobs.
> > > > > > > > > > >
> > > > > > > > > > > There are two approaches to enhance statistics for the
> > > planner,
> > > > > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will
> > > write
> > > > > > > > > > > the analyzed result to the catalog, another is to
> > > introduce a
> > > > > new
> > > > > > > > > > > connector interface
> > > > > > > > > > > which allows the connector itself to report statistics
> > > directly
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > planner.
> > > > > > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > > > > > >
> > > > > > > > > > > Here, we will discuss the second approach. Compared to the
> > > > > first
> > > > > > one,
> > > > > > > > > > > the second one is to get statistics in real time, no need
> > > to
> > > > > run
> > > > > > an
> > > > > > > > > > > analysis job for each table. This could help improve the
> > > user
> > > > > > > > > > > experience.
> > > > > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> > > > > > FLIP.)
> > > > > > > > > > >
> > > > > > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > > > > > forward to
> > > > > > > > > > > your feedback.
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Godfrey
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by godfrey he <go...@gmail.com>.
Hi, Jing.

Thanks for the suggestion, I have updated the doc and
will continue to optimize the code in subsequent PR.

Best,
Godfrey

Jing Ge <ji...@ververica.com> 于2022年6月1日周三 04:41写道:
>
> Hi Godfrey,
>
> Thanks for clarifying it. I personally prefer the new change you suggested.
>
> Would you please help to understand one more thing? The else if
> (filterPushDownSpec != null) branch is the only branch that doesn't have to
> check if the newTableStat has been calculated previously. The reason might
> be, after the filter has been pushed down to the table source,
> ((SupportStatisticReport) tableSource).reportStatistics() will return a new
> TableStats, which turns out the newTableStat has to be re-computed for each
> filter push down. In this case, it might be good to add it into the FLIP
> description. Otherwise, We could also add optimization for this branch to
> avoid re-computing the table statistics.
>
> NIT: There are many conditions in the nested if-else statements. In order
> to improve the readability (and maintainability in the future), we could
> consider moving up some common checks like collectStatEnabled or
> tableSource instanceof SupportStatisticReport, e.g.:
>
>  private LogicalTableScan collectStatistics(LogicalTableScan scan) {
>       ......
>      FlinkStatistic newStatistic = FlinkStatistic.builder()
> .statistic(table.getStatistic()) .tableStats(refreshTableStat(...))
> .build();
>      return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(),
> scan.getHints(), table.copy(newStatistic));
> }
>
> private TableStats refreshTableStat(boolean collectStatEnabled,
> TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec,
> FilterPushDownSpec filterPushDownSpec) {
>
>      if (!collectStatEnabled)   return null;
>
>      if (!(table.tableSource() instanceof SupportStatisticReport)) return
> null;
>
>      SupportStatisticReport tableSource =
> (SupportStatisticReport)table.tableSource();
>
>       if (filterPushDownSpec != null) {
>              // filter push down, no matter  partition push down or not
>              return  tableSource.reportStatistics();
>      } else {
>          if (partitionPushDownSpec != null) {
>         // partition push down, while filter not
>          } else {
>          // no partition and filter push down
>              return table.getStatistic().getTableStats() ==
> TableStats.UNKNOWN ? tableSource.reportStatistics() :
> table.getStatistic().getTableStats();
>         }
>      }
> }
>
> This just improved a little bit without introducing some kind of
> Action/Performer interface with many subclasses and factory class to get
> rid of some if-else statements, which could optionally be the next step
> provement in the future.
>
> Best regards,
> Jing
>
> On Tue, May 31, 2022 at 3:42 PM godfrey he <go...@gmail.com> wrote:
>
> > Hi Jark and Jing,
> >
> > +1 to use "report" instead of "collect".
> >
> > >  // only filter push down (*the description means
> > > partitionPushDownSpec == null but misses the case of
> > > partitionPushDownSpec != null*)
> >
> > `if (partitionPushDownSpec != null && filterPushDownSpec == null)`
> > this branch is only consider that the partition is partition is pushed
> > down,
> > but no filter is push down. The planner will collect the statistics
> > from catalog first,
> > and then try to collect the statistics from collectors if the catalog
> > statistics is unknown.
> >
> > `else if (filterPushDownSpec != null)` this branch means  whether
> > the partitionPushDownSpec is null or not, the planner will collect
> > statistics from
> > collectors only, because the catalog do not support get statistics with
> > filters.
> >
> > `else if (collectStatEnabled
> >                 && (table.getStatistic().getTableStats() ==
> > TableStats.UNKNOWN)
> >                 && tableSource instanceof SupportStatisticReport)`
> > this branch means no partition and no filter are pushed down.
> >
> > or we can change the pseudocode to:
> >  if (filterPushDownSpec != null) {
> >    // filter push down, no mater  partition push down or not
> > } else {
> >     if (partitionPushDownSpec != null) {
> >         // partition push down, while filter not
> >      } else {
> >          // no partition and filter push down
> >      }
> > }
> >
> > Best,
> > Godfrey
> >
> > Jing Ge <ji...@ververica.com> 于2022年5月29日周日 08:17写道:
> > >
> > > Hi Godfrey,
> > >
> > > Thanks for driving this FLIP.  It looks really good! Looking forward to
> > it!
> > >
> > > If I am not mistaken, partition pruning could also happen in the
> > following
> > > pseudocode condition block:
> > >
> > > else if (filterPushDownSpec != null) {
> > >             // only filter push down (*the description means
> > > partitionPushDownSpec == null but misses the case of
> > > partitionPushDownSpec != null*)
> > >
> > >             // the catalog do not support get statistics with filters,
> > >             // so only call reportStatistics method if needed
> > >             if (collectStatEnabled && tableSource instanceof
> > > SupportStatisticReport) {
> > >                 newTableStat = ((SupportStatisticReport)
> > > tableSource).reportStatistics();
> > >             }
> > >
> > >
> > > Best regards,
> > >
> > > Jing
> > >
> > >
> > > On Sat, May 28, 2022 at 5:09 PM Jark Wu <im...@gmail.com> wrote:
> > >
> > > > Hi Godfrey,
> > > >
> > > > It seems that the "SupportStatisticReport" interface name and
> > > >  "table.optimizer.source.connect-statistics-enabled" option name is not
> > > > updated in the FLIP.
> > > >
> > > > Besides, in the terms of the option name, the meaning of
> > > > "source.statistics-type"
> > > > is not very straightforward and clean to me. Maybe
> > > > "source.report-statistics" = "none/all/file-size"
> > > > would be better.
> > > >
> > > > We can also change "table.optimizer.source.connect-statistics-enabled"
> > to
> > > > "table.optimizer.source.report-statistics-enabled" for alignment and
> > > >  it's clear that one for fine-grained and one for coarse-grained.
> > > >
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Fri, 27 May 2022 at 22:58, godfrey he <go...@gmail.com> wrote:
> > > >
> > > > > Hi, everyone.
> > > > >
> > > > > Thanks for all the inputs.
> > > > > If there is no more feedback, I think we can start the vote next
> > monday.
> > > > >
> > > > > Best,
> > > > > Godfrey
> > > > >
> > > > > Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
> > > > > >
> > > > > > Hi Godfrey,
> > > > > >
> > > > > > Thanks for creating the FLIP. I have no comments.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martijn
> > > > > >
> > > > > >
> > > > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Godfrey,
> > > > > > >
> > > > > > > Thanks for your reply.
> > > > > > >
> > > > > > > Sounds good to me.
> > > > > > >
> > > > > > > > I think we should also introduce a config option
> > > > > > >
> > > > > > > We can add this option to the FLIP. I prefer a option for
> > > > > > > FileSystemConnector, maybe a enum.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jingsong
> > > > > > >
> > > > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfreyhe@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jingsong,
> > > > > > > >
> > > > > > > > Thanks for the feedback.
> > > > > > > >
> > > > > > > >
> > > > > > > > >One concern I have is that we read the footer for each file,
> > and
> > > > > this
> > > > > > > may
> > > > > > > > >be a bit costly in some cases. Is it possible for us to have
> > some
> > > > > > > > > hierarchical way
> > > > > > > > yes, if there are thousands of orc/parquet files, it may take a
> > > > long
> > > > > > > time.
> > > > > > > > So we can introduce a config option to let the user choose the
> > > > > > > > granularity of the statistics.
> > > > > > > > But the SIZE will not be introduced, because the planner does
> > not
> > > > use
> > > > > > > > the file size statistics now.
> > > > > > > > We can introduce once file size statistics is introduce in the
> > > > > future.
> > > > > > > > I think we should also introduce a config option to
> > enable/disable
> > > > > > > > SupportStatisticReport,
> > > > > > > > because it's a heavy operation for some connectors in some
> > cases.
> > > > > > > >
> > > > > > > > > is the filter pushdown already happening at
> > > > > > > > > this time?
> > > > > > > > That's a good point. Currently, the filter push down is after
> > > > > partition
> > > > > > > > pruning
> > > > > > > > to prevent the filter push down rule from consuming the
> > partition
> > > > > > > > predicates.
> > > > > > > > The statistics will be set to unknown if filter is pushed down
> > now.
> > > > > > > > To combine them all, we can create an optimization program
> > after
> > > > > filter
> > > > > > > > push
> > > > > > > > down program to collect the statistics. This could avoid
> > collecting
> > > > > > > > statistics multiple times.
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Godfrey
> > > > > > > >
> > > > > > > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > > > > > >
> > > > > > > > > Thank Godfrey for driving.
> > > > > > > > >
> > > > > > > > > Looks very good~ This will undoubtedly greatly enhance the
> > > > various
> > > > > > > batch
> > > > > > > > > mode connectors.
> > > > > > > > >
> > > > > > > > > I left some comments:
> > > > > > > > >
> > > > > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > > > > >
> > > > > > > > > One concern I have is that we read the footer for each file,
> > and
> > > > > this
> > > > > > > may
> > > > > > > > > be a bit costly in some cases. Is it possible for us to have
> > some
> > > > > > > > > hierarchical way, e.g.
> > > > > > > > > - No statistics are collected for files by default.
> > > > > > > > > - SIZE: Generate statistics based on file Size, get the size
> > of
> > > > the
> > > > > > > file
> > > > > > > > > only with access to the master of the FileSystem.
> > > > > > > > > - DETAILED: Get the complete statistics by format, possibly
> > by
> > > > > > > accessing
> > > > > > > > > the footer of the file.
> > > > > > > > >
> > > > > > > > > ## When use the statistics reported by connector
> > > > > > > > >
> > > > > > > > > > When partitions are pruned by
> > > > > PushPartitionIntoTableSourceScanRule,
> > > > > > > the
> > > > > > > > > statistics should also be updated.
> > > > > > > > >
> > > > > > > > > I understand that we definitely need to use reporter after
> > the
> > > > > > > partition
> > > > > > > > > prune, but another question: is the filter pushdown already
> > > > > happening
> > > > > > > at
> > > > > > > > > this time?
> > > > > > > > > Can we make sure that in the following three cases, both the
> > > > filter
> > > > > > > > > pushdown and the partition prune happen before the stats
> > > > reporting.
> > > > > > > > > - only partition prune happens
> > > > > > > > > - only filter pushdown happens
> > > > > > > > > - both filter pushdown and partition prune happen
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jingsong
> > > > > > > > >
> > > > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <
> > godfreyhe@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > > > > SupportStatisticReport
> > > > > > > > > > to support reporting statistics from source connectors.
> > > > > > > > > >
> > > > > > > > > > Statistics are one of the most important inputs to the
> > > > optimizer.
> > > > > > > > > > Accurate and complete statistics allows the optimizer to be
> > > > more
> > > > > > > > powerful.
> > > > > > > > > > Currently, the statistics of Flink SQL come from Catalog
> > only,
> > > > > > > > > > while many Connectors have the ability to provide
> > statistics,
> > > > > e.g.
> > > > > > > > > > FileSystem.
> > > > > > > > > > In production, we find many tables in Catalog do not have
> > any
> > > > > > > > statistics.
> > > > > > > > > > As a result, the optimizer can't generate better execution
> > > > plans,
> > > > > > > > > > especially for Batch jobs.
> > > > > > > > > >
> > > > > > > > > > There are two approaches to enhance statistics for the
> > planner,
> > > > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will
> > write
> > > > > > > > > > the analyzed result to the catalog, another is to
> > introduce a
> > > > new
> > > > > > > > > > connector interface
> > > > > > > > > > which allows the connector itself to report statistics
> > directly
> > > > > to
> > > > > > > the
> > > > > > > > > > planner.
> > > > > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > > > > >
> > > > > > > > > > Here, we will discuss the second approach. Compared to the
> > > > first
> > > > > one,
> > > > > > > > > > the second one is to get statistics in real time, no need
> > to
> > > > run
> > > > > an
> > > > > > > > > > analysis job for each table. This could help improve the
> > user
> > > > > > > > > > experience.
> > > > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> > > > > FLIP.)
> > > > > > > > > >
> > > > > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > > > > forward to
> > > > > > > > > > your feedback.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Godfrey
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Jing Ge <ji...@ververica.com>.
Hi Godfrey,

Thanks for clarifying it. I personally prefer the new change you suggested.

Would you please help to understand one more thing? The else if
(filterPushDownSpec != null) branch is the only branch that doesn't have to
check if the newTableStat has been calculated previously. The reason might
be, after the filter has been pushed down to the table source,
((SupportStatisticReport) tableSource).reportStatistics() will return a new
TableStats, which turns out the newTableStat has to be re-computed for each
filter push down. In this case, it might be good to add it into the FLIP
description. Otherwise, We could also add optimization for this branch to
avoid re-computing the table statistics.

NIT: There are many conditions in the nested if-else statements. In order
to improve the readability (and maintainability in the future), we could
consider moving up some common checks like collectStatEnabled or
tableSource instanceof SupportStatisticReport, e.g.:

 private LogicalTableScan collectStatistics(LogicalTableScan scan) {
      ......
     FlinkStatistic newStatistic = FlinkStatistic.builder()
.statistic(table.getStatistic()) .tableStats(refreshTableStat(...))
.build();
     return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(),
scan.getHints(), table.copy(newStatistic));
}

private TableStats refreshTableStat(boolean collectStatEnabled,
TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec,
FilterPushDownSpec filterPushDownSpec) {

     if (!collectStatEnabled)   return null;

     if (!(table.tableSource() instanceof SupportStatisticReport)) return
null;

     SupportStatisticReport tableSource =
(SupportStatisticReport)table.tableSource();

      if (filterPushDownSpec != null) {
             // filter push down, no matter  partition push down or not
             return  tableSource.reportStatistics();
     } else {
         if (partitionPushDownSpec != null) {
        // partition push down, while filter not
         } else {
         // no partition and filter push down
             return table.getStatistic().getTableStats() ==
TableStats.UNKNOWN ? tableSource.reportStatistics() :
table.getStatistic().getTableStats();
        }
     }
}

This just improved a little bit without introducing some kind of
Action/Performer interface with many subclasses and factory class to get
rid of some if-else statements, which could optionally be the next step
provement in the future.

Best regards,
Jing

On Tue, May 31, 2022 at 3:42 PM godfrey he <go...@gmail.com> wrote:

> Hi Jark and Jing,
>
> +1 to use "report" instead of "collect".
>
> >  // only filter push down (*the description means
> > partitionPushDownSpec == null but misses the case of
> > partitionPushDownSpec != null*)
>
> `if (partitionPushDownSpec != null && filterPushDownSpec == null)`
> this branch is only consider that the partition is partition is pushed
> down,
> but no filter is push down. The planner will collect the statistics
> from catalog first,
> and then try to collect the statistics from collectors if the catalog
> statistics is unknown.
>
> `else if (filterPushDownSpec != null)` this branch means  whether
> the partitionPushDownSpec is null or not, the planner will collect
> statistics from
> collectors only, because the catalog do not support get statistics with
> filters.
>
> `else if (collectStatEnabled
>                 && (table.getStatistic().getTableStats() ==
> TableStats.UNKNOWN)
>                 && tableSource instanceof SupportStatisticReport)`
> this branch means no partition and no filter are pushed down.
>
> or we can change the pseudocode to:
>  if (filterPushDownSpec != null) {
>    // filter push down, no mater  partition push down or not
> } else {
>     if (partitionPushDownSpec != null) {
>         // partition push down, while filter not
>      } else {
>          // no partition and filter push down
>      }
> }
>
> Best,
> Godfrey
>
> Jing Ge <ji...@ververica.com> 于2022年5月29日周日 08:17写道:
> >
> > Hi Godfrey,
> >
> > Thanks for driving this FLIP.  It looks really good! Looking forward to
> it!
> >
> > If I am not mistaken, partition pruning could also happen in the
> following
> > pseudocode condition block:
> >
> > else if (filterPushDownSpec != null) {
> >             // only filter push down (*the description means
> > partitionPushDownSpec == null but misses the case of
> > partitionPushDownSpec != null*)
> >
> >             // the catalog do not support get statistics with filters,
> >             // so only call reportStatistics method if needed
> >             if (collectStatEnabled && tableSource instanceof
> > SupportStatisticReport) {
> >                 newTableStat = ((SupportStatisticReport)
> > tableSource).reportStatistics();
> >             }
> >
> >
> > Best regards,
> >
> > Jing
> >
> >
> > On Sat, May 28, 2022 at 5:09 PM Jark Wu <im...@gmail.com> wrote:
> >
> > > Hi Godfrey,
> > >
> > > It seems that the "SupportStatisticReport" interface name and
> > >  "table.optimizer.source.connect-statistics-enabled" option name is not
> > > updated in the FLIP.
> > >
> > > Besides, in the terms of the option name, the meaning of
> > > "source.statistics-type"
> > > is not very straightforward and clean to me. Maybe
> > > "source.report-statistics" = "none/all/file-size"
> > > would be better.
> > >
> > > We can also change "table.optimizer.source.connect-statistics-enabled"
> to
> > > "table.optimizer.source.report-statistics-enabled" for alignment and
> > >  it's clear that one for fine-grained and one for coarse-grained.
> > >
> > >
> > > Best,
> > > Jark
> > >
> > > On Fri, 27 May 2022 at 22:58, godfrey he <go...@gmail.com> wrote:
> > >
> > > > Hi, everyone.
> > > >
> > > > Thanks for all the inputs.
> > > > If there is no more feedback, I think we can start the vote next
> monday.
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
> > > > >
> > > > > Hi Godfrey,
> > > > >
> > > > > Thanks for creating the FLIP. I have no comments.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > >
> > > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Godfrey,
> > > > > >
> > > > > > Thanks for your reply.
> > > > > >
> > > > > > Sounds good to me.
> > > > > >
> > > > > > > I think we should also introduce a config option
> > > > > >
> > > > > > We can add this option to the FLIP. I prefer a option for
> > > > > > FileSystemConnector, maybe a enum.
> > > > > >
> > > > > > Best,
> > > > > > Jingsong
> > > > > >
> > > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfreyhe@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Jingsong,
> > > > > > >
> > > > > > > Thanks for the feedback.
> > > > > > >
> > > > > > >
> > > > > > > >One concern I have is that we read the footer for each file,
> and
> > > > this
> > > > > > may
> > > > > > > >be a bit costly in some cases. Is it possible for us to have
> some
> > > > > > > > hierarchical way
> > > > > > > yes, if there are thousands of orc/parquet files, it may take a
> > > long
> > > > > > time.
> > > > > > > So we can introduce a config option to let the user choose the
> > > > > > > granularity of the statistics.
> > > > > > > But the SIZE will not be introduced, because the planner does
> not
> > > use
> > > > > > > the file size statistics now.
> > > > > > > We can introduce once file size statistics is introduce in the
> > > > future.
> > > > > > > I think we should also introduce a config option to
> enable/disable
> > > > > > > SupportStatisticReport,
> > > > > > > because it's a heavy operation for some connectors in some
> cases.
> > > > > > >
> > > > > > > > is the filter pushdown already happening at
> > > > > > > > this time?
> > > > > > > That's a good point. Currently, the filter push down is after
> > > > partition
> > > > > > > pruning
> > > > > > > to prevent the filter push down rule from consuming the
> partition
> > > > > > > predicates.
> > > > > > > The statistics will be set to unknown if filter is pushed down
> now.
> > > > > > > To combine them all, we can create an optimization program
> after
> > > > filter
> > > > > > > push
> > > > > > > down program to collect the statistics. This could avoid
> collecting
> > > > > > > statistics multiple times.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Godfrey
> > > > > > >
> > > > > > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > > > > >
> > > > > > > > Thank Godfrey for driving.
> > > > > > > >
> > > > > > > > Looks very good~ This will undoubtedly greatly enhance the
> > > various
> > > > > > batch
> > > > > > > > mode connectors.
> > > > > > > >
> > > > > > > > I left some comments:
> > > > > > > >
> > > > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > > > >
> > > > > > > > One concern I have is that we read the footer for each file,
> and
> > > > this
> > > > > > may
> > > > > > > > be a bit costly in some cases. Is it possible for us to have
> some
> > > > > > > > hierarchical way, e.g.
> > > > > > > > - No statistics are collected for files by default.
> > > > > > > > - SIZE: Generate statistics based on file Size, get the size
> of
> > > the
> > > > > > file
> > > > > > > > only with access to the master of the FileSystem.
> > > > > > > > - DETAILED: Get the complete statistics by format, possibly
> by
> > > > > > accessing
> > > > > > > > the footer of the file.
> > > > > > > >
> > > > > > > > ## When use the statistics reported by connector
> > > > > > > >
> > > > > > > > > When partitions are pruned by
> > > > PushPartitionIntoTableSourceScanRule,
> > > > > > the
> > > > > > > > statistics should also be updated.
> > > > > > > >
> > > > > > > > I understand that we definitely need to use reporter after
> the
> > > > > > partition
> > > > > > > > prune, but another question: is the filter pushdown already
> > > > happening
> > > > > > at
> > > > > > > > this time?
> > > > > > > > Can we make sure that in the following three cases, both the
> > > filter
> > > > > > > > pushdown and the partition prune happen before the stats
> > > reporting.
> > > > > > > > - only partition prune happens
> > > > > > > > - only filter pushdown happens
> > > > > > > > - both filter pushdown and partition prune happen
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jingsong
> > > > > > > >
> > > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <
> godfreyhe@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > > > SupportStatisticReport
> > > > > > > > > to support reporting statistics from source connectors.
> > > > > > > > >
> > > > > > > > > Statistics are one of the most important inputs to the
> > > optimizer.
> > > > > > > > > Accurate and complete statistics allows the optimizer to be
> > > more
> > > > > > > powerful.
> > > > > > > > > Currently, the statistics of Flink SQL come from Catalog
> only,
> > > > > > > > > while many Connectors have the ability to provide
> statistics,
> > > > e.g.
> > > > > > > > > FileSystem.
> > > > > > > > > In production, we find many tables in Catalog do not have
> any
> > > > > > > statistics.
> > > > > > > > > As a result, the optimizer can't generate better execution
> > > plans,
> > > > > > > > > especially for Batch jobs.
> > > > > > > > >
> > > > > > > > > There are two approaches to enhance statistics for the
> planner,
> > > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will
> write
> > > > > > > > > the analyzed result to the catalog, another is to
> introduce a
> > > new
> > > > > > > > > connector interface
> > > > > > > > > which allows the connector itself to report statistics
> directly
> > > > to
> > > > > > the
> > > > > > > > > planner.
> > > > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > > > >
> > > > > > > > > Here, we will discuss the second approach. Compared to the
> > > first
> > > > one,
> > > > > > > > > the second one is to get statistics in real time, no need
> to
> > > run
> > > > an
> > > > > > > > > analysis job for each table. This could help improve the
> user
> > > > > > > > > experience.
> > > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> > > > FLIP.)
> > > > > > > > >
> > > > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > > > forward to
> > > > > > > > > your feedback.
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Godfrey
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by godfrey he <go...@gmail.com>.
Hi Jark and Jing,

+1 to use "report" instead of "collect".

>  // only filter push down (*the description means
> partitionPushDownSpec == null but misses the case of
> partitionPushDownSpec != null*)

`if (partitionPushDownSpec != null && filterPushDownSpec == null)`
this branch is only consider that the partition is partition is pushed down,
but no filter is push down. The planner will collect the statistics
from catalog first,
and then try to collect the statistics from collectors if the catalog
statistics is unknown.

`else if (filterPushDownSpec != null)` this branch means  whether
the partitionPushDownSpec is null or not, the planner will collect
statistics from
collectors only, because the catalog do not support get statistics with filters.

`else if (collectStatEnabled
                && (table.getStatistic().getTableStats() == TableStats.UNKNOWN)
                && tableSource instanceof SupportStatisticReport)`
this branch means no partition and no filter are pushed down.

or we can change the pseudocode to:
 if (filterPushDownSpec != null) {
   // filter push down, no mater  partition push down or not
} else {
    if (partitionPushDownSpec != null) {
        // partition push down, while filter not
     } else {
         // no partition and filter push down
     }
}

Best,
Godfrey

Jing Ge <ji...@ververica.com> 于2022年5月29日周日 08:17写道:
>
> Hi Godfrey,
>
> Thanks for driving this FLIP.  It looks really good! Looking forward to it!
>
> If I am not mistaken, partition pruning could also happen in the following
> pseudocode condition block:
>
> else if (filterPushDownSpec != null) {
>             // only filter push down (*the description means
> partitionPushDownSpec == null but misses the case of
> partitionPushDownSpec != null*)
>
>             // the catalog do not support get statistics with filters,
>             // so only call reportStatistics method if needed
>             if (collectStatEnabled && tableSource instanceof
> SupportStatisticReport) {
>                 newTableStat = ((SupportStatisticReport)
> tableSource).reportStatistics();
>             }
>
>
> Best regards,
>
> Jing
>
>
> On Sat, May 28, 2022 at 5:09 PM Jark Wu <im...@gmail.com> wrote:
>
> > Hi Godfrey,
> >
> > It seems that the "SupportStatisticReport" interface name and
> >  "table.optimizer.source.connect-statistics-enabled" option name is not
> > updated in the FLIP.
> >
> > Besides, in the terms of the option name, the meaning of
> > "source.statistics-type"
> > is not very straightforward and clean to me. Maybe
> > "source.report-statistics" = "none/all/file-size"
> > would be better.
> >
> > We can also change "table.optimizer.source.connect-statistics-enabled" to
> > "table.optimizer.source.report-statistics-enabled" for alignment and
> >  it's clear that one for fine-grained and one for coarse-grained.
> >
> >
> > Best,
> > Jark
> >
> > On Fri, 27 May 2022 at 22:58, godfrey he <go...@gmail.com> wrote:
> >
> > > Hi, everyone.
> > >
> > > Thanks for all the inputs.
> > > If there is no more feedback, I think we can start the vote next monday.
> > >
> > > Best,
> > > Godfrey
> > >
> > > Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
> > > >
> > > > Hi Godfrey,
> > > >
> > > > Thanks for creating the FLIP. I have no comments.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > >
> > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Godfrey,
> > > > >
> > > > > Thanks for your reply.
> > > > >
> > > > > Sounds good to me.
> > > > >
> > > > > > I think we should also introduce a config option
> > > > >
> > > > > We can add this option to the FLIP. I prefer a option for
> > > > > FileSystemConnector, maybe a enum.
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > >
> > > > > >
> > > > > > >One concern I have is that we read the footer for each file, and
> > > this
> > > > > may
> > > > > > >be a bit costly in some cases. Is it possible for us to have some
> > > > > > > hierarchical way
> > > > > > yes, if there are thousands of orc/parquet files, it may take a
> > long
> > > > > time.
> > > > > > So we can introduce a config option to let the user choose the
> > > > > > granularity of the statistics.
> > > > > > But the SIZE will not be introduced, because the planner does not
> > use
> > > > > > the file size statistics now.
> > > > > > We can introduce once file size statistics is introduce in the
> > > future.
> > > > > > I think we should also introduce a config option to enable/disable
> > > > > > SupportStatisticReport,
> > > > > > because it's a heavy operation for some connectors in some cases.
> > > > > >
> > > > > > > is the filter pushdown already happening at
> > > > > > > this time?
> > > > > > That's a good point. Currently, the filter push down is after
> > > partition
> > > > > > pruning
> > > > > > to prevent the filter push down rule from consuming the partition
> > > > > > predicates.
> > > > > > The statistics will be set to unknown if filter is pushed down now.
> > > > > > To combine them all, we can create an optimization program after
> > > filter
> > > > > > push
> > > > > > down program to collect the statistics. This could avoid collecting
> > > > > > statistics multiple times.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Godfrey
> > > > > >
> > > > > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > > > >
> > > > > > > Thank Godfrey for driving.
> > > > > > >
> > > > > > > Looks very good~ This will undoubtedly greatly enhance the
> > various
> > > > > batch
> > > > > > > mode connectors.
> > > > > > >
> > > > > > > I left some comments:
> > > > > > >
> > > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > > >
> > > > > > > One concern I have is that we read the footer for each file, and
> > > this
> > > > > may
> > > > > > > be a bit costly in some cases. Is it possible for us to have some
> > > > > > > hierarchical way, e.g.
> > > > > > > - No statistics are collected for files by default.
> > > > > > > - SIZE: Generate statistics based on file Size, get the size of
> > the
> > > > > file
> > > > > > > only with access to the master of the FileSystem.
> > > > > > > - DETAILED: Get the complete statistics by format, possibly by
> > > > > accessing
> > > > > > > the footer of the file.
> > > > > > >
> > > > > > > ## When use the statistics reported by connector
> > > > > > >
> > > > > > > > When partitions are pruned by
> > > PushPartitionIntoTableSourceScanRule,
> > > > > the
> > > > > > > statistics should also be updated.
> > > > > > >
> > > > > > > I understand that we definitely need to use reporter after the
> > > > > partition
> > > > > > > prune, but another question: is the filter pushdown already
> > > happening
> > > > > at
> > > > > > > this time?
> > > > > > > Can we make sure that in the following three cases, both the
> > filter
> > > > > > > pushdown and the partition prune happen before the stats
> > reporting.
> > > > > > > - only partition prune happens
> > > > > > > - only filter pushdown happens
> > > > > > > - both filter pushdown and partition prune happen
> > > > > > >
> > > > > > > Best,
> > > > > > > Jingsong
> > > > > > >
> > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > > SupportStatisticReport
> > > > > > > > to support reporting statistics from source connectors.
> > > > > > > >
> > > > > > > > Statistics are one of the most important inputs to the
> > optimizer.
> > > > > > > > Accurate and complete statistics allows the optimizer to be
> > more
> > > > > > powerful.
> > > > > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > > > > while many Connectors have the ability to provide statistics,
> > > e.g.
> > > > > > > > FileSystem.
> > > > > > > > In production, we find many tables in Catalog do not have any
> > > > > > statistics.
> > > > > > > > As a result, the optimizer can't generate better execution
> > plans,
> > > > > > > > especially for Batch jobs.
> > > > > > > >
> > > > > > > > There are two approaches to enhance statistics for the planner,
> > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > > > > the analyzed result to the catalog, another is to introduce a
> > new
> > > > > > > > connector interface
> > > > > > > > which allows the connector itself to report statistics directly
> > > to
> > > > > the
> > > > > > > > planner.
> > > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > > >
> > > > > > > > Here, we will discuss the second approach. Compared to the
> > first
> > > one,
> > > > > > > > the second one is to get statistics in real time, no need to
> > run
> > > an
> > > > > > > > analysis job for each table. This could help improve the user
> > > > > > > > experience.
> > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> > > FLIP.)
> > > > > > > >
> > > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > > forward to
> > > > > > > > your feedback.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Godfrey
> > > > > > > >
> > > > > >
> > > > >
> > >
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Jing Ge <ji...@ververica.com>.
Hi Godfrey,

Thanks for driving this FLIP.  It looks really good! Looking forward to it!

If I am not mistaken, partition pruning could also happen in the following
pseudocode condition block:

else if (filterPushDownSpec != null) {
            // only filter push down (*the description means
partitionPushDownSpec == null but misses the case of
partitionPushDownSpec != null*)

            // the catalog do not support get statistics with filters,
            // so only call reportStatistics method if needed
            if (collectStatEnabled && tableSource instanceof
SupportStatisticReport) {
                newTableStat = ((SupportStatisticReport)
tableSource).reportStatistics();
            }


Best regards,

Jing


On Sat, May 28, 2022 at 5:09 PM Jark Wu <im...@gmail.com> wrote:

> Hi Godfrey,
>
> It seems that the "SupportStatisticReport" interface name and
>  "table.optimizer.source.connect-statistics-enabled" option name is not
> updated in the FLIP.
>
> Besides, in the terms of the option name, the meaning of
> "source.statistics-type"
> is not very straightforward and clean to me. Maybe
> "source.report-statistics" = "none/all/file-size"
> would be better.
>
> We can also change "table.optimizer.source.connect-statistics-enabled" to
> "table.optimizer.source.report-statistics-enabled" for alignment and
>  it's clear that one for fine-grained and one for coarse-grained.
>
>
> Best,
> Jark
>
> On Fri, 27 May 2022 at 22:58, godfrey he <go...@gmail.com> wrote:
>
> > Hi, everyone.
> >
> > Thanks for all the inputs.
> > If there is no more feedback, I think we can start the vote next monday.
> >
> > Best,
> > Godfrey
> >
> > Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
> > >
> > > Hi Godfrey,
> > >
> > > Thanks for creating the FLIP. I have no comments.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > >
> > > On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com>
> > wrote:
> > >
> > > > Hi Godfrey,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Sounds good to me.
> > > >
> > > > > I think we should also introduce a config option
> > > >
> > > > We can add this option to the FLIP. I prefer a option for
> > > > FileSystemConnector, maybe a enum.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Jingsong,
> > > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > >
> > > > > >One concern I have is that we read the footer for each file, and
> > this
> > > > may
> > > > > >be a bit costly in some cases. Is it possible for us to have some
> > > > > > hierarchical way
> > > > > yes, if there are thousands of orc/parquet files, it may take a
> long
> > > > time.
> > > > > So we can introduce a config option to let the user choose the
> > > > > granularity of the statistics.
> > > > > But the SIZE will not be introduced, because the planner does not
> use
> > > > > the file size statistics now.
> > > > > We can introduce once file size statistics is introduce in the
> > future.
> > > > > I think we should also introduce a config option to enable/disable
> > > > > SupportStatisticReport,
> > > > > because it's a heavy operation for some connectors in some cases.
> > > > >
> > > > > > is the filter pushdown already happening at
> > > > > > this time?
> > > > > That's a good point. Currently, the filter push down is after
> > partition
> > > > > pruning
> > > > > to prevent the filter push down rule from consuming the partition
> > > > > predicates.
> > > > > The statistics will be set to unknown if filter is pushed down now.
> > > > > To combine them all, we can create an optimization program after
> > filter
> > > > > push
> > > > > down program to collect the statistics. This could avoid collecting
> > > > > statistics multiple times.
> > > > >
> > > > >
> > > > > Best,
> > > > > Godfrey
> > > > >
> > > > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > > >
> > > > > > Thank Godfrey for driving.
> > > > > >
> > > > > > Looks very good~ This will undoubtedly greatly enhance the
> various
> > > > batch
> > > > > > mode connectors.
> > > > > >
> > > > > > I left some comments:
> > > > > >
> > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > >
> > > > > > One concern I have is that we read the footer for each file, and
> > this
> > > > may
> > > > > > be a bit costly in some cases. Is it possible for us to have some
> > > > > > hierarchical way, e.g.
> > > > > > - No statistics are collected for files by default.
> > > > > > - SIZE: Generate statistics based on file Size, get the size of
> the
> > > > file
> > > > > > only with access to the master of the FileSystem.
> > > > > > - DETAILED: Get the complete statistics by format, possibly by
> > > > accessing
> > > > > > the footer of the file.
> > > > > >
> > > > > > ## When use the statistics reported by connector
> > > > > >
> > > > > > > When partitions are pruned by
> > PushPartitionIntoTableSourceScanRule,
> > > > the
> > > > > > statistics should also be updated.
> > > > > >
> > > > > > I understand that we definitely need to use reporter after the
> > > > partition
> > > > > > prune, but another question: is the filter pushdown already
> > happening
> > > > at
> > > > > > this time?
> > > > > > Can we make sure that in the following three cases, both the
> filter
> > > > > > pushdown and the partition prune happen before the stats
> reporting.
> > > > > > - only partition prune happens
> > > > > > - only filter pushdown happens
> > > > > > - both filter pushdown and partition prune happen
> > > > > >
> > > > > > Best,
> > > > > > Jingsong
> > > > > >
> > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > SupportStatisticReport
> > > > > > > to support reporting statistics from source connectors.
> > > > > > >
> > > > > > > Statistics are one of the most important inputs to the
> optimizer.
> > > > > > > Accurate and complete statistics allows the optimizer to be
> more
> > > > > powerful.
> > > > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > > > while many Connectors have the ability to provide statistics,
> > e.g.
> > > > > > > FileSystem.
> > > > > > > In production, we find many tables in Catalog do not have any
> > > > > statistics.
> > > > > > > As a result, the optimizer can't generate better execution
> plans,
> > > > > > > especially for Batch jobs.
> > > > > > >
> > > > > > > There are two approaches to enhance statistics for the planner,
> > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > > > the analyzed result to the catalog, another is to introduce a
> new
> > > > > > > connector interface
> > > > > > > which allows the connector itself to report statistics directly
> > to
> > > > the
> > > > > > > planner.
> > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > >
> > > > > > > Here, we will discuss the second approach. Compared to the
> first
> > one,
> > > > > > > the second one is to get statistics in real time, no need to
> run
> > an
> > > > > > > analysis job for each table. This could help improve the user
> > > > > > > experience.
> > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> > FLIP.)
> > > > > > >
> > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > forward to
> > > > > > > your feedback.
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Godfrey
> > > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Jark Wu <im...@gmail.com>.
Hi Godfrey,

It seems that the "SupportStatisticReport" interface name and
 "table.optimizer.source.connect-statistics-enabled" option name is not
updated in the FLIP.

Besides, in the terms of the option name, the meaning of
"source.statistics-type"
is not very straightforward and clean to me. Maybe
"source.report-statistics" = "none/all/file-size"
would be better.

We can also change "table.optimizer.source.connect-statistics-enabled" to
"table.optimizer.source.report-statistics-enabled" for alignment and
 it's clear that one for fine-grained and one for coarse-grained.


Best,
Jark

On Fri, 27 May 2022 at 22:58, godfrey he <go...@gmail.com> wrote:

> Hi, everyone.
>
> Thanks for all the inputs.
> If there is no more feedback, I think we can start the vote next monday.
>
> Best,
> Godfrey
>
> Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
> >
> > Hi Godfrey,
> >
> > Thanks for creating the FLIP. I have no comments.
> >
> > Best regards,
> >
> > Martijn
> >
> >
> > On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com>
> wrote:
> >
> > > Hi Godfrey,
> > >
> > > Thanks for your reply.
> > >
> > > Sounds good to me.
> > >
> > > > I think we should also introduce a config option
> > >
> > > We can add this option to the FLIP. I prefer a option for
> > > FileSystemConnector, maybe a enum.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com>
> wrote:
> > >
> > > > Hi Jingsong,
> > > >
> > > > Thanks for the feedback.
> > > >
> > > >
> > > > >One concern I have is that we read the footer for each file, and
> this
> > > may
> > > > >be a bit costly in some cases. Is it possible for us to have some
> > > > > hierarchical way
> > > > yes, if there are thousands of orc/parquet files, it may take a long
> > > time.
> > > > So we can introduce a config option to let the user choose the
> > > > granularity of the statistics.
> > > > But the SIZE will not be introduced, because the planner does not use
> > > > the file size statistics now.
> > > > We can introduce once file size statistics is introduce in the
> future.
> > > > I think we should also introduce a config option to enable/disable
> > > > SupportStatisticReport,
> > > > because it's a heavy operation for some connectors in some cases.
> > > >
> > > > > is the filter pushdown already happening at
> > > > > this time?
> > > > That's a good point. Currently, the filter push down is after
> partition
> > > > pruning
> > > > to prevent the filter push down rule from consuming the partition
> > > > predicates.
> > > > The statistics will be set to unknown if filter is pushed down now.
> > > > To combine them all, we can create an optimization program after
> filter
> > > > push
> > > > down program to collect the statistics. This could avoid collecting
> > > > statistics multiple times.
> > > >
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > >
> > > > > Thank Godfrey for driving.
> > > > >
> > > > > Looks very good~ This will undoubtedly greatly enhance the various
> > > batch
> > > > > mode connectors.
> > > > >
> > > > > I left some comments:
> > > > >
> > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > >
> > > > > One concern I have is that we read the footer for each file, and
> this
> > > may
> > > > > be a bit costly in some cases. Is it possible for us to have some
> > > > > hierarchical way, e.g.
> > > > > - No statistics are collected for files by default.
> > > > > - SIZE: Generate statistics based on file Size, get the size of the
> > > file
> > > > > only with access to the master of the FileSystem.
> > > > > - DETAILED: Get the complete statistics by format, possibly by
> > > accessing
> > > > > the footer of the file.
> > > > >
> > > > > ## When use the statistics reported by connector
> > > > >
> > > > > > When partitions are pruned by
> PushPartitionIntoTableSourceScanRule,
> > > the
> > > > > statistics should also be updated.
> > > > >
> > > > > I understand that we definitely need to use reporter after the
> > > partition
> > > > > prune, but another question: is the filter pushdown already
> happening
> > > at
> > > > > this time?
> > > > > Can we make sure that in the following three cases, both the filter
> > > > > pushdown and the partition prune happen before the stats reporting.
> > > > > - only partition prune happens
> > > > > - only filter pushdown happens
> > > > > - both filter pushdown and partition prune happen
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > SupportStatisticReport
> > > > > > to support reporting statistics from source connectors.
> > > > > >
> > > > > > Statistics are one of the most important inputs to the optimizer.
> > > > > > Accurate and complete statistics allows the optimizer to be more
> > > > powerful.
> > > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > > while many Connectors have the ability to provide statistics,
> e.g.
> > > > > > FileSystem.
> > > > > > In production, we find many tables in Catalog do not have any
> > > > statistics.
> > > > > > As a result, the optimizer can't generate better execution plans,
> > > > > > especially for Batch jobs.
> > > > > >
> > > > > > There are two approaches to enhance statistics for the planner,
> > > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > > the analyzed result to the catalog, another is to introduce a new
> > > > > > connector interface
> > > > > > which allows the connector itself to report statistics directly
> to
> > > the
> > > > > > planner.
> > > > > > The second one is a supplement to the catalog statistics.
> > > > > >
> > > > > > Here, we will discuss the second approach. Compared to the first
> one,
> > > > > > the second one is to get statistics in real time, no need to run
> an
> > > > > > analysis job for each table. This could help improve the user
> > > > > > experience.
> > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> FLIP.)
> > > > > >
> > > > > > You can find more details in FLIP-231 document[1]. Looking
> forward to
> > > > > > your feedback.
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Godfrey
> > > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by godfrey he <go...@gmail.com>.
Hi, everyone.

Thanks for all the inputs.
If there is no more feedback, I think we can start the vote next monday.

Best,
Godfrey

Martijn Visser <ma...@apache.org> 于2022年5月25日周三 19:46写道:
>
> Hi Godfrey,
>
> Thanks for creating the FLIP. I have no comments.
>
> Best regards,
>
> Martijn
>
>
> On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com> wrote:
>
> > Hi Godfrey,
> >
> > Thanks for your reply.
> >
> > Sounds good to me.
> >
> > > I think we should also introduce a config option
> >
> > We can add this option to the FLIP. I prefer a option for
> > FileSystemConnector, maybe a enum.
> >
> > Best,
> > Jingsong
> >
> > On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com> wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for the feedback.
> > >
> > >
> > > >One concern I have is that we read the footer for each file, and this
> > may
> > > >be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way
> > > yes, if there are thousands of orc/parquet files, it may take a long
> > time.
> > > So we can introduce a config option to let the user choose the
> > > granularity of the statistics.
> > > But the SIZE will not be introduced, because the planner does not use
> > > the file size statistics now.
> > > We can introduce once file size statistics is introduce in the future.
> > > I think we should also introduce a config option to enable/disable
> > > SupportStatisticReport,
> > > because it's a heavy operation for some connectors in some cases.
> > >
> > > > is the filter pushdown already happening at
> > > > this time?
> > > That's a good point. Currently, the filter push down is after partition
> > > pruning
> > > to prevent the filter push down rule from consuming the partition
> > > predicates.
> > > The statistics will be set to unknown if filter is pushed down now.
> > > To combine them all, we can create an optimization program after filter
> > > push
> > > down program to collect the statistics. This could avoid collecting
> > > statistics multiple times.
> > >
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > >
> > > > Thank Godfrey for driving.
> > > >
> > > > Looks very good~ This will undoubtedly greatly enhance the various
> > batch
> > > > mode connectors.
> > > >
> > > > I left some comments:
> > > >
> > > > ## FileBasedStatisticsReportableDecodingFormat
> > > >
> > > > One concern I have is that we read the footer for each file, and this
> > may
> > > > be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way, e.g.
> > > > - No statistics are collected for files by default.
> > > > - SIZE: Generate statistics based on file Size, get the size of the
> > file
> > > > only with access to the master of the FileSystem.
> > > > - DETAILED: Get the complete statistics by format, possibly by
> > accessing
> > > > the footer of the file.
> > > >
> > > > ## When use the statistics reported by connector
> > > >
> > > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule,
> > the
> > > > statistics should also be updated.
> > > >
> > > > I understand that we definitely need to use reporter after the
> > partition
> > > > prune, but another question: is the filter pushdown already happening
> > at
> > > > this time?
> > > > Can we make sure that in the following three cases, both the filter
> > > > pushdown and the partition prune happen before the stats reporting.
> > > > - only partition prune happens
> > > > - only filter pushdown happens
> > > > - both filter pushdown and partition prune happen
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > SupportStatisticReport
> > > > > to support reporting statistics from source connectors.
> > > > >
> > > > > Statistics are one of the most important inputs to the optimizer.
> > > > > Accurate and complete statistics allows the optimizer to be more
> > > powerful.
> > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > while many Connectors have the ability to provide statistics, e.g.
> > > > > FileSystem.
> > > > > In production, we find many tables in Catalog do not have any
> > > statistics.
> > > > > As a result, the optimizer can't generate better execution plans,
> > > > > especially for Batch jobs.
> > > > >
> > > > > There are two approaches to enhance statistics for the planner,
> > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > the analyzed result to the catalog, another is to introduce a new
> > > > > connector interface
> > > > > which allows the connector itself to report statistics directly to
> > the
> > > > > planner.
> > > > > The second one is a supplement to the catalog statistics.
> > > > >
> > > > > Here, we will discuss the second approach. Compared to the first one,
> > > > > the second one is to get statistics in real time, no need to run an
> > > > > analysis job for each table. This could help improve the user
> > > > > experience.
> > > > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> > > > >
> > > > > You can find more details in FLIP-231 document[1]. Looking forward to
> > > > > your feedback.
> > > > >
> > > > > [1]
> > > > >
> > >
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > >
> > > > >
> > > > > Best,
> > > > > Godfrey
> > > > >
> > >
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Martijn Visser <ma...@apache.org>.
Hi Godfrey,

Thanks for creating the FLIP. I have no comments.

Best regards,

Martijn


On Tue, 17 May 2022 at 04:52, Jingsong Li <ji...@gmail.com> wrote:

> Hi Godfrey,
>
> Thanks for your reply.
>
> Sounds good to me.
>
> > I think we should also introduce a config option
>
> We can add this option to the FLIP. I prefer a option for
> FileSystemConnector, maybe a enum.
>
> Best,
> Jingsong
>
> On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com> wrote:
>
> > Hi Jingsong,
> >
> > Thanks for the feedback.
> >
> >
> > >One concern I have is that we read the footer for each file, and this
> may
> > >be a bit costly in some cases. Is it possible for us to have some
> > > hierarchical way
> > yes, if there are thousands of orc/parquet files, it may take a long
> time.
> > So we can introduce a config option to let the user choose the
> > granularity of the statistics.
> > But the SIZE will not be introduced, because the planner does not use
> > the file size statistics now.
> > We can introduce once file size statistics is introduce in the future.
> > I think we should also introduce a config option to enable/disable
> > SupportStatisticReport,
> > because it's a heavy operation for some connectors in some cases.
> >
> > > is the filter pushdown already happening at
> > > this time?
> > That's a good point. Currently, the filter push down is after partition
> > pruning
> > to prevent the filter push down rule from consuming the partition
> > predicates.
> > The statistics will be set to unknown if filter is pushed down now.
> > To combine them all, we can create an optimization program after filter
> > push
> > down program to collect the statistics. This could avoid collecting
> > statistics multiple times.
> >
> >
> > Best,
> > Godfrey
> >
> > Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> > >
> > > Thank Godfrey for driving.
> > >
> > > Looks very good~ This will undoubtedly greatly enhance the various
> batch
> > > mode connectors.
> > >
> > > I left some comments:
> > >
> > > ## FileBasedStatisticsReportableDecodingFormat
> > >
> > > One concern I have is that we read the footer for each file, and this
> may
> > > be a bit costly in some cases. Is it possible for us to have some
> > > hierarchical way, e.g.
> > > - No statistics are collected for files by default.
> > > - SIZE: Generate statistics based on file Size, get the size of the
> file
> > > only with access to the master of the FileSystem.
> > > - DETAILED: Get the complete statistics by format, possibly by
> accessing
> > > the footer of the file.
> > >
> > > ## When use the statistics reported by connector
> > >
> > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule,
> the
> > > statistics should also be updated.
> > >
> > > I understand that we definitely need to use reporter after the
> partition
> > > prune, but another question: is the filter pushdown already happening
> at
> > > this time?
> > > Can we make sure that in the following three cases, both the filter
> > > pushdown and the partition prune happen before the stats reporting.
> > > - only partition prune happens
> > > - only filter pushdown happens
> > > - both filter pushdown and partition prune happen
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > SupportStatisticReport
> > > > to support reporting statistics from source connectors.
> > > >
> > > > Statistics are one of the most important inputs to the optimizer.
> > > > Accurate and complete statistics allows the optimizer to be more
> > powerful.
> > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > while many Connectors have the ability to provide statistics, e.g.
> > > > FileSystem.
> > > > In production, we find many tables in Catalog do not have any
> > statistics.
> > > > As a result, the optimizer can't generate better execution plans,
> > > > especially for Batch jobs.
> > > >
> > > > There are two approaches to enhance statistics for the planner,
> > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > the analyzed result to the catalog, another is to introduce a new
> > > > connector interface
> > > > which allows the connector itself to report statistics directly to
> the
> > > > planner.
> > > > The second one is a supplement to the catalog statistics.
> > > >
> > > > Here, we will discuss the second approach. Compared to the first one,
> > > > the second one is to get statistics in real time, no need to run an
> > > > analysis job for each table. This could help improve the user
> > > > experience.
> > > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> > > >
> > > > You can find more details in FLIP-231 document[1]. Looking forward to
> > > > your feedback.
> > > >
> > > > [1]
> > > >
> >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > >
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> >
>

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Jingsong Li <ji...@gmail.com>.
Hi Godfrey,

Thanks for your reply.

Sounds good to me.

> I think we should also introduce a config option

We can add this option to the FLIP. I prefer a option for
FileSystemConnector, maybe a enum.

Best,
Jingsong

On Tue, May 17, 2022 at 10:31 AM godfrey he <go...@gmail.com> wrote:

> Hi Jingsong,
>
> Thanks for the feedback.
>
>
> >One concern I have is that we read the footer for each file, and this may
> >be a bit costly in some cases. Is it possible for us to have some
> > hierarchical way
> yes, if there are thousands of orc/parquet files, it may take a long time.
> So we can introduce a config option to let the user choose the
> granularity of the statistics.
> But the SIZE will not be introduced, because the planner does not use
> the file size statistics now.
> We can introduce once file size statistics is introduce in the future.
> I think we should also introduce a config option to enable/disable
> SupportStatisticReport,
> because it's a heavy operation for some connectors in some cases.
>
> > is the filter pushdown already happening at
> > this time?
> That's a good point. Currently, the filter push down is after partition
> pruning
> to prevent the filter push down rule from consuming the partition
> predicates.
> The statistics will be set to unknown if filter is pushed down now.
> To combine them all, we can create an optimization program after filter
> push
> down program to collect the statistics. This could avoid collecting
> statistics multiple times.
>
>
> Best,
> Godfrey
>
> Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
> >
> > Thank Godfrey for driving.
> >
> > Looks very good~ This will undoubtedly greatly enhance the various batch
> > mode connectors.
> >
> > I left some comments:
> >
> > ## FileBasedStatisticsReportableDecodingFormat
> >
> > One concern I have is that we read the footer for each file, and this may
> > be a bit costly in some cases. Is it possible for us to have some
> > hierarchical way, e.g.
> > - No statistics are collected for files by default.
> > - SIZE: Generate statistics based on file Size, get the size of the file
> > only with access to the master of the FileSystem.
> > - DETAILED: Get the complete statistics by format, possibly by accessing
> > the footer of the file.
> >
> > ## When use the statistics reported by connector
> >
> > > When partitions are pruned by PushPartitionIntoTableSourceScanRule, the
> > statistics should also be updated.
> >
> > I understand that we definitely need to use reporter after the partition
> > prune, but another question: is the filter pushdown already happening at
> > this time?
> > Can we make sure that in the following three cases, both the filter
> > pushdown and the partition prune happen before the stats reporting.
> > - only partition prune happens
> > - only filter pushdown happens
> > - both filter pushdown and partition prune happen
> >
> > Best,
> > Jingsong
> >
> > On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I would like to open a discussion on FLIP-231:  Introduce
> > > SupportStatisticReport
> > > to support reporting statistics from source connectors.
> > >
> > > Statistics are one of the most important inputs to the optimizer.
> > > Accurate and complete statistics allows the optimizer to be more
> powerful.
> > > Currently, the statistics of Flink SQL come from Catalog only,
> > > while many Connectors have the ability to provide statistics, e.g.
> > > FileSystem.
> > > In production, we find many tables in Catalog do not have any
> statistics.
> > > As a result, the optimizer can't generate better execution plans,
> > > especially for Batch jobs.
> > >
> > > There are two approaches to enhance statistics for the planner,
> > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > the analyzed result to the catalog, another is to introduce a new
> > > connector interface
> > > which allows the connector itself to report statistics directly to the
> > > planner.
> > > The second one is a supplement to the catalog statistics.
> > >
> > > Here, we will discuss the second approach. Compared to the first one,
> > > the second one is to get statistics in real time, no need to run an
> > > analysis job for each table. This could help improve the user
> > > experience.
> > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> > >
> > > You can find more details in FLIP-231 document[1]. Looking forward to
> > > your feedback.
> > >
> > > [1]
> > >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > >
> > >
> > > Best,
> > > Godfrey
> > >
>

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by godfrey he <go...@gmail.com>.
Hi Jingsong,

Thanks for the feedback.


>One concern I have is that we read the footer for each file, and this may
>be a bit costly in some cases. Is it possible for us to have some
> hierarchical way
yes, if there are thousands of orc/parquet files, it may take a long time.
So we can introduce a config option to let the user choose the
granularity of the statistics.
But the SIZE will not be introduced, because the planner does not use
the file size statistics now.
We can introduce once file size statistics is introduce in the future.
I think we should also introduce a config option to enable/disable
SupportStatisticReport,
because it's a heavy operation for some connectors in some cases.

> is the filter pushdown already happening at
> this time?
That's a good point. Currently, the filter push down is after partition pruning
to prevent the filter push down rule from consuming the partition predicates.
The statistics will be set to unknown if filter is pushed down now.
To combine them all, we can create an optimization program after filter push
down program to collect the statistics. This could avoid collecting
statistics multiple times.


Best,
Godfrey

Jingsong Li <ji...@gmail.com> 于2022年5月13日周五 22:44写道:
>
> Thank Godfrey for driving.
>
> Looks very good~ This will undoubtedly greatly enhance the various batch
> mode connectors.
>
> I left some comments:
>
> ## FileBasedStatisticsReportableDecodingFormat
>
> One concern I have is that we read the footer for each file, and this may
> be a bit costly in some cases. Is it possible for us to have some
> hierarchical way, e.g.
> - No statistics are collected for files by default.
> - SIZE: Generate statistics based on file Size, get the size of the file
> only with access to the master of the FileSystem.
> - DETAILED: Get the complete statistics by format, possibly by accessing
> the footer of the file.
>
> ## When use the statistics reported by connector
>
> > When partitions are pruned by PushPartitionIntoTableSourceScanRule, the
> statistics should also be updated.
>
> I understand that we definitely need to use reporter after the partition
> prune, but another question: is the filter pushdown already happening at
> this time?
> Can we make sure that in the following three cases, both the filter
> pushdown and the partition prune happen before the stats reporting.
> - only partition prune happens
> - only filter pushdown happens
> - both filter pushdown and partition prune happen
>
> Best,
> Jingsong
>
> On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com> wrote:
>
> > Hi all,
> >
> > I would like to open a discussion on FLIP-231:  Introduce
> > SupportStatisticReport
> > to support reporting statistics from source connectors.
> >
> > Statistics are one of the most important inputs to the optimizer.
> > Accurate and complete statistics allows the optimizer to be more powerful.
> > Currently, the statistics of Flink SQL come from Catalog only,
> > while many Connectors have the ability to provide statistics, e.g.
> > FileSystem.
> > In production, we find many tables in Catalog do not have any statistics.
> > As a result, the optimizer can't generate better execution plans,
> > especially for Batch jobs.
> >
> > There are two approaches to enhance statistics for the planner,
> > one is to introduce the "ANALYZE TABLE" syntax which will write
> > the analyzed result to the catalog, another is to introduce a new
> > connector interface
> > which allows the connector itself to report statistics directly to the
> > planner.
> > The second one is a supplement to the catalog statistics.
> >
> > Here, we will discuss the second approach. Compared to the first one,
> > the second one is to get statistics in real time, no need to run an
> > analysis job for each table. This could help improve the user
> > experience.
> > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> >
> > You can find more details in FLIP-231 document[1]. Looking forward to
> > your feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> >
> >
> > Best,
> > Godfrey
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Posted by Jingsong Li <ji...@gmail.com>.
Thank Godfrey for driving.

Looks very good~ This will undoubtedly greatly enhance the various batch
mode connectors.

I left some comments:

## FileBasedStatisticsReportableDecodingFormat

One concern I have is that we read the footer for each file, and this may
be a bit costly in some cases. Is it possible for us to have some
hierarchical way, e.g.
- No statistics are collected for files by default.
- SIZE: Generate statistics based on file Size, get the size of the file
only with access to the master of the FileSystem.
- DETAILED: Get the complete statistics by format, possibly by accessing
the footer of the file.

## When use the statistics reported by connector

> When partitions are pruned by PushPartitionIntoTableSourceScanRule, the
statistics should also be updated.

I understand that we definitely need to use reporter after the partition
prune, but another question: is the filter pushdown already happening at
this time?
Can we make sure that in the following three cases, both the filter
pushdown and the partition prune happen before the stats reporting.
- only partition prune happens
- only filter pushdown happens
- both filter pushdown and partition prune happen

Best,
Jingsong

On Fri, May 13, 2022 at 6:57 PM godfrey he <go...@gmail.com> wrote:

> Hi all,
>
> I would like to open a discussion on FLIP-231:  Introduce
> SupportStatisticReport
> to support reporting statistics from source connectors.
>
> Statistics are one of the most important inputs to the optimizer.
> Accurate and complete statistics allows the optimizer to be more powerful.
> Currently, the statistics of Flink SQL come from Catalog only,
> while many Connectors have the ability to provide statistics, e.g.
> FileSystem.
> In production, we find many tables in Catalog do not have any statistics.
> As a result, the optimizer can't generate better execution plans,
> especially for Batch jobs.
>
> There are two approaches to enhance statistics for the planner,
> one is to introduce the "ANALYZE TABLE" syntax which will write
> the analyzed result to the catalog, another is to introduce a new
> connector interface
> which allows the connector itself to report statistics directly to the
> planner.
> The second one is a supplement to the catalog statistics.
>
> Here, we will discuss the second approach. Compared to the first one,
> the second one is to get statistics in real time, no need to run an
> analysis job for each table. This could help improve the user
> experience.
> (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
>
> You can find more details in FLIP-231 document[1]. Looking forward to
> your feedback.
>
> [1]
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&
> [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
>
>
> Best,
> Godfrey
>