Posted to user@flink.apache.org by Ovidiu-Cristian MARCU <ov...@inria.fr> on 2016/06/28 19:41:44 UTC

Optimizations not performed - please confirm

Hi,

The optimizer internals described in this document [1] are probably not up to date.
Can you please confirm whether the following is still valid:

“The following optimizations are not performed
Join reordering (or operator reordering in general): Joins / Filters / Reducers are not re-ordered in Flink. This is a high opportunity optimization, but with high risk in the absence of good estimates about the data characteristics. Flink is not doing these optimizations at this point.
Index vs. Table Scan selection: In Flink, all data sources are always scanned. The data source (the input format) may apply clever mechanism to not scan all the data, but pre-select and project. Examples are the RCFile / ORCFile / Parquet input formats."
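The first limitation quoted above can be illustrated generically. A toy sketch in plain Python (not Flink code, illustrative data only) of why filter placement around a join matters when the optimizer will not reorder operators for you:

```python
# Toy illustration (not Flink API): the same join + filter pipeline
# written two ways. Because the DataSet optimizer does not reorder
# operators, only the second form avoids joining rows that the filter
# would discard anyway.

def join(left, right, key_l, key_r):
    """Naive hash join on the given key positions."""
    index = {}
    for row in right:
        index.setdefault(row[key_r], []).append(row)
    return [l + r for l in left for r in index.get(l[key_l], [])]

orders = [(1, "EU", 100), (2, "US", 250), (3, "EU", 75)]  # (id, region, total)
items = [(1, "widget"), (2, "gadget"), (3, "sprocket")]   # (order_id, name)

# Filter after join: every order is joined, then most results are dropped.
late = [row for row in join(orders, items, 0, 0) if row[1] == "EU"]

# Filter before join ("pushed down" by hand): fewer rows enter the join.
eu_orders = [o for o in orders if o[1] == "EU"]
early = join(eu_orders, items, 0, 0)
```

Both forms produce the same result; only the second keeps the intermediate join input small, which is exactly the kind of decision the document says is left to the user.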
Any update of this page will be very helpful.

Thank you.

Best,
Ovidiu
[1] https://cwiki.apache.org/confluence/display/FLINK/Optimizer+Internals
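The second limitation quoted above (sources are always scanned, but an input format may pre-select and project) can also be sketched generically. A hedged toy model in plain Python of what a Parquet/ORC-style reader does with row-group statistics and column projection; the structures and names are hypothetical, not Flink's InputFormat API:

```python
# Toy sketch of a "clever" input format: the source is still scanned
# group by group, but whole row groups are skipped via min/max
# statistics (as Parquet/ORC store per row group), and only the
# projected columns are materialized.

row_groups = [
    {"min_id": 1, "max_id": 50, "rows": [(1, "a", 10), (50, "b", 20)]},
    {"min_id": 51, "max_id": 99, "rows": [(60, "c", 30), (99, "d", 40)]},
]

def scan(groups, wanted_id, columns):
    out = []
    for g in groups:
        # Predicate pushdown: prune groups whose stats exclude the value.
        if not (g["min_id"] <= wanted_id <= g["max_id"]):
            continue
        for row in g["rows"]:
            if row[0] == wanted_id:
                # Projection: keep only the requested columns.
                out.append(tuple(row[c] for c in columns))
    return out

result = scan(row_groups, 60, columns=(0, 2))
```

The first row group is pruned without reading its rows, which is the pre-selection the wiki page alludes to.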

Re: Optimizations not performed - please confirm

Posted by Fabian Hueske <fh...@gmail.com>.
Yes, that was my fault. I'm used to auto reply-all on my desktop machine,
but my phone just did a simple reply.
Sorry for the confusion,
Fabian




Re: Optimizations not performed - please confirm

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Thank you, Aljoscha!
I received a similar update from Fabian, only now I see the user list was not in CC.

Fabian: The optimizer hasn’t been touched (except for bugfixes and new operators) for quite some time.
These limitations are still present and I don’t expect them to be removed anytime soon. IMO, it is more likely that certain optimizations like join reordering will be done for Table API / SQL queries by the Calcite optimizer and pushed through the Flink Dataset optimizer.
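Join reordering of the kind Fabian mentions is a cost-based decision. A hedged sketch in plain Python (made-up cardinalities and a naive cost model, not Calcite's actual planner) of the choice such an optimizer makes:

```python
# Sketch of cost-based join ordering: enumerate left-deep join orders
# and pick the one with the smallest estimated intermediate results.
# All numbers and the cost model are made up for illustration.
from itertools import permutations

card = {"orders": 1_000_000, "customers": 10_000, "nations": 25}
sel = {frozenset(p): s for p, s in [
    (("orders", "customers"), 1e-4),
    (("customers", "nations"), 4e-2),
    (("orders", "nations"), 4e-2),
]}

def cost(order):
    """Sum of estimated intermediate result sizes for a left-deep order."""
    rows, total = card[order[0]], 0.0
    joined = {order[0]}
    for t in order[1:]:
        s = min(sel[frozenset((a, t))] for a in joined
                if frozenset((a, t)) in sel)
        rows = rows * card[t] * s
        joined.add(t)
        total += rows
    return total

best = min(permutations(card), key=cost)
```

Starting from the small tables wins here, which is exactly the decision the DataSet optimizer does not attempt and a Calcite-based Table API / SQL planner can.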

I agree that for join reordering optimisations it makes sense to rely on Calcite.
My goal is to understand how the current documentation corresponds to the actual status of the Flink framework.

I did an experimental study comparing Flink and Spark on many workloads at very large scale (I’ll share the results soon), and I would like to develop a few ideas on top of Flink (from the results, Flink is the winner in most of the use cases, and it is our choice as the platform on which to develop and grow).

My interest is in understanding more about Flink today. I am familiar with most of the published papers and I am also following the documentation.
I am looking at the DataSet API, the runtime, and the current architecture.

Best,
Ovidiu


Re: Optimizations not performed - please confirm

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
I think this document is still up to date, since not much was done in these
parts of the code for the 1.0 release or afterwards.

Maybe Timo can give some insights into the optimizations done in the
Table API/SQL that will be released in an updated version in 1.1.
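Optimizers like Calcite, which the Table API/SQL work builds on, operate by rewriting a plan tree with rules. A minimal sketch in plain Python (hypothetical plan structures, not Calcite's or Flink's actual representation) of a filter-pushdown rule:

```python
# Minimal sketch of rule-based plan rewriting, the style of
# optimization a Calcite-based planner applies. Plan nodes are plain
# tuples: ("scan", name), ("filter", child, pred), ("join", l, r).

def push_filter_below_join(node):
    """If a filter sits on top of a join but only references the left
    input, push it below the join so it runs first."""
    if node[0] == "filter" and node[1][0] == "join":
        _, (_, left, right), pred = node
        if pred["side"] == "left":
            return ("join", ("filter", left, pred), right)
    return node

plan = ("filter",
        ("join", ("scan", "orders"), ("scan", "items")),
        {"side": "left", "expr": "region = 'EU'"})

optimized = push_filter_below_join(plan)
```

After the rewrite the filter hangs below the join on its left input; a real planner applies many such rules to a fixpoint and picks among alternatives by cost.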

Cheers,
Aljoscha

+Timo, Explicitly adding Timo
