You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Jark Wu (JIRA)" <ji...@apache.org> on 2019/02/18 08:31:00 UTC

[jira] [Commented] (FLINK-7001) Improve performance of Sliding Time Window with pane optimization

    [ https://issues.apache.org/jira/browse/FLINK-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770861#comment-16770861 ] 

Jark Wu commented on FLINK-7001:
--------------------------------

Hi [~walterddr], [~pgrulich], I'm sorry to join the discussion so late. 

Thanks [~walterddr] for summarizing the discussion and writing a great design doc! Improving windowing performance is definitely a great work for users. I'm very willing to collaborate with you on this issue. Here is some thoughts from my side:

We also have some similar attempts in Blink. We implemented a new window operator for Table/SQL which supports pane optimization and retraction. You can find the operator here: https://github.com/apache/flink/blob/blink/flink-libraries/flink-table/src/main/java/org/apache/flink/table/runtime/window/WindowOperator.java

From our experiences, the pane optimization gains some performance improvements but not obvious. Because the original aggregating sliding window only stores the accumulator in state, not the raw records, so there is not a lot of duplicate data here. And in some situations, such as Count Distinct, the performance gets worse, because of the non-efficient merge method. But I still believe that the pane-based/slice-based optimization is worth to explore.

Btw, maybe we should move the discussion to the mailing thread [1] to encourage more people join in the discussion.

I have also went through the design doc Rong proposed, and will leave some comments there. 

Best,
Jark 

[1]: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Improvement-to-Flink-Window-Operator-with-Slicing-td25750.html

> Improve performance of Sliding Time Window with pane optimization
> -----------------------------------------------------------------
>
>                 Key: FLINK-7001
>                 URL: https://issues.apache.org/jira/browse/FLINK-7001
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataStream API
>            Reporter: Jark Wu
>            Assignee: Jark Wu
>            Priority: Major
>
> Currently, the implementation of time-based sliding windows treats each window individually and replicates records to each window. For a window of 10 minute size that slides by 1 second the data is replicated 600 fold (10 minutes / 1 second). We can optimize sliding window by divide windows into panes (aligned with slide), so that we can avoid record duplication and leverage the checkpoint.
> I will attach a more detail design doc to the issue.
> The following issues are similar to this issue: FLINK-5387, FLINK-6990



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)