You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Elias Levy (JIRA)" <ji...@apache.org> on 2017/04/01 21:23:41 UTC

[jira] [Created] (FLINK-6243) Continuous Joins: True Sliding Window Joins

Elias Levy created FLINK-6243:
---------------------------------

             Summary: Continuous Joins:  True Sliding Window Joins
                 Key: FLINK-6243
                 URL: https://issues.apache.org/jira/browse/FLINK-6243
             Project: Flink
          Issue Type: New Feature
          Components: Streaming
    Affects Versions: 1.1.4
            Reporter: Elias Levy


Flink defines sliding window joins as the join of elements of two streams that share a window of time, where the windows are defined by advancing them forward some amount of time that is less than the window time span.  More generally, such windows are just overlapping hopping windows. 

Other systems, such as Kafka Streams, support a different notion of sliding window joins.  In these systems, two elements of a stream are joined if the absolute time difference between the them is less or equal the time window length.

This alternate notion of sliding window joins has some advantages in some applications over the current implementation.  

Elements to be joined may both fall within multiple overlapping sliding windows, leading them to be joined multiple times, when we only wish them to be joined once.

The implementation need not instantiate window objects to keep track of stream elements, which becomes problematic in the current implementation if the window size is very large and the slide is very small.

It allows for asymmetric time joins.  E.g. join if elements from stream A are no more than X time behind and Y time head of an element from stream B.

It is currently possible to implement a join with these semantics using {{CoProcessFunction}}, but the capability should be a first class feature, such as it is in Kafka Streams.

To perform the join, elements of each stream must be buffered for at least the window time length.  To allow for large window sizes and high volume of elements, the state, possibly optionally, should be buffered such as it can spill to disk (e.g. by using RocksDB).

The same stream may be joined multiple times in a complex topology.  As an optimization, it may be wise to reuse any element buffer among colocated join operators.  Otherwise, there may write amplification and increased state that must be snapshotted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)