Posted to user@flink.apache.org by ba <ma...@baranalytics.co.uk> on 2020/05/20 13:58:30 UTC

Stream Iterative Matching

Hi All,

I'm new to Flink but am trying to write an application that processes data
from internet connected sensors.

My problem is as follows:

-Data arrives in the format: [sensor id] [timestamp in seconds] [sensor
value]
-Data can arrive out of order (between sensor IDs) by up to 5 minutes.
-So a stream of data could be:
[1] [100] [20]
[2] [101] [23]
[1] [105] [31]
[1] [140] [17]

-Each sensor can sometimes split its measurements, and I'm hoping to 'put
them back together' within Flink. For data to be 'put back together' it must
have a timestamp within 90 seconds of the timestamp on the first piece of
data. The data must also be put back together in order; in the example above,
for sensor 1 you could only have combinations of (A) the first reading on
its own (time 100), (B) the first and third items (times 100 and 105), or (C)
the first, third and fourth items (times 100, 105 and 140). The second item is
from a different sensor, so it is not considered in this exercise.

-I would like to write something that tries the different 'sum' combinations
within the 90 second limit and outputs the best 'match' to expected values.
In the example above, let's say the expected sum values are 50 or 100. For the
three combinations I mentioned for sensor 1, the sums would be 20, 51, or 68.
Therefore the 'best' combination is 51, as it is closest to 50 or 100, so it
would output two data items: [1] [100] [20] and [1] [105] [31], with the
last item left in the stream to be matched with any other data points that
arrive later.

I am thinking of some sort of iterative function that does this, but I am not
sure how to emit the values I want while keeping the other values that were
considered (but not used) in the stream.
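
To make the matching rule concrete, here is a rough, framework-free sketch of
it in plain Java (the Reading type, the expected sums of 50 and 100, and the
method names are just placeholders for illustration):

import java.util.ArrayList;
import java.util.List;

// Simple value type for one reading; public fields and a no-arg constructor so
// it could also be used as a Flink POJO later.
class Reading {
    public int sensorId;
    public long timestampSec;
    public double value;

    public Reading() {}

    public Reading(int sensorId, long timestampSec, double value) {
        this.sensorId = sensorId;
        this.timestampSec = timestampSec;
        this.value = value;
    }
}

// Framework-free sketch of the matching rule described above.
public class BestPrefixMatch {

    static final double[] EXPECTED_SUMS = {50, 100}; // placeholder expected values
    static final long MAX_SPAN_SEC = 90;             // limit relative to the first reading

    /** Returns how many of the time-ordered readings form the best-matching prefix. */
    static int bestPrefixLength(List<Reading> orderedReadings) {
        long firstTs = orderedReadings.get(0).timestampSec;
        double runningSum = 0;
        int bestLength = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < orderedReadings.size(); i++) {
            Reading r = orderedReadings.get(i);
            if (r.timestampSec - firstTs > MAX_SPAN_SEC) {
                break; // outside the 90 second limit of the first reading
            }
            runningSum += r.value;
            for (double expected : EXPECTED_SUMS) {
                double distance = Math.abs(runningSum - expected);
                if (distance < bestDistance) {
                    bestDistance = distance;
                    bestLength = i + 1;
                }
            }
        }
        return bestLength; // emit the first bestLength readings, keep the rest
    }

    public static void main(String[] args) {
        List<Reading> sensor1 = new ArrayList<>();
        sensor1.add(new Reading(1, 100, 20));
        sensor1.add(new Reading(1, 105, 31));
        sensor1.add(new Reading(1, 140, 17));
        // Prefix sums are 20, 51 and 68; 51 is closest to 50, so this prints 2.
        System.out.println(bestPrefixLength(sensor1));
    }
}

Running this on the sensor 1 readings above gives a prefix length of 2, i.e. it
would emit the readings at times 100 and 105 and keep the one at time 140.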

Any ideas or help would be really appreciated.

Thanks,
Marc



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Stream Iterative Matching

Posted by Guowei Ma <gu...@gmail.com>.
Hi, Marc
1. I think you should first choose which type of window you want to use
(Tumbling/Sliding/Session). From your description, I think the session
window may not suit your case because there is no gap.
2. >>> how this would work in practice or how to handle the case where
timers fire for data that has already been ejected from the window (as it
has been matched with past data)?
     Do you want to know the lifecycle of an element in the window? The
lifecycle of the window, and of the elements in it, follows from the window
type you choose. For example, an element can be assigned to multiple sliding
windows, and an element evicted from one sliding window can still be
processed by another sliding window.[1]
3. I think you could find some examples in the `WindowTranslationTest`.
4. If these window types do not work for your application, you might need a
customized window (trigger/evictor). However, I think you could make a
simple POC with the built-in window types first (see the wiring sketch below).

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#sliding-windows
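
As a rough Flink 1.10 wiring sketch (not a tested solution; the Reading POJO,
the placeholder SensorSource, the 90 second / 5 second window parameters and
the BestSumWindowFunction are all assumptions for illustration):

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Wiring sketch only: keyed sliding event-time windows over the sensor readings.
public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Reading> readings = env
                .addSource(new SensorSource()) // placeholder for the real sensor source
                .assignTimestampsAndWatermarks(
                        // readings can arrive up to 5 minutes out of order
                        new BoundedOutOfOrdernessTimestampExtractor<Reading>(Time.minutes(5)) {
                            @Override
                            public long extractTimestamp(Reading r) {
                                return r.timestampSec * 1000L; // seconds -> milliseconds
                            }
                        });

        readings
                .keyBy(r -> r.sensorId)
                // each element is assigned to multiple 90 second windows, one every 5 seconds
                .window(SlidingEventTimeWindows.of(Time.seconds(90), Time.seconds(5)))
                // a ProcessWindowFunction applying the matching rule, e.g. the one
                // sketched in the first reply further down the thread
                .process(new BestSumWindowFunction())
                .print();

        env.execute("sensor matching sketch");
    }
}

With sliding windows you would still need to make sure the same reading is not
emitted twice from overlapping windows, which is where the customized
trigger/evictor from point 4 (or per-window state) could come in.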
Best,
Guowei


On Thu, May 21, 2020 at 4:00 PM ba <ma...@baranalytics.co.uk> wrote:

> Hi Guowei,
>
> Thank you for your reply. Are you able to give some detail on how that
> would
> work with the per window state you linked? I'm struggling to see how the
> logic would work.
>
> I guess something like a session window on a keyed stream (keyed by sensor
> ID). Timers would fire 90 seconds after each element is added to the window
> and then be evaluated? I can't quite think how this would work in practice
> or how to handle the case where timers fire for data that has already been
> ejected from the window (as it has been matched with past data)?
>
> If there are any examples showing similar uses of this function that would
> be great?
>
> Any assistance is very much appreciated!
>
> Best,
> Marc
>
>
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>

Re: Stream Iterative Matching

Posted by ba <ma...@baranalytics.co.uk>.
Hi Guowei,

Thank you for your reply. Are you able to give some detail on how that would
work with the per window state you linked? I'm struggling to see how the
logic would work.

I guess something like a session window on a keyed stream (keyed by sensor
ID)? Timers would fire 90 seconds after each element is added to the window,
and the window contents would then be evaluated? I can't quite think how this
would work in practice, or how to handle the case where timers fire for data
that has already been ejected from the window (as it has been matched with
past data).
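
To make the timer idea a bit more concrete, something like the following
untested sketch is what I had in mind (it reuses the placeholder Reading type
and bestPrefixLength helper from my first mail, and would sit after keyBy on
the sensor ID instead of a window):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch of the timer idea above: buffer readings per sensor, evaluate 90
// seconds (event time) after each element, and drop matched readings from state.
public class MatchOnTimerFunction extends KeyedProcessFunction<Integer, Reading, Reading> {

    private transient ListState<Reading> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending-readings", Reading.class));
    }

    @Override
    public void processElement(Reading r, Context ctx, Collector<Reading> out) throws Exception {
        buffer.add(r);
        // Evaluate 90 seconds of event time after this element's timestamp.
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 90_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Reading> out) throws Exception {
        List<Reading> pending = new ArrayList<>();
        for (Reading r : buffer.get()) {
            pending.add(r);
        }
        if (pending.isEmpty()) {
            // Everything was already matched and emitted by an earlier timer,
            // so there is nothing left to do for this firing.
            return;
        }
        pending.sort(Comparator.comparingLong((Reading r) -> r.timestampSec));

        int n = BestPrefixMatch.bestPrefixLength(pending);
        for (int i = 0; i < n; i++) {
            out.collect(pending.get(i));
        }

        // Keep only the unmatched tail so later data (and later timers) can use it.
        List<Reading> remaining = new ArrayList<>(pending.subList(n, pending.size()));
        if (remaining.isEmpty()) {
            buffer.clear();
        } else {
            buffer.update(remaining);
        }
    }
}

But I am not sure whether keyed state like this, or the per-window state you
linked, is the intended way to handle it.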

If there are any examples showing similar uses of this function, that would
be great.

Any assistance is very much appreciated!

Best,
Marc





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Stream Iterative Matching

Posted by Guowei Ma <gu...@gmail.com>.
Hi, Marc

I think the window operator might fulfill your needs. You can find the
detailed description here [1].
In general, I think you could choose a suitable window type and use the
`ProcessWindowFunction` to emit the elements that match the best sum.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#using-per-window-state-in-processwindowfunction
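
For example, a `ProcessWindowFunction` applying the matching rule could look
roughly like the following sketch (the Reading type and the bestPrefixLength
helper are the placeholders from the original mail; remembering which readings
were already emitted is not shown):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Sketch only: evaluates the "best sum" rule once per keyed window firing.
public class BestSumWindowFunction
        extends ProcessWindowFunction<Reading, Reading, Integer, TimeWindow> {

    @Override
    public void process(Integer sensorId,
                        Context context,
                        Iterable<Reading> elements,
                        Collector<Reading> out) {
        // Window contents are not ordered, so sort by event time first.
        List<Reading> ordered = new ArrayList<>();
        for (Reading r : elements) {
            ordered.add(r);
        }
        ordered.sort(Comparator.comparingLong((Reading r) -> r.timestampSec));

        // Emit the prefix that best matches the expected sums; the unmatched
        // tail stays available to later windows that contain it.
        int n = BestPrefixMatch.bestPrefixLength(ordered);
        for (int i = 0; i < n; i++) {
            out.collect(ordered.get(i));
        }
    }
}

Per-window state (the link above) could then be used to remember which readings
were already emitted, so that overlapping windows do not emit them twice.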
Best,
Guowei


On Wed, May 20, 2020 at 9:58 PM ba <ma...@baranalytics.co.uk> wrote:

> Hi All,
>
> I'm new to Flink but am trying to write an application that processes data
> from internet connected sensors.
>
> My problem is as follows:
>
> -Data arrives in the format: [sensor id] [timestamp in seconds] [sensor
> value]
> -Data can arrive out of order (between sensor IDs) by up to 5 minutes.
> -So a stream of data could be:
> [1] [100] [20]
> [2] [101] [23]
> [1] [105] [31]
> [1] [140] [17]
>
> -Each sensor can sometimes split its measurements, and I'm hoping to 'put
> them back together' within Flink. For data to be 'put back together' it
> must
> have a timestamp within 90 seconds of the timestamp on the first piece of
> data. The data must also be put back together in order, in the example
> above
> for sensor 1 you could only have combinations of (A) the first reading on
> its own (time 100), (B) the first and third item (time 100 and 105) or (C)
> the first, third and fourth item (time 100, 105, 140). The second item is a
> different sensor so not considered in this exercise.
>
> -I would like to write something that tries different 'sum' combinations
> within the 90 second limit and outputs the best 'match' to expected values.
> In the example above, let's say the expected sum values are 50 or 100. Of the
> three combinations I mentioned for sensor 1, the sum would be 20, 51, or
> 68.
> Therefore the 'best' combination is 51 as it is closest to 50 or 100, so it
> would output two data items: [1] [100] [20] and [1] [105] [31], with the
> last item left in the stream and matched with any other data points that
> arrive after.
>
> I am thinking some sort of iterative function that does this, but am not
> sure how to emit the values I want and keep other values that were
> considered (but not used) in the stream.
>
> Any ideas or help would be really appreciated.
>
> Thanks,
> Marc
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>