You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Gabriele Di Bernardo <ga...@me.com> on 2017/07/23 11:28:00 UTC

Find the running median from a data stream

Hi guys,

I want to keep track of the running median of a keyed data stream. I was considering to apply a RichMapFunction to the stream and store in a ValueState object two heaps (PriorityQueue) in order to find the running median. However, I am not really sure if this is the best approach performance-wise. Do you have some suggestions for me or have you ever done something similar?

Thank you so much!

Best,


Gabriele

Re: Find the running median from a data stream

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Gabriele,

I don't think you can compute the exact running median on a stream.
This would require to collect all elements of the stream so you would
basically need to put the complete stream into the ValueState.
Even if the state is backed by RocksDB, the state for a specific key needs
to fit on the heap when it is accessed. Moreover, the ValueState would be
completely serialized and deserialized for every access which is of course
expensive if the whole stream is materialized.

So, computing an exact running median requires a lot of storage and is
expensive to compute.
You might be able to implement an approximate median using adaptive
histograms.

Best, Fabian

2017-07-23 13:28 GMT+02:00 Gabriele Di Bernardo <ga...@me.com>
:

> Hi guys,
>
> I want to keep track of the running median of a keyed data stream. I was
> considering to apply a RichMapFunction to the stream and store in a
> ValueState object two heaps (PriorityQueue) in order to find the running
> median. However, I am not really sure if this is the best approach
> performance-wise. Do you have some suggestions for me or have you ever done
> something similar?
>
> Thank you so much!
>
> Best,
>
>
> Gabriele