You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Shivam Sharma <28...@gmail.com> on 2017/11/23 19:23:21 UTC

How & Where does flink stores data for aggregations.

Hi All,

I have a small question regarding where does Flink stores data for doing
window aggregations. Lets say I am running following query on Flink table:

SELECT name, count(*)
FROM testTable
GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE), name

So, If I understand above query properly so it must be saving data for 1
minute somewhere to find aggregations. If Flink is persisting this in
memory then my concern is if I increase interval to a DAY or more then it
will store the complete data for interval which can cross memory. If
persistence is disk then latency will be there.

Basically how do we solve such kind of use-cases using FLINK where
aggregation interval are quite high.

Thanks in advance

-- 
Shivam Sharma

Re: How & Where does flink stores data for aggregations.

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

Flink will have to maintain state of the defined aggregations per each window and key (the more names you have, the bigger the state). Flinkā€™s state backend will be used for that (for example memory or rocksdb).

However in most cases state will be small and not dependent on the length of the window, but only on number of keys. In your case per each key (name) only one counter will be maintained. Same applies to sums and averages (averages will use counter and sum).

There is no magic way to deal with too large state. Either add more RAM to the cluster, fallback to using disks or rewrite your query/application so it will not need that large state.

Piotrek

> On 23 Nov 2017, at 20:23, Shivam Sharma <28...@gmail.com> wrote:
> 
> Hi All,
> 
> I have a small question regarding where does Flink stores data for doing window aggregations. Lets say I am running following query on Flink table:
> 
> SELECT name, count(*)
> FROM testTable
> GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE), name
> 
> So, If I understand above query properly so it must be saving data for 1 minute somewhere to find aggregations. If Flink is persisting this in memory then my concern is if I increase interval to a DAY or more then it will store the complete data for interval which can cross memory. If persistence is disk then latency will be there.
> 
> Basically how do we solve such kind of use-cases using FLINK where aggregation interval are quite high.
> 
> Thanks in advance
> 
> -- 
> Shivam Sharma
>