You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Richard Moorhead <ri...@gmail.com> on 2020/02/13 19:00:35 UTC

group by optimizations with sorted input

In batch mode, if input is sorted prior to a group by operation; does flink
forward the aggregate data early? Is there a way to prevent grouping
operations from buffering all data in a GBK operation in batch mode?

Re: group by optimizations with sorted input

Posted by Robert Metzger <rm...@apache.org>.

I assume you are using the DataSet API.
There, you can do a combinable group reduce:
https://ci.apache.org/projects/flink/flink-docs-master/dev/batch/dataset_transformations.html#combinable-groupreducefunctions

The combine() method will be executed on the sender side, reducing the
amount of data to spill on disk. This only works if your data allows such
early aggregations.
This is similar to a combiner in Hadoop:
https://www.quora.com/What-is-a-Combiner-in-Hadoop

On Thu, Feb 13, 2020 at 8:01 PM Richard Moorhead <ri...@gmail.com>
wrote:

> In batch mode, if input is sorted prior to a group by operation; does
> flink forward the aggregate data early? Is there a way to prevent grouping
> operations from buffering all data in a GBK operation in batch mode?
>