You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Patrick Brunmayr <ja...@kpibench.com> on 2017/02/23 17:16:36 UTC

Difference between partition and groupBy

What is the basic difference between partitioning datasets by key or
grouping them by key ?

Does it make a difference in terms of paralellism ?

Thx

Re: Difference between partition and groupBy

Posted by Patrick Brunmayr <ja...@kpibench.com>.

Thank you for that answer. Helped me a lot

2017-02-23 22:10 GMT+01:00 Fabian Hueske <fh...@gmail.com>:

> Hi Patrick,
>
> as Robert said, partitionBy() shuffles the data such that all records with
> the same key end up in the same partition. That's all it does.
> groupBy() also prepares the data in each partition to be processed per
> key. For example, if you run a groupReduce after a groupBy(), the data is
> first shuffled (just like partitionBy()) and then in each partition sorted
> to organize it by key. So groupBy() does more than partitionBy() because it
> organizes the data in each partition to be processed by key.
>
> Moreover, groupBy() alone is not a complete operation but just "prepares"
> a following operation. It must be called with a reduce or combine operator.
> In contrast partitionBy() is by itself complete.
> So the difference between partitionBy() and groupBy() is more than just an
> API thing.
>
> Hope that helps,
> Fabian
>
> 2017-02-23 21:51 GMT+01:00 Robert Metzger <rm...@apache.org>:
>
>> Hi Patrick,
>>
>> I think (but I'm not 100% sure) its not a difference in what the engine
>> does in the end, its more of an API thing. When you are grouping, you can
>> perform operations such as reducing afterwards.
>> On a partitioned dataset, you can do stuff like processing each partition
>> in parallel, or sort them.
>>
>> The parallelism is independent of the partitioning or grouping. Usually
>> there are more partitions than parallel instances, so each instance will
>> take care of multiple partitions.
>>
>>
>>
>> On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <ja...@kpibench.com>
>> wrote:
>>
>>> What is the basic difference between partitioning datasets by key or
>>> grouping them by key ?
>>>
>>> Does it make a difference in terms of paralellism ?
>>>
>>> Thx
>>>
>>
>>
>

Re: Difference between partition and groupBy

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Patrick,

as Robert said, partitionBy() shuffles the data such that all records with
the same key end up in the same partition. That's all it does.
groupBy() also prepares the data in each partition to be processed per key.
For example, if you run a groupReduce after a groupBy(), the data is first
shuffled (just like partitionBy()) and then in each partition sorted to
organize it by key. So groupBy() does more than partitionBy() because it
organizes the data in each partition to be processed by key.

Moreover, groupBy() alone is not a complete operation but just "prepares" a
following operation. It must be called with a reduce or combine operator.
In contrast partitionBy() is by itself complete.
So the difference between partitionBy() and groupBy() is more than just an
API thing.

Hope that helps,
Fabian

2017-02-23 21:51 GMT+01:00 Robert Metzger <rm...@apache.org>:

> Hi Patrick,
>
> I think (but I'm not 100% sure) its not a difference in what the engine
> does in the end, its more of an API thing. When you are grouping, you can
> perform operations such as reducing afterwards.
> On a partitioned dataset, you can do stuff like processing each partition
> in parallel, or sort them.
>
> The parallelism is independent of the partitioning or grouping. Usually
> there are more partitions than parallel instances, so each instance will
> take care of multiple partitions.
>
>
>
> On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <ja...@kpibench.com>
> wrote:
>
>> What is the basic difference between partitioning datasets by key or
>> grouping them by key ?
>>
>> Does it make a difference in terms of paralellism ?
>>
>> Thx
>>
>
>

Re: Difference between partition and groupBy

Posted by Robert Metzger <rm...@apache.org>.

Hi Patrick,

I think (but I'm not 100% sure) its not a difference in what the engine
does in the end, its more of an API thing. When you are grouping, you can
perform operations such as reducing afterwards.
On a partitioned dataset, you can do stuff like processing each partition
in parallel, or sort them.

The parallelism is independent of the partitioning or grouping. Usually
there are more partitions than parallel instances, so each instance will
take care of multiple partitions.

On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <ja...@kpibench.com> wrote:

> What is the basic difference between partitioning datasets by key or
> grouping them by key ?
>
> Does it make a difference in terms of paralellism ?
>
> Thx
>