You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by "Newport, Billy" <Bi...@gs.com> on 2017/02/07 20:52:56 UTC

Cogroup hints/performance

We have a cogroup where sometimes we cogroup like this:

Dataset z = larger.coGroup(small).where...

The strategy is printed as hash on key and a sort asc on the other key. Which is which? Naively, we'd want to hash larger and sort the small? Or is that wrong?

What factors would impact the performance of the cogroup? We use cogroup to calculate a new set of records for a key from the previous calculated set with some modifications from (small). We're temporally milestoning records using cogroup btw, that's the use case.


Thanks



Billy Newport
Data Architecture, Goldman, Sachs & Co.
30 Hudson | 37th Floor | Jersey City, NJ
Tel:  +1 (212) 8557773 |  Cell:  +1 (507) 254-0134
Email: billy.newport@gs.com<ma...@gs.com>, KD2DKQ

Re: Cogroup hints/performance

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Billy,

A CoGroup does not have any freedom in its execution strategy.
It requires that both inputs are partitioned on the grouping keys and are
then performs a local sort-merge join, i.e, both inputs are sorted.
Existing partitioning or sort orders can be reused.

Since there is only one execution strategy, there is not much you can do to
optimize it without changing the operator.
If you can convert the CoGroup into an (Inner)Join or OuterJoin you have
more degrees of freedom to optimize. In this case, Flink might be able to
use a Broadcast/Forward shipping strategy (ideally keeping the large input
local and broadcasting the small one) and using a HashJoin (small input
being the build side, large input being the probe side).
Whether you can use a join depends on your application semantics. Maybe
grouping all the smaller input and collecting all records for a key into a
single record can help to switch from CoGroup to Join.

Hope this helps,
Fabian

2017-02-07 21:52 GMT+01:00 Newport, Billy <Bi...@gs.com>:

> We have a cogroup where sometimes we cogroup like this:
>
>
>
> Dataset z = larger.coGroup(small).where…
>
>
>
> The strategy is printed as hash on key and a sort asc on the other key.
> Which is which? Naively, we’d want to hash larger and sort the small? Or is
> that wrong?
>
>
>
> What factors would impact the performance of the cogroup? We use cogroup
> to calculate a new set of records for a key from the previous calculated
> set with some modifications from (small). We’re temporally milestoning
> records using cogroup btw, that’s the use case.
>
>
>
>
>
> Thanks
>
>
>
>
>
>
>
> *Billy Newport*
>
> Data Architecture, Goldman, Sachs & Co.
> 30 Hudson | 37th Floor | Jersey City, NJ
>
> Tel:  +1 (212) 8557773 <(212)%20855-7773> |  Cell:  +1 (507) 254-0134
> <(507)%20254-0134>
> Email: billy.newport@gs.com <ed...@gs.com>, KD2DKQ
>
>
>