Posted to dev@flink.apache.org by Andra Lungu <lu...@gmail.com> on 2015/07/22 10:56:41 UTC

Theoretical complexity of a coGroup

Hi everyone,

I am not 100% sure about this one, so I thought that I could set my
thoughts straight via the mailing list.

Here's the use case. You coGroup a data set of vertices with a data set of
edges. That gives you a complexity of O(|V| * |E|), where |V| is the total
number of vertices and |E| is the total number of edges. However, you also
do a vertex value update in the coGroup function. Say you intersect some
hash sets and assign the result to the vertices. Does that give you an
extra O(|V|) [making the total O(|V|^2 * |E|)], or is that covered by the
initial O(|V| * |E|)? I guess this has to do with the way a coGroup is
internally built...
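
For concreteness, here is roughly the shape of what I'm doing (the field
names, types and the intersection logic below are only a toy sketch, not my
actual code):

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

import java.util.Arrays;
import java.util.HashSet;

public class CoGroupSketch {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // toy inputs: (vertexId, neighbour set) and (src, trg, weight)
    DataSet<Tuple2<Long, HashSet<Long>>> vertices = env.fromElements(
        new Tuple2<>(1L, new HashSet<>(Arrays.asList(2L, 3L))),
        new Tuple2<>(2L, new HashSet<>(Arrays.asList(1L))));
    DataSet<Tuple3<Long, Long, Double>> edges = env.fromElements(
        new Tuple3<>(1L, 2L, 1.0),
        new Tuple3<>(1L, 3L, 1.0));

    DataSet<Tuple2<Long, HashSet<Long>>> updated = vertices
        .coGroup(edges)
        .where(0)        // vertex id
        .equalTo(0)      // edge source id
        .with(new CoGroupFunction<Tuple2<Long, HashSet<Long>>,
                                  Tuple3<Long, Long, Double>,
                                  Tuple2<Long, HashSet<Long>>>() {
          @Override
          public void coGroup(Iterable<Tuple2<Long, HashSet<Long>>> vertexGroup,
                              Iterable<Tuple3<Long, Long, Double>> edgeGroup,
                              Collector<Tuple2<Long, HashSet<Long>>> out) {
            for (Tuple2<Long, HashSet<Long>> v : vertexGroup) {
              // "intersect some hash sets": keep only the neighbours that
              // actually appear as targets of this vertex's out-edges
              HashSet<Long> targets = new HashSet<>();
              for (Tuple3<Long, Long, Double> e : edgeGroup) {
                targets.add(e.f1);
              }
              HashSet<Long> value = new HashSet<>(v.f1);
              value.retainAll(targets);   // the per-vertex value update
              out.collect(new Tuple2<>(v.f0, value));
            }
          }
        });

    updated.print();
  }
}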

Thanks a lot!
Andra

Re: Theoretical complexity of a coGroup

Posted by Kostas Tzoumas <kt...@apache.org>.
Hi Andra,

CoGroup is not implemented as a Cartesian product, so O(V*E) is not a very
accurate approximation.

All of this depends on what you count. Let's assume single-node execution
and that everything fits in memory, and let's count comparisons and UDF
calls on groups.

Then, coGroup sorts both inputs, so the sorting phase costs
O(VlogV + ElogE) comparisons. The merge phase is O(V + E), so sorting
dominates. During the merge you apply some operation to each group (say we
have G groups). If that operation only depends on the group size, and we
assume a constant group size that is very small compared to V and E (if
this is not true, then the grouping does not really make sense), you end up
with O(VlogV + ElogE + V + E + G). How many groups do you have? Assuming
that you coGroup on, say, the "from" vertex of the edge, the worst case is
V groups, so your worst-case complexity is
O(VlogV + ElogE + V + E + V) = O(VlogV + ElogE).
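
To make the counting concrete, here is a toy single-threaded sketch of a
sort-merge coGroup (just an illustration of the cost model, not Flink's
actual runtime code): two sorts, then one linear merge pass that hands each
key group to the UDF.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy single-node sort-merge coGroup over vertex ids and (src, trg) edges.
// Cost model: two sorts (O(VlogV + ElogE) comparisons) followed by one
// linear merge pass (O(V + E)) that calls the UDF once per key group.
public class SortMergeCoGroupSketch {

  interface GroupUdf {
    void onGroup(List<Long> vertexGroup, List<long[]> edgeGroup);
  }

  static void coGroup(List<Long> vertices, List<long[]> edges, GroupUdf udf) {
    vertices.sort(Comparator.naturalOrder());                  // O(VlogV)
    edges.sort(Comparator.comparingLong((long[] e) -> e[0]));  // O(ElogE)

    int i = 0, j = 0;
    while (i < vertices.size() || j < edges.size()) {          // O(V + E) merge
      // smallest key still present on either side
      long key = i < vertices.size()
          ? (j < edges.size() ? Math.min(vertices.get(i), edges.get(j)[0])
                              : vertices.get(i))
          : edges.get(j)[0];

      List<Long> vGroup = new ArrayList<>();
      while (i < vertices.size() && vertices.get(i) == key) {
        vGroup.add(vertices.get(i++));
      }
      List<long[]> eGroup = new ArrayList<>();
      while (j < edges.size() && edges.get(j)[0] == key) {
        eGroup.add(edges.get(j++));
      }

      udf.onGroup(vGroup, eGroup);  // at most V + E groups in total
    }
  }

  public static void main(String[] args) {
    List<Long> vertices = new ArrayList<>(Arrays.asList(2L, 1L, 3L));
    List<long[]> edges = new ArrayList<>(Arrays.asList(
        new long[]{1, 2}, new long[]{1, 3}, new long[]{3, 1}));
    coGroup(vertices, edges, (vGroup, eGroup) ->
        System.out.println("key group " + vGroup + " has "
            + eGroup.size() + " matching edges"));
  }
}

The point is that, after the two sorts, every vertex and edge record is
consumed exactly once by the merge, so building the groups is linear and the
sorts dominate.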

In a parallel scenario, you might want to count the number of records
transferred over the network to measure complexity.

And of course, all of this is, in general, not a good predictor of
performance.

Hopefully I didn't screw this up; it's been a while since I thought about
complexity :-D

Kostas


On Wed, Jul 22, 2015 at 10:56 AM, Andra Lungu <lu...@gmail.com> wrote:

> Hi everyone,
>
> I am not 100% sure about this one, so I thought that I could set my
> thoughts straight via the mailing list.
>
> Here's the use case. You coGroup a data set of vertices with a data set of
> edges. That gives you a complexity of O(|V| * |E|), where |V| is the total
> number of vertices and |E| is the total number of edges. However, you also
> do a vertex value update in the coGroup function. Say you intersect some
> hash sets and assign the result to the vertices. Does that give you an
> extra O(|V|) [making the total O(|V|^2 * |E|)], or is that covered by the
> initial O(|V| * |E|)? I guess this has to do with the way a coGroup is
> internally built...
>
> Thanks a lot!
> Andra
>