Posted to dev@flink.apache.org by Greg Hogan <co...@greghogan.com> on 2016/02/02 21:54:57 UTC

Limitations on grouped ReduceFunction

If a user modifies keyed fields of a grouped reduce during a combine then
the reduce will receive incorrect groupings. For example, a useless
modification to word count:

  public WC reduce(WC in1, WC in2) {
    return new WC(in1.word + " " + in2.word, in1.count + in2.count);
  }
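
For context, a minimal sketch of the setup I have in mind: the usual word
count POJO grouped on its word field. The WC class, its fields, and the
"words" DataSet name here are my assumptions about the surrounding code,
not something fixed by the snippet above.

  public static class WC {
    public String word;
    public long count;

    public WC() {}

    public WC(String word, long count) {
      this.word = word;
      this.count = count;
    }
  }

  // words: a DataSet<WC> built upstream (hypothetical name)
  DataSet<WC> counts = words
      // group on the POJO field "word", i.e. "word" is the key
      .groupBy("word")
      // the combinable reduce; modifying the word field here is the
      // problematic case described above
      .reduce(new ReduceFunction<WC>() {
        @Override
        public WC reduce(WC in1, WC in2) {
          return new WC(in1.word + " " + in2.word, in1.count + in2.count);
        }
      });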

I don't see an efficient means to prevent this. Is this limitation worth
documenting, or can we safely assume that no one will ever attempt this?
MapReduce also has this limitation, and Spark gets around this by
separating keys and values and only presenting values to reduce.
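
To be clear, the variant that does not run into this keeps the keyed field
untouched and only aggregates the value, e.g.:

  public WC reduce(WC in1, WC in2) {
    // within a group the keys are identical, so carry one through unchanged;
    // the combine phase then cannot alter the grouping
    return new WC(in1.word, in1.count + in2.count);
  }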

"Reduce on Grouped DataSet: A Reduce transformation that is applied on a
grouped DataSet reduces each group to a single element using a user-defined
reduce function. For each group of input elements, a reduce function
successively combines pairs of elements into one element until only a
single element for each group remains."

Greg

Re: Limitations on grouped ReduceFunction

Posted by Stephan Ewen <ew...@gmail.com>.
For now, I think it is worth documenting and leaving it as it is.

A while back we thought about adding a Static Code Analysis rule to find
such cases and create a warning. For Reduce, that is quite straightforward;
for GroupReduce, it is quite tricky...

On 02.02.2016 at 21:55, "Greg Hogan" <co...@greghogan.com> wrote:

> If a user modifies keyed fields of a grouped reduce during a combine then
> the reduce will receive incorrect groupings. For example, a useless
> modification to word count:
>
>   public WC reduce(WC in1, WC in2) {
>     return new WC(in1.word + " " + in2.word, in1.count + in2.count);
>   }
>
> I don't see an efficient means to prevent this. Is this limitation worth
> documenting, or can we safely assume that no one will ever attempt this?
> MapReduce also has this limitation, and Spark gets around this by
> separating keys and values and only presenting values to reduce.
>
> "Reduce on Grouped DataSet: A Reduce transformation that is applied on a
> grouped DataSet reduces each group to a single element using a user-defined
> reduce function. For each group of input elements, a reduce function
> successively combines pairs of elements into one element until only a
> single element for each group remains."
>
> Greg
>