Posted to user@spark.apache.org by jelgh <jo...@gmail.com> on 2014/11/18 13:59:23 UTC

ReduceByKey but with different functions depending on key

Hello everyone,

I'm new to Spark and I have the following problem:

I have a large JavaRDD<MyClass> collection, which I group by creating a
hash code from some fields in MyClass:

JavaRDD<MyClass> collection = ...;
JavaPairRDD<Integer, Iterable<MyClass>> grouped = collection.groupBy(...);
// the group function just creates a hash code from some fields in MyClass

Now I want to reduce the variable grouped, but with a different function
depending on the key in the JavaPairRDD: basically a reduceByKey with
multiple functions.

The only solution I've come up with is filtering grouped once per reduce
function and applying that function to the filtered subset. This feels
kinda hackish, though.
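
Roughly, what I'm doing now looks like this (keys 0 and 1 and the merge
helpers are just placeholders, and it assumes the usual imports from
org.apache.spark.api.java and scala.Tuple2):

// hashOf, mergeOneWay and mergeAnotherWay are hypothetical helpers
JavaPairRDD<Integer, MyClass> pairs =
    collection.mapToPair(c -> new Tuple2<>(hashOf(c), c));

JavaPairRDD<Integer, MyClass> reducedA = pairs
    .filter(t -> t._1() == 0)                      // keep only key 0
    .reduceByKey((a, b) -> mergeOneWay(a, b));

JavaPairRDD<Integer, MyClass> reducedB = pairs
    .filter(t -> t._1() == 1)                      // keep only key 1
    .reduceByKey((a, b) -> mergeAnotherWay(a, b));

JavaPairRDD<Integer, MyClass> result = reducedA.union(reducedB);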

Is there a better way? 

Best regards,
Johannes



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: ReduceByKey but with different functions depending on key

Posted by Debasish Das <de...@gmail.com>.
groupByKey does not run a combiner, so be careful about performance:
groupByKey shuffles every record, even those that could have been combined
locally.

reduceByKey and aggregateByKey do run a combiner. If you want a separate
function for each key, you can build a key-to-closure map, broadcast it, and
look the right function up inside reduceByKey, provided you have access to
the key in reduceByKey/aggregateByKey.

I have not needed to access the key in reduceByKey/aggregateByKey yet, but
there should be a way...
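
For example, a sketch of the key-to-closure idea; the keys, the merge
helpers, and the key-tagging step are my own illustration (the tagging is
needed because the function passed to reduceByKey never sees the key
itself):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

// Broadcast a key -> reduce-function map (sc is the JavaSparkContext).
Map<Integer, Function2<MyClass, MyClass, MyClass>> fns = new HashMap<>();
fns.put(0, (a, b) -> mergeOneWay(a, b));       // hypothetical merge for key 0
fns.put(1, (a, b) -> mergeAnotherWay(a, b));   // hypothetical merge for key 1
final Broadcast<Map<Integer, Function2<MyClass, MyClass, MyClass>>> fnsBc =
    sc.broadcast(fns);

// Tag each value with its key so the reducer can look up the right function.
JavaPairRDD<Integer, Tuple2<Integer, MyClass>> keyed =
    pairs.mapToPair(t -> new Tuple2<Integer, Tuple2<Integer, MyClass>>(t._1(), t));

// The looked-up function merges the two values; the key is carried through.
JavaPairRDD<Integer, Tuple2<Integer, MyClass>> reduced =
    keyed.reduceByKey((a, b) -> new Tuple2<>(a._1(),
        fnsBc.value().get(a._1()).call(a._2(), b._2())));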


Re: ReduceByKey but with different functions depending on key

Posted by Yanbo <ya...@gmail.com>.
First use groupByKey(); you get a pair RDD of (key: K, value: Iterable[V]).
Then use map() on this RDD with a function that performs different
operations depending on the key, which it receives as a parameter.
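
In Java, that could look roughly like this (the key test and the two reduce
helpers are placeholders):

// Branch on the key inside a single pass over the groups.
JavaPairRDD<Integer, MyClass> result = grouped.mapToPair(t -> {
    Integer key = t._1();
    Iterable<MyClass> values = t._2();
    // reduceOneWay / reduceAnotherWay are hypothetical folds over one group
    MyClass reduced = key == 0 ? reduceOneWay(values) : reduceAnotherWay(values);
    return new Tuple2<>(key, reduced);
});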




Re: ReduceByKey but with different functions depending on key

Posted by lordjoe <lo...@gmail.com>.
Map each (key, value) pair into (key, Tuple2<key, value>) and process that.
Also, ask the Spark maintainers for a version of the keyed operations where
the key is passed in as an argument; I run into these cases all the time.

import java.io.Serializable;
import javax.annotation.Nonnull;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

    /**
     * Map each (key, value) pair to (key, (key, value)) to ensure subsequent
     * processing has access to both key and value.
     * @param inp input pair RDD
     * @param <K> key type
     * @param <V> value type
     * @return output where the value carries both key and value
     */
    @Nonnull
    public static <K extends Serializable, V extends Serializable>
    JavaPairRDD<K, Tuple2<K, V>> toKeyedTuples(@Nonnull final JavaPairRDD<K, V> inp) {
        return inp.mapToPair(new PairFunction<Tuple2<K, V>, K, Tuple2<K, V>>() {
            @Override
            public Tuple2<K, Tuple2<K, V>> call(final Tuple2<K, V> t) throws Exception {
                return new Tuple2<K, Tuple2<K, V>>(t._1(), new Tuple2<K, V>(t._1(), t._2()));
            }
        });
    }
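
Once the key rides along in the value, the function passed to reduceByKey
can dispatch on it. A usage sketch (chooseMerge is a hypothetical per-key
dispatch, and MyClass is assumed to be Serializable):

JavaPairRDD<Integer, Tuple2<Integer, MyClass>> keyed = toKeyedTuples(pairs);
JavaPairRDD<Integer, Tuple2<Integer, MyClass>> reduced =
    keyed.reduceByKey((a, b) -> {
        Integer key = a._1();   // the key travels inside the value
        return new Tuple2<>(key, chooseMerge(key, a._2(), b._2()));
    });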



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177p19198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
