Posted to user@spark.apache.org by qihuagao <qi...@icloud.com> on 2017/07/19 12:50:21 UTC

about aggregateByKey of pairrdd.

The Java pair RDD API has aggregateByKey, which can avoid a full shuffle by
combining values within each partition first, so it has impressive
performance. The aggregateByKey function requires 3 parameters (see the
sketch below the list):
# An initial 'zero' value that does not affect the total values to be
collected.
# A combining function accepting two parameters. The second parameter is
merged into the first. This function combines/merges values within
a partition.
# A merging function accepting two parameters. In this case the two partial
results are merged into one. This step merges values across partitions.
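
For example, this is the kind of usage I have in mind (a minimal sketch only;
the String/Integer types and the per-key sum are just for illustration):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AggregateByKeySketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("aggregateByKey sketch").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaPairRDD<String, Integer> pairs = jsc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

        JavaPairRDD<String, Integer> sums = pairs.aggregateByKey(
                0,                    // 1) zero value, does not affect the result
                (acc, v) -> acc + v,  // 2) merge a value into the accumulator within a partition
                (a, b) -> a + b);     // 3) merge accumulators across partitions

        for (Tuple2<String, Integer> t : sums.collect()) {
            System.out.println(t._1() + " -> " + t._2());
        }
        jsc.stop();
    }
}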

With DataFrames, I noticed groupByKey, which can do the same job as
aggregateByKey, but without the combine/merge functions, so I assumed it
triggers a full shuffle. Is this true? If true, should we have a function
with performance like aggregateByKey for DataFrames?
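
Concretely, what I am comparing on the DataFrame/Dataset side looks roughly
like this (the "key"/"value" columns and the input path are made up for this
sketch; I am not sure which of the two styles below gets the same map-side
partial aggregation I get from aggregateByKey):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class DatasetGroupingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("groupByKey vs groupBy/agg").master("local[*]").getOrCreate();

        // Assumed input with columns "key" (string) and "value" (long).
        Dataset<Row> df = spark.read().json("input.json");

        // Style 1: typed groupByKey + mapGroups. All rows for a key are pulled
        // together, and I see no place to plug in a per-partition combine function.
        Dataset<Long> perKeySums = df
                .groupByKey(
                        (MapFunction<Row, String>) row -> row.getString(row.fieldIndex("key")),
                        Encoders.STRING())
                .mapGroups((MapGroupsFunction<String, Row, Long>) (key, rows) -> {
                    long total = 0;
                    while (rows.hasNext()) {
                        total += rows.next().<Long>getAs("value");
                    }
                    return total;
                }, Encoders.LONG());

        // Style 2: untyped groupBy + agg with a built-in aggregate function.
        Dataset<Row> perKeySums2 = df.groupBy(col("key")).agg(sum(col("value")));

        perKeySums.show();
        perKeySums2.show();
        spark.stop();
    }
}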

Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-aggregateByKey-of-pairrdd-tp28878.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org