Posted to user@spark.apache.org by Pengcheng YIN <pc...@gmail.com> on 2014/12/01 09:05:58 UTC
merge elements in a Spark RDD under custom condition
Hi Pro,
I want to merge elements in a Spark RDD when two elements satisfy a certain condition.
Suppose there is an RDD[Seq[Int]] where some of the Seq[Int] contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD and store the result in a new RDD.
For example, suppose RDD[Seq[Int]] = [[1,2,3], [2,4,5], [1,2], [7,8,9]], the result should be [[1,2,3,4,5], [7,8,9]].
Since the RDD[Seq[Int]] is very large, I cannot collect it and do the merging in the driver program. Is it possible to get this done with distributed operations such as groupBy/map/reduce?
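One common way to frame this is as a connected-components problem: treat each Seq[Int] as a set of graph nodes linked by edges (e.g. an edge from every element to the head of its Seq); overlapping Seqs then end up in the same component. In a distributed setting this maps onto GraphX's connectedComponents, followed by grouping elements by component id. As a hedged illustration of the underlying merge logic only (plain Scala union-find on the example data, no Spark; the object and method names are mine, not from any Spark API):

```scala
object MergeOverlapping {
  // Merge all Seqs that share at least one element, using union-find.
  def merge(seqs: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    val parent = scala.collection.mutable.Map[Int, Int]()

    // find with path compression; unseen elements are their own parent
    def find(x: Int): Int = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else { val r = find(p); parent(x) = r; r }
    }

    def union(a: Int, b: Int): Unit = parent(find(a)) = find(b)

    // Link every element of a Seq to that Seq's first element,
    // so each Seq collapses into one component.
    for (s <- seqs; x <- s.tail) union(x, s.head)

    // Group all distinct elements by their component representative.
    seqs.flatten.distinct
      .groupBy(find)
      .values
      .map(_.sorted)
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Seq(1, 2, 3), Seq(2, 4, 5), Seq(1, 2), Seq(7, 8, 9))
    // prints the merged groups, ordered by smallest element
    println(merge(input).sortBy(_.head))
  }
}
```

In the distributed version the same idea would be expressed as edges (element, seqHead) in an RDD, with GraphX's connectedComponents replacing the in-memory union-find; the final groupBy on component id is an ordinary RDD groupByKey.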
Thanks in advance,
Pengcheng