Posted to user@spark.apache.org by Pengcheng YIN <pc...@gmail.com> on 2014/12/01 09:05:58 UTC

merge elements in a Spark RDD under custom condition

Hi Pro,
I want to merge elements in a Spark RDD when the two elements satisfy certain condition

Suppose there is an RDD[Seq[Int]], where some of the Seq[Int] in this RDD contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD and store the result in a new RDD.

For example, suppose RDD[Seq[Int]] = [[1,2,3], [2,4,5], [1,2], [7,8,9]], the result should be [[1,2,3,4,5], [7,8,9]].

Since the RDD[Seq[Int]] is very large, I cannot do this in the driver program. Is it possible to get it done using distributed groupBy/map/reduce, etc.?
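[Editor's note: merging transitively overlapping sets is the connected-components problem, so one possible approach is GraphX, which ships with Spark. The sketch below is an assumption of the editor, not from the original post: it treats each Seq[Int] as a star of edges from its first element, runs connectedComponents, and regroups elements by component id. The function name mergeOverlapping is hypothetical.]

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Sketch: two Seq[Int] that share any element end up in the same
// connected component, so transitive overlaps are merged as well.
def mergeOverlapping(sc: SparkContext, seqs: RDD[Seq[Int]]): RDD[Seq[Int]] = {
  // Link every element of a Seq to that Seq's first element; any two
  // sequences sharing an element then belong to one component.
  val edges: RDD[Edge[Int]] =
    seqs.flatMap(s => s.map(x => Edge(s.head.toLong, x.toLong, 1)))

  val graph = Graph.fromEdges(edges, defaultValue = 0)

  // connectedComponents labels each vertex with the smallest vertex id
  // in its component; group elements by that label.
  graph.connectedComponents()
       .vertices                               // RDD[(elementId, componentId)]
       .map { case (elem, comp) => (comp, elem.toInt) }
       .groupByKey()
       .map { case (_, elems) => elems.toSeq.distinct.sorted }
}
```

On the example above, [[1,2,3], [2,4,5], [1,2], [7,8,9]] yields the components {1,2,3,4,5} and {7,8,9}, matching the expected result. Everything runs as distributed RDD transformations, so nothing is collected to the driver.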

Thanks in advance,

Pengcheng