Posted to user@spark.apache.org by Batselem <se...@gmail.com> on 2016/05/04 08:56:56 UTC

substitute mapPartitions by distinct

Hi, I am trying to remove duplicates from a set of RDD tuples in an iterative
algorithm, and I have found that it may be possible to substitute the RDD
mapPartitions transformation for RDD distinct.
First I partition the RDD with a HashPartitioner, then deduplicate each
partition locally inside mapPartitions. I expect this to be much faster in an
iterative algorithm, and when I compared the two results they were equal. But
I am not sure the substitution is correct in general. My concern is that it
does not guarantee that all duplicates are removed: hash codes can sometimes
collide, and if copies of the same tuple end up in different partitions, the
following code does not work. For the local distinct to be correct, all
copies of the same tuple must land in the same partition. Any suggestion
would be appreciated.

Code:

import org.apache.spark.HashPartitioner

// Co-locate equal tuples: partitionBy routes each (key, value) tuple to a
// partition chosen from the key's hash code.
val pairRDD = inputPairs.partitionBy(new HashPartitioner(slices))

// Global deduplication (involves a shuffle).
val distinctCount = pairRDD.distinct().count()

// Local deduplication within each partition; preservesPartitioning = true
// keeps the partitioner so no further shuffle is needed.
val mapPartitionCount = pairRDD.mapPartitions(
  iterator => iterator.toList.distinct.iterator,
  preservesPartitioning = true
).count()

println("distinct : " + distinctCount)

println("mapPartitionCount : " + mapPartitionCount)


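In case it matters for an answer: I also tried a variant that avoids
materializing each partition as a List by filtering the iterator through a
mutable HashSet. This is just a sketch I have not benchmarked; the names are
my own:

import scala.collection.mutable

// Streamed per-partition dedup: keeps only the set of elements seen so far
// in memory, instead of a full List copy of the partition.
val streamedCount = pairRDD.mapPartitions({ iterator =>
  val seen = mutable.HashSet.empty[Any]  // elements already emitted
  iterator.filter(seen.add)              // add returns true on first sight
}, preservesPartitioning = true).count()

Unlike toSet, this also preserves the order of first occurrence within each
partition.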

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/substitute-mapPartitions-by-distinct-tp26876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org