Posted to user@spark.apache.org by "anthonyjschulte@gmail.com" <an...@gmail.com> on 2014/09/05 22:09:44 UTC

Repartition inefficient

I wonder if anyone has any tips for using repartition?

It seems that when you call the repartition method, the entire RDD gets
split up, shuffled, and redistributed... This is an extremely heavy
operation if you have a large HDFS dataset and all you want to do is make
sure your RDD is balanced / data skew is minimal...
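To illustrate what I mean, here is roughly what I am doing in the
spark-shell (the HDFS path and the partition count below are just
placeholders, not my real values):

    // In spark-shell, where sc is predefined.
    val rdd = sc.textFile("hdfs:///data/large-dataset")

    // repartition(n) is just coalesce(n, shuffle = true): every record
    // gets hash-partitioned and moved over the network, so the whole
    // dataset is rewritten even if only a few partitions are skewed.
    val balanced = rdd.repartition(200)

    // Rough check of how evenly records landed in each partition.
    balanced.mapPartitions(it => Iterator(it.size)).collect().foreach(println)

Watching the stage in the UI, that repartition shuffles essentially the
full dataset across the network just to even out the partition sizes.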

I have tried coalesce(numPartitions, shuffle = false), but it seems to be
somewhat ineffective at balancing the blocks.
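Concretely, this is the variant I tried (again with a placeholder path).
As far as I understand it, with shuffle = false coalesce only merges whole
parent partitions, preferring ones on the same executor, and never splits
a large one, which would explain why the skew survives:

    val rdd = sc.textFile("hdfs:///data/large-dataset")

    // No shuffle: partitions are merged in place, so a skewed input
    // block stays skewed after the merge.
    val merged = rdd.coalesce(100, shuffle = false)

    // The per-partition record counts still come out very uneven.
    merged.mapPartitions(it => Iterator(it.size)).collect().foreach(println)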

Care to share your experiences?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Repartition-inefficient-tp13587.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org