Posted to user@spark.apache.org by julyfire <he...@gmail.com> on 2014/09/08 02:23:09 UTC

Spark groupByKey partition out of memory

When a MappedRDD is handled by the groupByKey transformation, tuples with
the same key that are distributed across different worker nodes will be
collected onto one worker node, i.e.,
(K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)).
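
For instance, a minimal runnable sketch of this behavior (the local master
and all names here are illustrative assumptions, not anything specific to
my job):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup; "local[2]" is only to keep the example self-contained.
val conf = new SparkConf().setAppName("groupByKeyExample").setMaster("local[2]")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("K", "V1"), ("K", "V2"), ("K", "V3")))
// groupByKey gathers every value for a key into one partition:
val grouped = pairs.groupByKey()  // RDD[(String, Iterable[String])]
grouped.collect().foreach { case (k, vs) =>
  println(s"$k -> ${vs.mkString("Seq(", ", ", ")")}")
}
// prints something like: K -> Seq(V1, V2, V3)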

I want to know whether the value Seq(V1, V2, ..., Vn) of a tuple in the
grouped RDD can reside on different nodes, or whether it has to be on one
node, if I set the number of partitions when using groupByKey. If the
value Seq(V1, V2, ..., Vn) can only reside in the memory of a single
machine, there is an out-of-memory risk whenever the size of Seq(V1, V2,
..., Vn) exceeds the JVM memory limit of that machine. If this case
happens, how should we deal with it?
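
A sketch of what I mean, reusing the pairs RDD from the example above; the
partition count of 100 is arbitrary:

// Even with an explicit partition count, all values for a single key are
// still buffered as one in-memory collection inside one partition:
val grouped100 = pairs.groupByKey(100)

// If only an aggregate is needed, reduceByKey combines values map-side
// and never materializes the full Seq(V1, V2, ..., Vn):
val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _)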



