Posted to user@spark.apache.org by memoryzpp <me...@gmail.com> on 2016/11/29 04:42:17 UTC

groupbykey data access size vs Reducer number

Hi all,

How does shuffle work in Spark 1.6.2? I am using groupByKey(partitionSize: Int).
groupByKey, a shuffle operation, has a mapper side (M mappers) and a reducer
side (R reducers).

Here R = partitionSize, and each mapper produces a local file output stored in
spark.local.dir. Assume the total shuffle data size is D; then each reducer
shuffle-reads D/R data.
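As a minimal sketch of that per-reducer model (the total shuffle size D and the reducer counts below are made-up values, not measurements from this job):

```scala
// Sketch of the model above: each reducer pulls roughly D / R bytes in total.
// totalShuffleBytes (D) is a hypothetical value chosen for illustration.
object PerReducerRead {
  def perReducerBytes(totalShuffleBytes: Long, numReducers: Int): Long =
    totalShuffleBytes / numReducers

  def main(args: Array[String]): Unit = {
    val d = 6000000000L // assume 6 GB of total shuffle data
    println(perReducerBytes(d, 6000)) // 1000000 bytes per reducer at R = 6000
    println(perReducerBytes(d, 3000)) // 2000000 bytes per reducer at R = 3000
  }
}
```

Halving R doubles the per-reducer read, which is the behavior the question probes below.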

My question is: when R changes (for example, when R decreases), each reducer
reads more data per partition (P = D/R increases as R decreases). Since the
data for each reducer comes from every mapper's output, does that mean each
reducer reads, on average, P/M = D/(R*M) data from each mapper? However, what I
observe is not consistent with this model. I used the iostat tool to examine
the I/O request size and found no difference in request size when decreasing R.
Does anyone know the details of shuffle? Many thanks!

R = 6000
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28140/iostat_m14_reduceBy2_core6_readSize.png> 

R = 3000
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28140/iostat_m14_reduceBy4_core6_readSize.png> 

As the two iostat plots show, the average I/O request size is the same for
both reducer counts: 250 sectors (250 * 512 B/sector = 128 KB).






--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupbykey-data-access-size-vs-Reducer-number-tp28140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org