You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by shahid ashraf <sh...@trialx.com> on 2015/10/19 13:16:38 UTC

SHUFFLE in PARTITIONBY or shuffle in general

Hi Folks

i am not able to understand the shuffle in paritionby
I am doing partitionby(hashparitioning on int) to repartition the data as
of data skew. see screen shot below, after doing the
partitionby(repartitioning ) why is the shuffle so high* 50 GB for only 3GB
data and why is shuffle read so high? after that for collect() to a task
which is action of geting total counts of records in each partition the
shuffle read is 50GB. Also as counts before partitioning took only 19s see
stage 5 vs stage 10 in screenshot.*

what is shuffle read and shuffle write in partiton by task.


-- 
with Regards
Shahid Ashraf

Fwd: SHUFFLE in PARTITIONBY or shuffle in general

Posted by shahid ashraf <sh...@trialx.com>.

Hi Folks

i am not able to understand the shuffle in paritionby
I am doing partitionby(hashparitioning on int) to repartition the data as
of data skew. see screen shot below, after doing the
partitionby(repartitioning ) why is the shuffle so high* 50 GB for only 3GB
data and why is shuffle read so high? after that for collect() to a task
which is action of geting total counts of records in each partition the
shuffle read is 50GB. Also as counts before partitioning took only 19s see
stage 5 vs stage 10 in screenshot.*

what is shuffle read and shuffle write in partiton by task.



-- 
with Regards
Shahid Ashraf