Posted to user@spark.apache.org by Vijay Kumar <vi...@gmail.com> on 2020/05/22 08:00:22 UTC

[apache-spark]-spark-shuffle

Hi,

I am trying to thoroughly understand the concepts below in Spark.
1. A job reads 2 files and performs a cartesian join.
2. The input sizes are 55.7 MB and 67.1 MB.
3. After reading the input files, Spark performed a shuffle; for both
inputs the shuffle size was in KB. I want to understand why this size is
not the complete size of a file. My understanding is that only the records
which need to move from one executor to another are shuffled; it is not
necessary to shuffle the whole file. Is this understanding correct?
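
(A rough way to see why the shuffle write can be far smaller than the input:
only the records that actually move between executors are written, and they
are written in serialized, compressed form. The sketch below is plain Python,
not Spark; pickle and zlib stand in for Spark's serializer and its shuffle
compression, and the rows are made up:)

```python
import pickle
import zlib

# Made-up rows standing in for the records a map task must hand to reducers.
rows = [(i % 8, f"value-{i}") for i in range(10_000)]

# A shuffle file holds serialized map output, compressed by default in Spark
# (spark.shuffle.compress=true); it is not a copy of the raw input file.
serialized = pickle.dumps(rows)
compressed = zlib.compress(serialized)

print(len(serialized), len(compressed))
assert len(compressed) < len(serialized)
```

So a KB-sized shuffle write is plausible even for a ~60 MB input when only a
few small, highly compressible records are exchanged.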

4. What are shuffle spill (memory) and shuffle spill (disk)? Do these
represent the same data, measured once in memory and once on disk? And how
are these values calculated?
5. When does a shuffle need to spill? It spills when data does not fit in
memory, but in which situations or scenarios can this happen?
6. On the SQL tab for the join above there are 2 Exchange nodes. Their
data size totals are:
  190. MB (min, med, max: 37.3 MB, 51.0 MB, 51.0 MB)
  228.9 MB (min, med, max: 46.9 MB, 60.6 MB, 60.6 MB)
How are these figures calculated?
On the Sort nodes the figures are:
  524.2 MB peak memory total (min, med, max: 64 KB, 64 KB, 128 MB)
  576.0 MB peak memory total (min, med, max: 144 MB, 144 MB, 144 MB)
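
(Regarding the (min, med, max) figures: on the SQL tab the headline number is
the sum of a metric over all tasks, while the triple gives the minimum,
median, and maximum of the per-task values, so the three values are not
expected to add up to the total. A sketch with made-up per-task sizes:)

```python
from statistics import median

# Hypothetical data sizes (MB) reported by five tasks of one Exchange.
task_sizes_mb = [12.0, 40.0, 40.0, 48.0, 50.0]

total = sum(task_sizes_mb)  # the headline "data size total"
stats = (min(task_sizes_mb), median(task_sizes_mb), max(task_sizes_mb))

print(f"total={total:.1f} MB, (min, med, max)={stats}")
# → total=190.0 MB, (min, med, max)=(12.0, 40.0, 50.0)
```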

I am trying to understand many things. If you can point me to some kind of
guide, link, or book where I can find answers to the questions above, along
with my other questions, that would be great.

Re: [apache-spark]-spark-shuffle

Posted by "vijay.bvp" <bv...@gmail.com>.
How a Spark job reads data sources depends on the underlying source system
and on the job configuration, such as the number of executors and cores per
executor.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets

About shuffle operations:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
https://stackoverflow.com/questions/32210011/spark-difference-between-shuffle-write-shuffle-spill-memory-shuffle-spill
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs/29012187#29012187

This has a great explanation of how a shuffle works:
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark
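
On question 5 (when a spill happens): a task spills when the buffer it uses
to hold in-flight shuffle records outgrows the memory the executor grants it,
for example with large or skewed partitions, many tasks sharing one executor,
or per-key aggregation state that keeps growing. The toy external sort below
imitates the mechanism in plain Python; the record limit is an invented
stand-in for Spark's memory thresholds:

```python
import heapq
import os
import pickle
import tempfile

MEMORY_LIMIT = 100  # records; an invented stand-in for the task's memory budget

def sort_with_spills(records, tmpdir):
    """Sort records, spilling sorted runs to disk whenever the buffer fills."""
    buffer, runs = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= MEMORY_LIMIT:      # "data does not fit in memory"
            buffer.sort()
            path = os.path.join(tmpdir, f"spill-{len(runs)}.bin")
            with open(path, "wb") as f:
                pickle.dump(buffer, f)       # spill (disk): a serialized run
            runs.append(path)
            buffer = []
    buffer.sort()
    spilled = []
    for path in runs:
        with open(path, "rb") as f:
            spilled.append(pickle.load(f))
    # Merge the in-memory remainder with every on-disk run at the end.
    return list(heapq.merge(buffer, *spilled)), len(runs)

with tempfile.TemporaryDirectory() as d:
    out, n_spills = sort_with_spills(range(249, -1, -1), d)

print(n_spills)  # → 2
assert out == list(range(250))
```

Shuffle spill (memory) is the size of such a buffer as deserialized objects
at the moment it is spilled, while shuffle spill (disk) is the serialized,
compressed size of the runs actually written; that is why the two numbers
differ for the same data.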

========
A sample of the code and job configuration, the DAG, and the underlying
source (HDFS or other) would help explain further.

thanks
VP



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org