Posted to user@spark.apache.org by wxhsdp <wx...@gmail.com> on 2014/06/30 04:45:27 UTC

questions about shuffle time and degree of parallelism

Hi all,
  I have two questions about shuffle time and the degree of parallelism.

  question 1:
  Assume the cluster size is fixed, for example a cluster of 16 EC2 nodes,
each node with 2 cores.
  case 1: a total (all-to-all) shuffle of 64 GB of data across 32 partitions
  case 2: a total (all-to-all) shuffle of 128 GB of data across 128 partitions
  The data is evenly distributed across the cluster, but we found the shuffle
time of case 2 to be 5x-6x longer than that of case 1.
  My guess is that one reason is the OS buffer cache: there is more data in
case 2, so the cache is exhausted and we have to write to and read from disk
directly. Can anyone also explain this from the perspective of network
transfer?
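  To make the two cases concrete, here is a rough back-of-envelope sketch
(an assumption on my part, not measured: it models an all-to-all hash shuffle
in which every map partition writes one block per reduce partition, so the
block count grows quadratically with the partition count):

```python
# Back-of-envelope shuffle block math for the two cases above.
# Assumption: an all-to-all hash shuffle, where each of P map partitions
# writes one block for each of P reduce partitions (P*P blocks total).

def shuffle_blocks(total_gb, partitions):
    """Return (block_count, block_size_mb) for an all-to-all shuffle."""
    block_count = partitions * partitions
    block_size_mb = total_gb * 1024 / block_count
    return block_count, block_size_mb

case1 = shuffle_blocks(64, 32)    # 1024 blocks of 64 MB each
case2 = shuffle_blocks(128, 128)  # 16384 blocks of 8 MB each
print(case1, case2)
```

  Under that assumption, case 2 issues 16x as many block fetches, each 8x
smaller, so per-request overhead and small random disk reads could add to the
buffer-cache effect I guessed at above.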

  question 2:
  We all know that a higher degree of parallelism means lower computation
time, but what is its impact on shuffle time?
  This time assume the data size is fixed, for example 64 GB, and the cluster
is large enough.
  case 1: data evenly distributed among 32 partitions, then a total shuffle
  case 2: data evenly distributed among 64 partitions, then a total shuffle
  Will the shuffle time be halved in case 2?
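  Quantifying my own setup (again under the assumption of an all-to-all hash
shuffle with one block per map/reduce partition pair, which may not match
Spark's actual shuffle implementation):

```python
# Back-of-envelope comparison for a fixed 64 GB data set.
# Assumption: all-to-all hash shuffle, P*P blocks for P partitions.

def shuffle_profile(total_gb, partitions):
    """Return (per_task_gb, block_count, block_size_mb)."""
    per_task_gb = total_gb / partitions
    block_count = partitions * partitions
    block_size_mb = total_gb * 1024 / block_count
    return per_task_gb, block_count, block_size_mb

case1 = shuffle_profile(64, 32)  # 2 GB per task, 1024 blocks of 64 MB
case2 = shuffle_profile(64, 64)  # 1 GB per task, 4096 blocks of 16 MB
print(case1, case2)
```

  So while the data handled per task halves in case 2, the number of shuffle
fetches quadruples and each block shrinks 4x, which is why I am unsure the
wall-clock shuffle time would simply halve.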

  I'd appreciate your help :)
  



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-shuffle-time-and-parallel-degree-tp8512.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.