Posted to dev@spark.apache.org by gsvic <vi...@gmail.com> on 2015/11/19 08:02:39 UTC

Hash Partitioning & Sort Merge Join

In the case of a Sort Merge Join, where a shuffle (exchange) will be performed, I
have the following questions (please correct me if my understanding is
wrong):

Let's say that relation A is a JSONRelation (640 MB) on HDFS, where the
block size is 64 MB. This will produce a Scan JSONRelation() of 10 partitions
(640 / 64), where each partition will contain roughly |A| / 10 rows.
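The arithmetic above can be sketched as follows (a plain illustration of one input partition per HDFS block, not Spark code; the row count is hypothetical):

```python
# One scan partition per HDFS block (hypothetical sizes from the example above).
file_size_mb = 640
block_size_mb = 64

num_partitions = file_size_mb // block_size_mb  # 640 / 64 = 10

total_rows = 1_000_000  # hypothetical |A|
rows_per_partition = total_rows // num_partitions  # ~|A| / 10 rows per partition

print(num_partitions, rows_per_partition)
```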

The second step will be a hashPartitioning(#key, 200), where #key is the
equi-join condition and 200 the default number of shuffle partitions
(spark.sql.shuffle.partitions). Each partition will be computed in an
individual task, in which every row will be hashed on the #key and then
written to the corresponding chunk (of the 200 resulting chunks) directly on
disk.
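A minimal sketch of that hash-partitioning step (Spark's HashPartitioner maps a key to nonNegativeMod(key.hashCode, numPartitions); Python's built-in hash() stands in for Scala's hashCode here, so this is an illustration of the logic, not Spark's implementation):

```python
NUM_PARTITIONS = 200  # spark.sql.shuffle.partitions default

def non_negative_mod(x: int, mod: int) -> int:
    # Map a possibly-negative hash value into [0, mod).
    r = x % mod
    return r + mod if r < 0 else r

def partition_for(key) -> int:
    return non_negative_mod(hash(key), NUM_PARTITIONS)

# Every row with the same join key lands in the same output chunk:
rows = [("k1", 1), ("k2", 2), ("k1", 3)]
chunks = {}
for key, value in rows:
    chunks.setdefault(partition_for(key), []).append((key, value))
```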

Q1: What happens if a resulting hashed row on executor A must be written
to a chunk that is stored on executor B? Does it use the
HashShuffleManager to transfer it over the network?

Q2: After the Sort (3rd) step there will be 200 resulting partitions/chunks
for each of relations A and B, which will be combined into 200 SortMergeJoin
tasks, where each of them will contain roughly (|A|/200 + |B|/200) rows. For
each pair (chunkOfA, chunkOfB), will chunkOfA and chunkOfB contain rows for
the same hash key?
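The property Q2 asks about can be sketched as follows: if both relations are shuffled with the same partitioning function and the same partition count, rows sharing a join key necessarily land in the chunk with the same id on each side (hypothetical data; plain Python, not Spark):

```python
NUM_PARTITIONS = 200

def partition_for(key) -> int:
    # Python's % already yields a non-negative result for a positive modulus.
    return hash(key) % NUM_PARTITIONS

def shuffle(rows):
    # Group rows into chunks by the partition id of their key.
    chunks = {}
    for key, value in rows:
        chunks.setdefault(partition_for(key), []).append((key, value))
    return chunks

a = [("k1", "a1"), ("k2", "a2")]
b = [("k1", "b1"), ("k3", "b2")]

chunks_a, chunks_b = shuffle(a), shuffle(b)
# "k1" appears on both sides, in the chunk with the same id on each side,
# so the join only needs to pair chunks with matching ids.
```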

Q3: In the SortMergeJoin of Q2, I suppose that each of the 200 SortMergeJoin
tasks joins two partitions/chunks with the same hash key. So, if a task
corresponds to a hash key X, does it use ShuffleBlockFetcherIterator to fetch
the two shuffles/chunks (of relations A and B) with hash key X?

Q4: Which sorting algorithm is being used?
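For context on the sort-then-merge steps behind Q2 and Q4, here is a hedged sketch of the merge phase: once each side's chunk is sorted by key, a single linear pass produces the join output (an illustration of the general algorithm, not Spark's actual implementation):

```python
def sort_merge_join(left, right):
    # Sort each chunk by key, then merge with two cursors.
    left = sorted(left)
    right = sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit one output row per matching right row for this left row.
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, lv, right[j2][1]))
                j2 += 1
            i += 1
    return out

result = sort_merge_join([("b", 2), ("a", 1)],
                         [("a", 10), ("c", 3), ("a", 11)])
# result == [("a", 1, 10), ("a", 1, 11)]
```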



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Hash-Partitioning-Sort-Merge-Join-tp15275.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org