Posted to user@spark.apache.org by Muthu Jayakumar <ba...@gmail.com> on 2017/04/30 07:07:51 UTC

Spark repartition question...

Hello there,

I am trying to understand the difference between the following
repartition() variants...
a. def repartition(partitionExprs: Column*): Dataset[T]
b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
c. def repartition(numPartitions: Int): Dataset[T]
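
For concreteness, here is roughly how I am invoking them (just a sketch,
assuming a Dataset `df` that has a categorical column "cat_col"):

    import org.apache.spark.sql.functions.col

    val byExpr     = df.repartition(col("cat_col"))      // (a) partition by expression
    val byExprAndN = df.repartition(50, col("cat_col"))  // (b) partition by expression into 50 partitions
    val byN        = df.repartition(50)                  // (c) just a target partition count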

My understanding is that (c) is a simpler hash-based partitioner where the
records are distributed roughly equally across numPartitions.
(a) is more like (c) except that the numPartitions depends on the distinct
column values from the expression. Right?
(b) is similar to (a), but what does numPartitions mean here?

On a side note, from the source code it seems like (a) and (b) both use
RepartitionByExpression. And my guess is that (a) would default
numPartitions to 200 (the default value of spark.sql.shuffle.partitions).
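
To verify that guess I was planning to check something like the following
(a sketch; assumes `spark` is the active SparkSession and `df` is as above):

    spark.conf.get("spark.sql.shuffle.partitions")         // "200" unless overridden
    df.repartition(col("cat_col")).rdd.getNumPartitions    // expecting 200?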

Reason for my question...
Say df.repartition(50, col("cat_col")),
and `cat_col` has about 20 distinct values in the df. Would the effective
number of partitions still be 50? And if it is 50, would the 20 distinct
values most likely each get their own partition, with some of the values
possibly repeating into the remaining ~30 partitions... Is this loosely correct?
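
A rough check I had in mind (sketch only; a "cat_col" with ~20 distinct
values is assumed):

    val repart = df.repartition(50, col("cat_col"))
    repart.rdd.getNumPartitions                                  // 50
    repart.rdd.mapPartitions(it => Iterator(it.size)).collect()  // how many of the 50 partitions are non-empty?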

The reason for my question is that I am trying to fit a large amount of data
in memory that would not otherwise fit across all the workers in the cluster.
But if I repartition the data in some logical manner, then I should be able
to fit the data in the heap, perform some useful joins, and write the result
back to Parquet (or another useful datastore).
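
What I am hoping to do, loosely (sketch only; the paths and the "join_key"
column are made up for illustration):

    val left  = spark.read.parquet("/data/left").repartition(50, col("join_key"))
    val right = spark.read.parquet("/data/right").repartition(50, col("join_key"))
    left.join(right, "join_key")
        .write.mode("overwrite").parquet("/data/joined")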

Please advise,
Muthu