Posted to user@spark.apache.org by Mkal <di...@hotmail.com> on 2019/07/03 21:34:26 UTC

Attempting to avoid a shuffle on join

Please keep in mind I'm fairly new to Spark.
I have some Spark code where I load two text files as Datasets and,
after some map and filter operations to bring the columns into a
specific shape, I join the Datasets.

The join takes place on a common column (of type string).
Is there any way to avoid the exchange/shuffle before the join?
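
(For reference, I can see the shuffle as an Exchange operator when I
print the physical plan; "joined" below stands in for my actual joined
Dataset:)

// explain() prints the physical plan; the shuffle shows up as
// "Exchange hashpartitioning(<join column>, ...)" nodes feeding the join
joined.explain()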

As I understand it, the idea is that if I hash-partition both datasets
on the join column up front, the join only has to look within the
matching partitions on each side, thus avoiding a shuffle.

In the RDD API, you can create a HashPartitioner and use partitionBy
when creating the RDDs (though I'm not sure this reliably avoids the
shuffle on the join). Is there a similar method in the
DataFrame/Dataset API?

I would also like to avoid repartition, repartitionByRange and the
bucketing techniques, since I only intend to do a single join and those
also require a shuffle up front.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Attempting to avoid a shuffle on join

Posted by Chris Teoh <ch...@gmail.com>.
The DataFrame/Dataset counterpart of partitionBy is repartition on the
join column, which hash-partitions the data (the partitionBy on
DataFrameWriter only controls the on-disk layout when writing).
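
Something like this (just a sketch; the column and Dataset names are
placeholders). Each repartition is itself a one-off shuffle, but in my
experience the planner then doesn't add another Exchange for the join:

import org.apache.spark.sql.functions.col

// both sides end up hash-partitioned on "key" with the same number of
// partitions (spark.sql.shuffle.partitions), so the join itself adds
// no further Exchange (it may still add sorts)
val leftP  = leftDF.repartition(col("key"))
val rightP = rightDF.repartition(col("key"))
val joined = leftP.join(rightP, "key")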

You can avoid a shuffle if one of your datasets is small enough to
broadcast.
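
For example, with the broadcast hint from
org.apache.spark.sql.functions (Dataset names are placeholders):

import org.apache.spark.sql.functions.broadcast

// ships a full copy of smallDF to every executor and performs a
// broadcast hash join -- no shuffle on either side
val joined = largeDF.join(broadcast(smallDF), "key")

Spark also broadcasts automatically for inputs below
spark.sql.autoBroadcastJoinThreshold (10 MB by default).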
