Posted to user@spark.apache.org by Soila Pertet Kavulya <sk...@gmail.com> on 2015/03/13 02:37:38 UTC

Support for skewed joins in Spark

Does Spark support skewed joins similar to Pig's, which distribute large
keys over multiple partitions? I tried using the RangePartitioner, but
I am still experiencing failures because some keys are too large to
fit in a single partition. I cannot use broadcast variables to
work around this because both RDDs are too large to fit in driver
memory.
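For context on why RangePartitioner does not relieve this kind of skew: a range
partitioner assigns every record with a given key to the same partition, so one
hot key still lands in a single task no matter how many partitions you request.
A minimal plain-Python simulation of that behavior (illustrative only, not the
Spark API; the key distribution below is made up):

```python
import bisect
import random

def range_bounds(keys, num_partitions, sample_size=1000, seed=0):
    # Roughly what a range partitioner does: sample the keys and pick
    # num_partitions - 1 evenly spaced split points from the sorted sample.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_of(key, bounds):
    # Every record with an identical key maps to the same bucket, so a
    # single huge key cannot be spread across partitions this way.
    return bisect.bisect_left(bounds, key)

# A skewed dataset: 90% of records share one key.
keys = ["hot"] * 900 + [f"k{i:03d}" for i in range(100)]
bounds = range_bounds(keys, num_partitions=4)
hot_partitions = {partition_of(k, bounds) for k in keys if k == "hot"}
# All 900 "hot" records end up in one and the same partition.
```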

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Support for skewed joins in Spark

Posted by Deepak <de...@gmail.com>.
Hello Soila,
Can you share the code that shows your usage of RangePartitioner?
I am facing an issue with .join() where one task runs forever. I tried
repartition(100/200/300/1200) and it did not help. I cannot use a map-side
join because both datasets are huge and exceed the driver memory size.
Regards,
Deepak
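A common workaround when neither repartitioning nor a broadcast (map-side) join
is possible is key salting: append a random suffix to the hot keys on the skewed
side and replicate the matching rows on the other side once per suffix, so the
values of one hot key spread over several reduce buckets. A plain-Python sketch
of the idea (the helper name and list-of-pairs representation are invented for
illustration; in Spark this would be done with map/flatMap before the join):

```python
import random
from collections import defaultdict

def salted_join(left, right, hot_keys, n_salts=4, seed=0):
    # Simulate a salted join: each `left` row with a hot key gets a random
    # salt, and the matching `right` rows are replicated once per salt, so
    # the values of one hot key spread across n_salts buckets.
    rng = random.Random(seed)
    salted_left = [((k, rng.randrange(n_salts) if k in hot_keys else 0), v)
                   for k, v in left]
    salted_right = []
    for k, v in right:
        salts = range(n_salts) if k in hot_keys else (0,)
        salted_right.extend(((k, s), v) for s in salts)
    # Ordinary hash join on the salted keys; the salt is dropped on output.
    index = defaultdict(list)
    for sk, v in salted_right:
        index[sk].append(v)
    return [(k, (lv, rv))
            for (k, s), lv in salted_left
            for rv in index[(k, s)]]

pairs = salted_join(
    left=[("a", 1), ("a", 2), ("b", 3)],
    right=[("a", "x"), ("b", "y")],
    hot_keys={"a"},
)
# Same pairs as a plain inner join, despite the salting.
```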

On Fri, Mar 13, 2015 at 9:54 AM, Soila Pertet Kavulya <sk...@gmail.com>
wrote:

> Thanks Shixiong,
>
> I'll try out your PR. Do you know what the status of the PR is? Are
> there any plans to incorporate this change into
> DataFrames/SchemaRDDs in Spark 1.3?
>
> Soila
>
> On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu <zs...@gmail.com> wrote:
> > I sent a PR to add skewed join last year:
> > https://github.com/apache/spark/pull/3505
> > However, it does not split a key across multiple partitions. Instead, if
> > a key has too many values to fit in memory, it stores the
> > values on disk temporarily and uses the disk files to do the join.
> >
> > Best Regards,
> >
> > Shixiong Zhu
> >
> > 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya <sk...@gmail.com>:
> >>
> >> Does Spark support skewed joins similar to Pig's, which distribute large
> >> keys over multiple partitions? I tried using the RangePartitioner, but
> >> I am still experiencing failures because some keys are too large to
> >> fit in a single partition. I cannot use broadcast variables to
> >> work around this because both RDDs are too large to fit in driver
> >> memory.
> >>
> >
>


-- 
Deepak

Re: Support for skewed joins in Spark

Posted by Soila Pertet Kavulya <sk...@gmail.com>.
Thanks Shixiong,

I'll try out your PR. Do you know what the status of the PR is? Are
there any plans to incorporate this change into
DataFrames/SchemaRDDs in Spark 1.3?

Soila

On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu <zs...@gmail.com> wrote:
> I sent a PR to add skewed join last year:
> https://github.com/apache/spark/pull/3505
> However, it does not split a key across multiple partitions. Instead, if a key
> has too many values to fit in memory, it stores the
> values on disk temporarily and uses the disk files to do the join.
>
> Best Regards,
>
> Shixiong Zhu
>
> 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya <sk...@gmail.com>:
>>
>> Does Spark support skewed joins similar to Pig's, which distribute large
>> keys over multiple partitions? I tried using the RangePartitioner, but
>> I am still experiencing failures because some keys are too large to
>> fit in a single partition. I cannot use broadcast variables to
>> work around this because both RDDs are too large to fit in driver
>> memory.
>>
>



Re: Support for skewed joins in Spark

Posted by Shixiong Zhu <zs...@gmail.com>.
I sent a PR to add skewed join last year:
https://github.com/apache/spark/pull/3505
However, it does not split a key across multiple partitions. Instead, if a key
has too many values to fit in memory, it stores the
values on disk temporarily and uses the disk files to do the join.
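The spill-to-disk idea the PR describes can be sketched roughly as follows.
This is not the PR's code: the class name, the JSON-lines spill format, and the
item-count threshold are all invented for illustration. Values for a key are
buffered in memory, spilled to a temp file past a cap, and streamed back from
disk at join time:

```python
import json
import os
import tempfile

class SpillableValues:
    """Buffer values for one join key, spilling to a temp file once the
    in-memory buffer exceeds `threshold` items (a stand-in for a memory cap)."""
    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.buffer = []
        self.spill_path = None

    def append(self, value):
        self.buffer.append(value)
        if len(self.buffer) >= self.threshold:
            self._spill()

    def _spill(self):
        if self.spill_path is None:
            fd, self.spill_path = tempfile.mkstemp(suffix=".spill")
            os.close(fd)
        with open(self.spill_path, "a") as f:
            for v in self.buffer:
                f.write(json.dumps(v) + "\n")
        self.buffer.clear()

    def __iter__(self):
        # Stream spilled values from disk first, then whatever is in memory.
        if self.spill_path is not None:
            with open(self.spill_path) as f:
                for line in f:
                    yield json.loads(line)
        yield from self.buffer

def join_with_spill(left, right, threshold=1000):
    # Inner join where one key's values on the left may not fit in memory.
    groups = {}
    for k, v in left:
        groups.setdefault(k, SpillableValues(threshold)).append(v)
    out = []
    for k, rv in right:
        if k in groups:
            out.extend((k, (lv, rv)) for lv in groups[k])
    return out
```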

Best Regards,
Shixiong Zhu

2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya <sk...@gmail.com>:

> Does Spark support skewed joins similar to Pig's, which distribute large
> keys over multiple partitions? I tried using the RangePartitioner, but
> I am still experiencing failures because some keys are too large to
> fit in a single partition. I cannot use broadcast variables to
> work around this because both RDDs are too large to fit in driver
> memory.
>