Posted to user@spark.apache.org by nathan grand <na...@gmail.com> on 2020/03/20 11:12:09 UTC

[Spark SQL]: Stability of large many-to-many joins

Hi,

I have two very large datasets, which both have many repeated keys, which I
wish to join.

A simplified example:

dsA

A_1 | A_2
1   | A
2   | A
3   | A
4   | A
5   | A
1   | B
2   | B
3   | B
1   | C

dsB

B_1 | B_2
A   | B
A   | C
A   | D
A   | E
A   | F
A   | G
B   | A
B   | E
B   | G
B   | H
C   | A
C   | B

The join I want to do is:

dsA.join(dsB, dsA("A_2") === dsB("B_1"), "inner")

However, this ends up putting a lot of pressure on the tasks that handle the
frequently occurring keys - it's either very, very slow to complete or I
encounter memory issues.

I've played with grouping both sides by the join key prior to joining
(which would make the join one-to-one), but memory seems to become an issue
again as the groups are very large.
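
Roughly what that grouping attempt looked like - just a sketch against the
example column names above, not the real code:

import org.apache.spark.sql.functions.{col, collect_list, explode}

// Collapse each side to one row per join key...
val groupedA = dsA.groupBy("A_2").agg(collect_list("A_1").as("a1s"))
val groupedB = dsB.groupBy("B_1").agg(collect_list("B_2").as("b2s"))

// ...so the join itself is one-to-one, then explode the lists back out.
val joined = groupedA
  .join(groupedB, groupedA("A_2") === groupedB("B_1"), "inner")
  .withColumn("A_1", explode(col("a1s")))
  .withColumn("B_2", explode(col("b2s")))
  .select("A_1", "A_2", "B_1", "B_2")

This keeps each key's work in a single task, but collect_list pulls every
value for a key into one in-memory array, which seems to be where it falls
over.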

Does anyone have any good suggestions as to how to make large many-to-many
joins complete reliably in Spark?

Reliability for me is much more important than speed - this is for a tool
so I can't over-tune to specific data sizes/shapes.

Thanks,

Nathan

Re: [Spark SQL]: Stability of large many-to-many joins

Posted by nathan grand <na...@gmail.com>.
Being many-to-many on two similarly sized large datasets, does salting
really help?

On Fri, Mar 20, 2020 at 3:26 PM Peyman Mohajerian <mo...@gmail.com>
wrote:

> Two options: either add salting to your join, or filter out the records
> that are frequent, join them separately, and then union back - it's the
> skew join issue.

Re: [Spark SQL]: Stability of large many-to-many joins

Posted by Peyman Mohajerian <mo...@gmail.com>.
Two options: either add salting to your join, or filter out the records
that are frequent, join them separately, and then union back - it's the
skew join issue.
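
A rough sketch of the salting option, using the column names from your
example - the bucket count is made up and you'd have to tune it (sequence
needs Spark 2.4+):

import org.apache.spark.sql.functions.{col, explode, floor, lit, rand, sequence}

val numSalts = 32  // hypothetical; size it to how skewed the keys are

// Scatter one side across numSalts sub-keys...
val saltedA = dsA.withColumn("salt", floor(rand() * numSalts).cast("int"))

// ...and replicate the other side once per salt value so every pair still meets.
val saltedB = dsB.withColumn("salt", explode(sequence(lit(0), lit(numSalts - 1))))

val joined = saltedA
  .join(saltedB,
    saltedA("A_2") === saltedB("B_1") && saltedA("salt") === saltedB("salt"),
    "inner")
  .drop(saltedA("salt"))
  .drop(saltedB("salt"))

The trade-off is that the replicated side grows by a factor of numSalts, so
keep the bucket count modest. For the second option, something along these
lines - the frequency threshold is made up, derive it from real counts:

val heavyKeys = dsA.groupBy("A_2").count()
  .filter(col("count") > 100000)
  .select("A_2")

val heavyA = dsA.join(heavyKeys, Seq("A_2"), "left_semi")  // rows with frequent keys
val lightA = dsA.join(heavyKeys, Seq("A_2"), "left_anti")  // everything else

// Join each slice separately (the heavy slice is the one to salt, or to
// broadcast if one side of it happens to be small), then union the results.
val result = heavyA.join(dsB, heavyA("A_2") === dsB("B_1"), "inner")
  .union(lightA.join(dsB, lightA("A_2") === dsB("B_1"), "inner"))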
