Posted to user@spark.apache.org by jelmer <jk...@gmail.com> on 2019/11/29 09:50:38 UTC

Any way to make catalyst optimise away join

I have two dataframes, let's call them A and B.

A is made up of [unique_id, field1]
B is made up of [unique_id, field2]

They have exactly the same number of rows, and every id in A is also present
in B.

If I execute a join like this: A.join(B,
Seq("unique_id")).select($"unique_id", $"field1"), then Spark will do an
expensive join even though it does not have to, because all the fields it
needs are already in A. Is there some trick I can use so that Catalyst will
optimise this join away?
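
For reference, a minimal sketch of the situation (assuming a local
SparkSession and made-up toy data; field1/field2 are just the placeholder
column names from the question). Printing the plans shows that Catalyst does
not eliminate this inner join, because the guarantee that every id in A also
exists in B lives only in the data, not in the schema:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-ins for the two dataframes from the question
val A = Seq((1L, "a1"), (2L, "a2"), (3L, "a3")).toDF("unique_id", "field1")
val B = Seq((1L, "b1"), (2L, "b2"), (3L, "b3")).toDF("unique_id", "field2")

// The join in question: only columns from A end up in the output
val joined = A.join(B, Seq("unique_id")).select($"unique_id", $"field1")

// Prints the parsed, analyzed, optimized and physical plans;
// the join still appears in all of them
joined.explain(true)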

Re: Any way to make catalyst optimise away join

Posted by Jerry Vinokurov <gr...@gmail.com>.
This seems like a suboptimal situation for a join. How can Spark know in
advance that every id in A is also present in B and that the tables have the
same number of rows? I suppose you could just sort the two frames by id and
concatenate them, but I'm not sure what join optimization is available here.
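
Continuing with the toy A and B frames from the sketch above (a hedged
sketch, not a general recipe): if the poster's guarantees really hold, the
cheapest "optimisation" is to not express the join at all, since every column
the query needs is already in A. If the every-id-in-B guarantee is uncertain,
a left_semi join keeps only A's columns and can be broadcast when B is small
enough:

// Given the stated guarantees, the join adds nothing: just project A
val sameResult = A.select("unique_id", "field1")

// If A still needs to be filtered down to ids that exist in B, a left_semi
// join returns only A's columns; broadcast(B) is only sensible if B fits
// in memory on the executors
import org.apache.spark.sql.functions.broadcast
val semi = A.join(broadcast(B), Seq("unique_id"), "left_semi")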

On Fri, Nov 29, 2019, 4:51 AM jelmer <jk...@gmail.com> wrote:

> I have two dataframes, let's call them A and B.
>
> A is made up of [unique_id, field1]
> B is made up of [unique_id, field2]
>
> They have exactly the same number of rows, and every id in A is also present
> in B.
>
> If I execute a join like this: A.join(B,
> Seq("unique_id")).select($"unique_id", $"field1"), then Spark will do an
> expensive join even though it does not have to, because all the fields it
> needs are already in A. Is there some trick I can use so that Catalyst will
> optimise this join away?
>