Posted to user@spark.apache.org by "Dipl.-Inf. Rico Bergmann" <in...@ricobergmann.de> on 2020/03/23 07:45:50 UTC

Spark 2.2.1 Dataframes multiple joins bug?

Hi all!

Is it possible that, under certain circumstances, Spark creates duplicate
rows when doing multiple joins?

What I did:

buse.count

res0: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count

res1: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), 
$"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), 
$"bdef._c9"===$"crnb._c4").count

res2: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), 
$"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), 
$"bdef._c9"===$"crnb._c4").join(wreg.alias("wreg"), 
$"crnb._c1"===$"wreg._c5").count

res3: Long = 21633023

For explanation: buse and crnb are 1:1 relationship tables.
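
To back this up, the key uniqueness on each side can be checked the same
way as the check for wreg below (a sketch; I am assuming _c4 is the key
column in both tables, with spark.implicits._ in scope as in spark-shell):

// Sketch: a 1:1 join requires the join key to be unique on both sides.
// Column names assumed from the join conditions above.
buse.groupBy($"_c4").count.filter($"count" > 1).show   // expect an empty result
crnb.groupBy($"_c4").count.filter($"count" > 1).show   // expect an empty result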

For the last join I again expected 20554365, but suddenly duplicate rows
appear. "wreg._c5" is a unique key, so the join should not produce more records:

wreg.groupBy($"_c5").agg(count($"_c2") as "cnt").filter($"cnt" > 1).show
+---+---+
|_c5|cnt|
+---+---+
+---+---+
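
One caveat with that check: count($"_c2") only counts non-null values, so
a key appearing twice with _c2 = null would not show up there. A
null-independent variant of the same check (a sketch):

// Sketch: compare the total row count with the number of distinct keys.
// If _c5 is unique, the two numbers must match, regardless of nulls in _c2.
val total      = wreg.count
val distinctC5 = wreg.select($"_c5").distinct.count
println(s"total = $total, distinct _c5 = $distinctC5")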

When doing a distinct on the 4-way join I get the expected number of 
records:

buse.alias("buse").join(bdef.alias("bdef"), 
$"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), 
$"bdef._c9"===$"crnb._c4").join(wreg.alias("wreg"), 
$"crnb._c1"===$"wreg._c5").distinct.count
res10: Long = 20554365
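
To narrow down which rows get duplicated, the joined result can be grouped
over all of its columns (a sketch; the positional rename via toDF is there
because several input tables share column names like _c4, which would
otherwise make the groupBy ambiguous):

import org.apache.spark.sql.functions.col

// Sketch: list the concrete rows that occur more than once in the 4-way join.
val joined = buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")

// Rename columns positionally so duplicate names from the joins don't clash.
val renamed = joined.toDF(joined.columns.indices.map(i => s"c$i"): _*)

renamed
  .groupBy(renamed.columns.map(col): _*)
  .count
  .filter($"count" > 1)
  .show(false)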

This (in my opinion) means that Spark is creating duplicate rows,
although it shouldn't. Or am I missing something?
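
In case it helps with diagnosing this, the query plans for the join chain
can be dumped as well (a sketch, reusing the joined value from above):

// Sketch: print the parsed, analyzed, optimized, and physical plans.
joined.explain(true)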


Best, Rico.