Posted to user@spark.apache.org by Aaron Jackson <aj...@pobox.com> on 2016/09/25 06:46:45 UTC
Left Join Yields Results And Not Results
Hi,
I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very
odd situation. I have two dataframes, base and updated. The "updated"
dataframe contains a constrained subset of the data from "base" that I wish to
exclude. Something like this:
updated = base.where(base.X == F.lit(1000))
It's more complicated than that, but you get the idea.
Later, I do a left join.
base.join(updated, 'Core_Column', 'left_outer')
This should return all values in base and null where updated doesn't have
an equality match. And that's almost true, but here's where it gets
strange.
base.join(updated, 'Core_Column', 'left_outer').select(base.FieldId,
updated.FieldId, 'updated.*').show()
|FieldId|FieldId|FieldId|x|y|z
|123|123|null|1|2|3
Now I understand why base.FieldId shows 123, but why does updated.FieldId
show 123 as well, when the expanded join for 'updated.*' shows null? I can
do what I want by using an RDD, but I was hoping to avoid bypassing
Tungsten.
It almost feels like it's optimizing the field based on the join. But I
tested other fields as well and they also came back with values from base.
Very odd.
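For comparison, here is the behavior I'd expect from the left outer join, sketched in plain Python with dicts standing in for rows (this is just an illustration of the semantics, not Spark itself; the column and key names are taken from the example above, and the sample values are made up):

```python
# Simplified model of a left outer join: every row from "base" is kept,
# and columns coming from "updated" are None when the key has no match.
base = [
    {"Core_Column": "123", "FieldId": "123"},
    {"Core_Column": "456", "FieldId": "456"},
]
updated = [
    {"Core_Column": "456", "FieldId": "456", "x": 1, "y": 2, "z": 3},
]

def left_outer_join(left, right, key):
    """Keep every left row; fill right-side columns with None on no match."""
    right_cols = sorted({c for row in right for c in row if c != key})
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key])
        out = dict(row)
        for col in right_cols:
            out["updated." + col] = match[col] if match else None
        joined.append(out)
    return joined

result = left_outer_join(base, updated, "Core_Column")
# The unmatched base row ("123") carries None for every updated.* column,
# which is what I'd expect updated.FieldId to show in the select above.
```

Under these semantics, the unmatched row should show null for updated.FieldId just as it does for the rest of 'updated.*'.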
Any thoughts?
Aaron