Posted to user@spark.apache.org by Aaron Jackson <aj...@pobox.com> on 2016/09/25 06:46:45 UTC

Left Join Yields Results And Not Results

Hi,

I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very
odd situation.  I have two dataframes, base and updated.  The "updated"
dataframe contains a constrained subset of the data from "base" that I wish
to exclude.  Something like this.

updated = base.where(base.X == F.lit(1000))

It's more complicated than that, but you get the idea.
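For a self-contained picture, a minimal setup might look roughly like the
sketch below.  The rows and the flat schema are made up purely for
illustration, and sc is an existing SparkContext.

from pyspark.sql import SQLContext
from pyspark.sql import functions as F

# Hypothetical schema and rows, for illustration only.
sqlContext = SQLContext(sc)
base = sqlContext.createDataFrame(
    [('A', 123, 1000, 1, 2, 3), ('B', 456, 2000, 4, 5, 6)],
    ['Core_Column', 'FieldId', 'X', 'x', 'y', 'z'])

# "updated" is the constrained subset of "base" to be excluded later.
updated = base.where(base.X == F.lit(1000))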

Later, I do a left join.

base.join(updated, 'Core_Column', 'left_outer')
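What I'm ultimately after is the rows of base that have no match in updated,
so the plan is roughly the sketch below (reusing the assumed schema above;
the column used for the null check is just an example).

# Keep only the base rows whose Core_Column has no match in updated; every
# column pulled from the updated side should be null on those rows.
joined = base.join(updated, 'Core_Column', 'left_outer')
remaining = joined.where(updated.FieldId.isNull())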

The left join should return every row from base, with nulls wherever updated
has no matching row.  And that's almost true, but here's where it gets
strange.

base.join(updated, 'Core_Column', 'left_outer').select(base.FieldId,
updated.FieldId, 'updated.*').show()

|FieldId|FieldId|FieldId|x|y|z
|123|123|null|1|2|3

Now I understand why base.FieldId shows 123, but why does updated.FieldId
show 123 as well, when the expanded 'updated.*' columns from the join show
null?  I can do what I want by using an RDD (sketched below), but I was
hoping to avoid bypassing Tungsten.

It almost feels like it's optimizing the field based on the join.  But I
tested other fields as well and they also came back with values from base.
Very odd.
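Coming back to the RDD route I mentioned, the fallback I have in mind is
roughly the sketch below (same assumed schema as above; it works, but it
leaves the DataFrame/Tungsten execution path).

# Key both sides by Core_Column and do the left outer join at the RDD level,
# keeping only rows with no match on the updated side (the anti-join I want).
base_kv = base.rdd.map(lambda r: (r.Core_Column, r))
updated_kv = updated.rdd.map(lambda r: (r.Core_Column, r))
kept = (base_kv.leftOuterJoin(updated_kv)
        .filter(lambda kv: kv[1][1] is None)
        .map(lambda kv: kv[1][0]))
result = sqlContext.createDataFrame(kept, base.schema)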

Any thoughts?

Aaron