Posted to user@spark.apache.org by Shahban Riaz <sr...@seek.com.au> on 2022/09/16 04:02:33 UTC

[Spark Core] Joining Same DataFrame Multiple Times Results in Column not getting dropped

Hi,

We have some PySpark code that joins a table, table_a, twice to another table, table_b, using the code below. After each join we drop the key_hash column from the output DataFrame.
This code worked fine in Spark 3.0.1. Since upgrading to Spark 3.2.2 the behaviour has changed: during the first transform operation the key_hash column is dropped from the output DataFrame, but when the second transform operation runs, key_hash still remains in output_df. Can someone please explain what has changed in Spark's behaviour to cause this?
def tr_join_sac_user(df_a):

    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sac_key_hash)
        )

    return inner

def tr_join_sec_user(df_a):

    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sec_key_hash"] == df_a["key_hash"], how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sec_key_hash)
        )

    return inner

table_a_df = spark.read.format("delta").load("/path/to/table_a")
table_b_df = spark.read.format("delta").load("/path/to/table_b")

output_df = table_b_df.transform(tr_join_sac_user(table_a_df))
output_df = output_df.transform(tr_join_sec_user(table_a_df))
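
In case anyone wants to reproduce this without our Delta tables, here is a minimal standalone sketch; the sample data and column values are made up, only the column names match our schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical stand-ins for table_a and table_b (made-up data).
table_a_df = spark.createDataFrame([(1, "a")], ["key_hash", "a_value"])
table_b_df = spark.createDataFrame([(1, 1, "b")], ["sac_key_hash", "sec_key_hash", "b_value"])

output_df = table_b_df.transform(tr_join_sac_user(table_a_df))
output_df = output_df.transform(tr_join_sec_user(table_a_df))

# Per the behaviour described above: on 3.0.1 'key_hash' is gone here,
# but on 3.2.2 it is still present after the second transform.
print(output_df.columns)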


If we use .drop('key_hash') instead of .drop(df_a.key_hash), that seems to work and the column does get dropped in the second transform as well. I would like to understand what changed in Spark's behaviour between these versions (or whether this is a bug), as it might have an impact in other places in our codebase as well.
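
For reference, this is what that workaround looks like in the first function; a minimal sketch, identical to the code above except that key_hash is dropped by name:

def tr_join_sac_user(df_a):

    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left")
            .drop("key_hash")  # drop by name instead of df_a.key_hash
            .drop(df_b.sac_key_hash)
        )

    return inner

Note that dropping by name removes every column called key_hash from the joined result; that is safe here because only df_a carries that column.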

Regards,
Shahban Riaz