Posted to issues@spark.apache.org by "Marco Gaido (JIRA)" <ji...@apache.org> on 2018/12/18 13:09:00 UTC

[jira] [Commented] (SPARK-26336) left_anti join with Na Values

    [ https://issues.apache.org/jira/browse/SPARK-26336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724052#comment-16724052 ] 

Marco Gaido commented on SPARK-26336:
-------------------------------------

That's correct, because NULLs do not match each other in a join condition: NULL = NULL does not evaluate to true. The usual implementation of an anti join in other databases (e.g. Postgres) is a left join followed by a filter for the right-side column being NULL. If you do that in your example, 1 row is returned.
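
To illustrate the point without a Spark cluster, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database. It is not Spark code. The table names df1 and df2 simply mirror the DataFrames in the report. The first query is the left-join-plus-filter form of an anti join with plain "="; the second uses SQLite's null-safe "IS" comparison, which is roughly analogous to Spark SQL's "<=>" operator.

```python
import sqlite3

# Same data as in the report: the third row has a NULL in columndata.
rows = [(1, "Test"), (2, "Test"), (3, None)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df1 (id INTEGER, columndata TEXT)")
conn.execute("CREATE TABLE df2 (id INTEGER, columndata TEXT)")
conn.executemany("INSERT INTO df1 VALUES (?, ?)", rows)
conn.executemany("INSERT INTO df2 VALUES (?, ?)", rows)

# Anti join written as LEFT JOIN + "right side IS NULL", comparing with "=".
# NULL = NULL is not true, so row 3 finds no partner and survives the filter.
anti_equals = conn.execute(
    """
    SELECT df1.id, df1.columndata
    FROM df1
    LEFT JOIN df2
      ON df1.id = df2.id AND df1.columndata = df2.columndata
    WHERE df2.id IS NULL
    """
).fetchall()

# Same query with SQLite's null-safe comparison "IS":
# NULL IS NULL is true, so every row finds its partner and nothing is returned.
anti_null_safe = conn.execute(
    """
    SELECT df1.id, df1.columndata
    FROM df1
    LEFT JOIN df2
      ON df1.id IS df2.id AND df1.columndata IS df2.columndata
    WHERE df2.id IS NULL
    """
).fetchall()

conn.close()

print(anti_equals)     # [(3, None)]
print(anti_null_safe)  # []
```

If the intent is to treat two NULLs as equal, the join condition has to use a null-safe comparison explicitly; with ordinary equality the single returned row is the expected result.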

> left_anti join with Na Values
> -----------------------------
>
>                 Key: SPARK-26336
>                 URL: https://issues.apache.org/jira/browse/SPARK-26336
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Carlos
>            Priority: Major
>
> When I join two DataFrames whose data contains NA values, the left_anti join does not work properly, because it does not detect rows with NA values.
> Example:  
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import *
> spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
> data = [(1,"Test"),(2,"Test"),(3,None)]
> df1 = spark.createDataFrame(data,("id","columndata"))
> df2 = spark.createDataFrame(data,("id","columndata"))
> df_joined = df1.join(df2, df1.columns,'left_anti'){code}
> df_joined contains data even though the two DataFrames are the same.


