Posted to user@spark.apache.org by Ravindra <ra...@gmail.com> on 2017/03/17 08:30:58 UTC

Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

Can someone please explain why

println(" Empty count " +
  hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count())

*prints* -  Empty count 1

This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and
found this. It causes my tests to fail. Is there another way to check
full equality of two DataFrames?

Thanks,
Ravindra.

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

Posted by Ravindra <ra...@gmail.com>.
Thanks a lot, Yong, for the explanation. But it sounds like an API
behaviour change. For now I check count != 0 on both DataFrames before
these operations, which is not good from a performance point of view,
hence I have created a JIRA (SPARK-20008) to track it.
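For the tests themselves, the check needed is multiset equality of rows; except alone is a set operation and ignores duplicate counts. A minimal, Spark-independent sketch of that semantics in plain Scala collections (here Row is just a Seq[Any] stand-in, not Spark's Row):

```scala
// Spark-independent sketch: treat a DataFrame's rows as Seq[Seq[Any]]
// and compare the two sides as multisets, so duplicate counts matter.
type Row = Seq[Any]

// Map each distinct row to how many times it occurs.
def rowCounts(rows: Seq[Row]): Map[Row, Int] =
  rows.groupBy(identity).map { case (row, occurrences) => (row, occurrences.size) }

// Fully equal iff every row occurs the same number of times on both
// sides; two empty inputs compare equal (count 0, not 1).
def sameRows(a: Seq[Row], b: Seq[Row]): Boolean =
  rowCounts(a) == rowCounts(b)
```

In Spark itself one would collect small DataFrames (or, for large data, compare per-row counts from a groupBy over all columns); the sketch only pins down what "full equality" should mean for the tests.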

Thanks,
Ravindra.

On Fri, Mar 17, 2017 at 8:51 PM Yong Zhang <ja...@hotmail.com> wrote:

> Starting from Spark 2, this kind of operation is implemented as a left
> anti join instead of using RDD operations directly.
>
>
> Same issue also on sqlContext.
>
>
> scala> spark.version
> res25: String = 2.0.2
>
>
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
>
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, *LeftAnti*, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
>
> This arguably means a bug. But my guess is that it is something like the
> logic of comparing NULL = NULL (should it return true or false?) that
> causes this kind of confusion.
>
> Yong
>
> ------------------------------
> *From:* Ravindra <ra...@gmail.com>
> *Sent:* Friday, March 17, 2017 4:30 AM
> *To:* user@spark.apache.org
> *Subject:* Spark 2.0.2 -
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
>
> Can someone please explain why
>
> println(" Empty count " +
>   hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count())
>
> *prints* -  Empty count 1
>
> This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and
> found this. It causes my tests to fail. Is there another way to check
> full equality of two DataFrames?
>
> Thanks,
> Ravindra.
>

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

Posted by Yong Zhang <ja...@hotmail.com>.
Starting from Spark 2, this kind of operation is implemented as a left anti join instead of using RDD operations directly.


Same issue also on sqlContext.


scala> spark.version
res25: String = 2.0.2


spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)

== Physical Plan ==
*HashAggregate(keys=[], functions=[], output=[])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[], output=[])
      +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
         :- Scan ExistingRDD[]
         +- BroadcastExchange IdentityBroadcastMode
            +- Scan ExistingRDD[]


This arguably means a bug. But my guess is that it is something like the logic of comparing NULL = NULL (should it return true or false?) that causes this kind of confusion.
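To make the NULL = NULL point concrete: under SQL's three-valued logic, NULL = NULL is not true, so with plain equality a row containing a NULL never matches itself in an anti join and therefore survives it. A Spark-independent sketch in plain Scala (None stands in for SQL NULL; the helper names are made up for illustration):

```scala
// Sketch of LEFT ANTI join semantics: keep each left row that matches
// no right row. None models SQL NULL.
type NRow = Seq[Option[Any]]

// Plain SQL equality: a NULL on either side never yields "true",
// so the pair does not count as a match.
def sqlEq(l: NRow, r: NRow): Boolean =
  l.size == r.size && l.zip(r).forall {
    case (Some(a), Some(b)) => a == b
    case _                  => false // NULL = anything is not true
  }

// LEFT ANTI: left rows with no equal counterpart on the right.
def leftAnti(left: Seq[NRow], right: Seq[NRow]): Seq[NRow] =
  left.filter(l => !right.exists(r => sqlEq(l, r)))
```

With this comparison, leftAnti(rows, rows) keeps exactly the rows that contain a NULL, so an except built on plain equality would report df.except(df) as non-empty; a null-safe comparison (SQL's <=> operator) treats the NULLs as equal and gives the empty result one expects.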

Yong

________________________________
From: Ravindra <ra...@gmail.com>
Sent: Friday, March 17, 2017 4:30 AM
To: user@spark.apache.org
Subject: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

Can someone please explain why

println(" Empty count " + hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count())

prints -  Empty count 1

This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and found this. It causes my tests to fail. Is there another way to check full equality of two DataFrames?

Thanks,
Ravindra.