Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/03/18 13:48:41 UTC

[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

    [ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931221#comment-15931221 ] 

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

I just took a quick look. {{BroadcastNestedLoopJoin}} looks fine with empty rows, but {{HashAggregate}} produces an iterator with a single empty row when {{groupingExpressions}} is empty, here: https://github.com/apache/spark/blob/dd9049e0492cc70b629518fee9b3d1632374c612/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L104-L125
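
To see what that means in practice, compare a global aggregate (empty {{groupingExpressions}}) with a grouped aggregate over the same empty input; the former always emits exactly one row. A quick check, assuming any Spark 2.x shell:

scala> spark.range(0).selectExpr("count(*)").count()   // global aggregate: always one output row
res0: Long = 1

scala> spark.range(0).groupBy("id").count().count()    // grouped aggregate over empty input: no rows
res1: Long = 0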


> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-20008
>                 URL: https://issues.apache.org/jira/browse/SPARK-20008
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>            Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 1 against expected 0.
> This was not the case with Spark 1.5.2. From a usage point of view this is an API change, so I consider it a bug. It may be a boundary case; I am not sure.
> Workaround: for now I check that the counts are != 0 before this operation. That is not good for performance, hence this JIRA to track it. A minimal sketch of that guard follows (the helper name is illustrative, not from the issue; note that except also de-duplicates, so the empty-right case uses distinct):
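> import org.apache.spark.sql.DataFrame
> def exceptCountWorkaround(a: DataFrame, b: DataFrame): Long =
>   if (a.count() == 0) 0L                         // empty left side: nothing survives
>   else if (b.count() == 0) a.distinct().count()  // except de-duplicates even with an empty right side
>   else a.except(b).count()                       // both sides non-empty: safe to use except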
> As Young Zhang explained in reply to my mail:
> Starting from Spark 2, these kinds of operations are implemented as a left anti join, instead of using RDD operations directly.
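> For DataFrames that do have columns, that equivalence is easy to see in the shell ({{<=>}} is Spark's null-safe equality; the example data is illustrative):
> scala> val a = Seq(1, 2, 3).toDF("x")
> scala> val b = Seq(2).toDF("x")
> scala> a.join(b, a("x") <=> b("x"), "leftanti").distinct().show()   // same result as a.except(b)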
> The same issue also occurs via sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
> This arguably indicates a bug. But my guess is that it is likely tied to the logic of comparing NULL = NULL (should it return true or false?), which causes this kind of confusion.
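> For reference, SQL's three-valued logic makes NULL = NULL evaluate to NULL rather than true, while the null-safe operator <=> returns true; except compares rows null-safely, so NULLs match NULLs. A quick check in any Spark 2.x shell:
> scala> spark.sql("SELECT null = null").show()    // prints a single NULL value
> scala> spark.sql("SELECT null <=> null").show()  // prints true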



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org