Posted to issues@spark.apache.org by "Marco Gaido (JIRA)" <ji...@apache.org> on 2017/10/19 15:51:00 UTC

[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

    [ https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211239#comment-16211239 ] 

Marco Gaido commented on SPARK-22307:
-------------------------------------

Have you checked whether `col1` evaluates to null for the missing records? If so, this is not a bug but the behavior the SQL standard prescribes: a comparison involving null evaluates to null, NOT of null is still null, and a filter keeps only rows whose condition is true, so null is treated like false. The rows where `col1` is null are therefore correctly dropped by both `filter(col1)` and `filter(not(col1))`.
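
To illustrate the point, here is a minimal sketch in plain Scala (no Spark dependency) that models SQL three-valued logic with `Option[Boolean]`, where `None` stands for null. The object `ThreeValuedLogic`, its helpers `not3`, `or3`, `in3` and `keeps`, and the sample rows are all hypothetical names and data made up for this illustration; they are not Spark APIs and not the reporter's data.

```scala
// Toy model of SQL three-valued logic; None stands for SQL NULL.
// Every name below is hypothetical and exists only for this illustration.
object ThreeValuedLogic {
  type SqlBool = Option[Boolean]

  // SQL NOT: NOT NULL is still NULL.
  def not3(a: SqlBool): SqlBool = a.map(!_)

  // SQL OR: TRUE wins over NULL, but FALSE OR NULL is NULL.
  def or3(a: SqlBool, b: SqlBool): SqlBool = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // SQL IN against non-null literals: a NULL operand yields NULL.
  def in3(value: Option[String], wanted: Set[String]): SqlBool =
    value.map(wanted.contains)

  // A filter keeps a row only when its condition is exactly TRUE.
  def keeps(cond: SqlBool): Boolean = cond.contains(true)

  // Each row lists the Id at every hierarchy level; None is an unfilled level.
  val rows: Seq[Seq[Option[String]]] = Seq(
    Seq(Some("A"), Some("X"), Some("Y"), Some("Z")), // match at the top level -> TRUE
    Seq(Some("B"), Some("A"), None, None),           // match on the parent    -> TRUE
    Seq(Some("C"), Some("D"), Some("E"), Some("F")), // all levels, no match   -> FALSE
    Seq(Some("G"), None, None, None),                // no match, nulls        -> NULL
    Seq(Some("H"), Some("I"), None, None)            // no match, nulls        -> NULL
  )

  val wanted: Set[String] = Set("A")

  // Same shape as col1: an OR over "Id IN (...)" at every level.
  def cond(row: Seq[Option[String]]): SqlBool =
    row.map(in3(_, wanted)).foldLeft[SqlBool](Some(false))(or3)

  val matched: Int    = rows.count(r => keeps(cond(r)))
  val notMatched: Int = rows.count(r => keeps(not3(cond(r))))
  // Mirrors when(cond, false).otherwise(true): NULL falls into "otherwise".
  val workaround: Int = rows.count(r => !keeps(cond(r)))

  def main(args: Array[String]): Unit =
    println(s"total=${rows.size} matched=$matched not=$notMatched workaround=$workaround")
}
```

In this toy data the condition and its negation together keep 3 of the 5 rows, because the two rows whose condition is null satisfy neither. If the null rows should be counted as non-matching in Spark itself, one option (assuming that is the intended semantics) is to make the condition null-free first, for example `not(coalesce(col1, lit(false)))`.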

> NOT condition working incorrectly
> ---------------------------------
>
>                 Key: SPARK-22307
>                 URL: https://issues.apache.org/jira/browse/SPARK-22307
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Andrey Yakovenko
>         Attachments: Catalog.json.gz
>
>
> Suggested test case: when a table with x records is filtered by an expression expr and returns y records (y < x), filtering by not(expr) does not return x-y records. Workaround: when(expr, false).otherwise(true) works as expected.
> {code}
> scala> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  |    |-- Id: string (nullable = true)
>  |    |-- Name: string (nullable = true)
>  |    |-- Parent: struct (nullable = true)
>  |    |    |-- Id: string (nullable = true)
>  |    |    |-- Name: string (nullable = true)
>  |    |    |-- Parent: struct (nullable = true)
>  |    |    |    |-- Id: string (nullable = true)
>  |    |    |    |-- Name: string (nullable = true)
>  |    |    |    |-- Parent: string (nullable = true)
>  |    |    |    |-- SKU: string (nullable = true)
>  |    |    |-- SKU: string (nullable = true)
>  |    |-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> scala> val col1 = expr("((((Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = ((((Id IN (13MXIIAA4, 13MXIBAA4)) OR (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> The table is a hierarchy-like structure whose records have different numbers of levels filled in. I suspect that, because the hierarchy is only partly filled, the condition sometimes evaluates to null/undefined/failed/NaN (neither true nor false).


