You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Tomasz Kus (Jira)" <ji...@apache.org> on 2021/11/10 12:23:00 UTC
[jira] [Updated] (SPARK-37270) Incorect result of filter using
isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tomasz Kus updated SPARK-37270:
-------------------------------
Component/s: SQL
> Incorect result of filter using isNull condition
> ------------------------------------------------
>
> Key: SPARK-37270
> URL: https://issues.apache.org/jira/browse/SPARK-37270
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 3.2.0
> Reporter: Tomasz Kus
> Priority: Major
>
> Simple code that allows to reproduce this issue:
> {code:java}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
> .checkpoint()
> .withColumn("conditions", when(col("bool"), "I am not null"))
> .filter(col("conditions").isNull)
> .show(false){code}
> Although "conditions" column is null
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1 |null |
> +-----+------+----------+{code}
> empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
> +- LogicalRDD [bool#124, number#125], false
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
> +- LogicalRDD [bool#124, number#125], false
> == Optimized Logical Plan ==
> LocalRelation <empty>, [bool#124, number#125, conditions#252]
> == Physical Plan ==
> LocalTableScan <empty>, [bool#124, number#125, conditions#252]
> {code}
> After removing checkpoint proper result is returned and execution plans are as follow:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
> +- Project [_1#119 AS bool#124, _2#120 AS number#125]
> +- LocalRelation [_1#119, _2#120]
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
> +- Project [_1#119 AS bool#124, _2#120 AS number#125]
> +- LocalRelation [_1#119, _2#120]
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
> {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation
> There are following ways (workarounds) to retrieve correct result:
> 1) remove checkpoint
> 2) add explicit .otherwise(null) to when
> 3) add checkpoint() or cache() just before filter
> 4) downgrade to Spark 3.1.2
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org