Posted to user@spark.apache.org by wilson <wi...@4shield.net> on 2022/05/01 03:01:08 UTC

spark null values calculation

my dataset has NULL values in some of its columns.
do you know why the select results below behave inconsistently?

scala> dfs.select("cand_status").count()
val res37: Long = 881793 


scala> dfs.select("cand_status").where($"cand_status" =!= "NULL").count()
val res38: Long = 383717 


scala> dfs.select("cand_status").where($"cand_status" === "NULL").count()
val res39: Long = 86402 


scala> dfs.select("cand_status").where($"cand_status" === "NULL").where($"cand_status" =!= "NULL").count()
val res40: Long = 0


as you can see, 383717 + 86402 != 881793,
whereas I would expect them to be equal.

Thanks.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: spark null values calculation

Posted by wilson <wi...@4shield.net>.
sorry, I have found the reason: null values cannot be compared
directly. I have written a note about this:
https://bigcount.xyz/how-spark-handles-null-and-abnormal-values.html
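
For the archive: the gap comes from SQL three-valued logic. Comparing 
an actual null with === or =!= yields null, not false, and where() 
drops rows whose predicate is null. So the 411674 rows missing from 
both counts (881793 - 383717 - 86402) are presumably true nulls, which 
are distinct from the literal string "NULL". A sketch of how they could 
be counted in the same session (assuming dfs is the DataFrame above; 
the counts are only an expectation, not verified output):

scala> // rows whose cand_status is an actual null (expected ~411674)
scala> dfs.select("cand_status").where($"cand_status".isNull).count()

scala> // null-safe equality <=> treats null as an ordinary value,
scala> // so this partition covers every row, including nulls
scala> dfs.select("cand_status").where($"cand_status" <=> "NULL").count()
scala> dfs.select("cand_status").where(!($"cand_status" <=> "NULL")).count()

If the dataset really mixes the string marker "NULL" with true nulls, 
isNull/isNotNull (or the null-safe <=> operator) are the reliable way 
to split it into counts that add up to the total.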

Thanks.

wilson wrote:
> do you know why the select results below behave inconsistently?
