Posted to issues@spark.apache.org by "Michael Taranov (Jira)" <ji...@apache.org> on 2022/06/14 11:59:00 UTC
[jira] [Created] (SPARK-39467) Count on distinct asterisk not equals to the count with column names provided
Michael Taranov created SPARK-39467:
---------------------------------------
Summary: Count on distinct asterisk not equals to the count with column names provided
Key: SPARK-39467
URL: https://issues.apache.org/jira/browse/SPARK-39467
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.1.3
Environment: Spark 3.1.3 vanilla
Reporter: Michael Taranov
Hi everyone,
We came across a case where count(distinct(*)) produces a different result from count(distinct(...)) with all column names listed explicitly.
An example is provided below:
{noformat}
scala> val df = Seq(
| (1655172,1463032,"PHON","US",null,1),
| (1655172,1061329,"DESK","AU",null,3),
| (1655172,1334977,"MOBILE","US",null,23),
| (1655172,1165470,"PHON","CR",null,12),
| (1655172,1021215,"PHON","CA","USD",11)).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]
scala> df.printSchema
root
|-- _1: integer (nullable = false)
|-- _2: integer (nullable = false)
|-- _3: string (nullable = true)
|-- _4: string (nullable = true)
|-- _5: string (nullable = true)
|-- _6: integer (nullable = false)
scala> df.createOrReplaceTempView("a_table")
scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
+--------+--------------------------------------+----------------------------------------------------------------------------+
|count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
+--------+--------------------------------------+----------------------------------------------------------------------------+
|5       |1                                     |5                                                                           |
+--------+--------------------------------------+----------------------------------------------------------------------------+
{noformat}
We understand that this is somehow related to the null values in column _5, but we expected the asterisk to behave the same as listing all the columns explicitly.
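The column headers in the output hint at a possible explanation: the asterisk appears to have been expanded into a multi-column count(DISTINCT _1, ..., _6), while the parenthesized column list was wrapped in a single named_struct. The Spark SQL function reference describes count(DISTINCT expr[, expr...]) as counting rows for which the expressions are unique and non-null. The sketch below is a plain-Scala model of those two rules, not Spark's implementation. The object name and both helper functions are hypothetical, and it assumes that a struct with a null field is itself non-null.

```scala
// Plain-Scala model (no Spark dependency) of the two counting rules.
// All names here are hypothetical and for illustration only.
object CountDistinctSketch {
  // A row is a sequence of nullable cells; None stands for SQL NULL.
  type Row = Seq[Option[Any]]

  // The five rows from the example above.
  val rows: Seq[Row] = Seq(
    Seq(Some(1655172), Some(1463032), Some("PHON"), Some("US"), None, Some(1)),
    Seq(Some(1655172), Some(1061329), Some("DESK"), Some("AU"), None, Some(3)),
    Seq(Some(1655172), Some(1334977), Some("MOBILE"), Some("US"), None, Some(23)),
    Seq(Some(1655172), Some(1165470), Some("PHON"), Some("CR"), None, Some(12)),
    Seq(Some(1655172), Some(1021215), Some("PHON"), Some("CA"), Some("USD"), Some(11))
  )

  // Model of count(DISTINCT c1, ..., cn): a row is skipped as soon as
  // any of its cells is NULL, then the remaining rows are de-duplicated.
  def countDistinctColumns(rows: Seq[Row]): Int =
    rows.filter(_.forall(_.isDefined)).distinct.size

  // Model of count(DISTINCT struct(c1, ..., cn)): assuming the struct value
  // is non-null even when a field is NULL, every row takes part.
  def countDistinctStruct(rows: Seq[Row]): Int =
    rows.distinct.size

  def main(args: Array[String]): Unit = {
    println(countDistinctColumns(rows)) // prints 1
    println(countDistinctStruct(rows))  // prints 5
  }
}
```

Under these assumptions the model gives 1 and 5, matching the second and third columns of the output above: only the last row has a non-null _5.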
If there is documentation covering this behavior, we would appreciate a pointer to it.
Any help would be appreciated.
Michael