Posted to issues@spark.apache.org by "Michael Taranov (Jira)" <ji...@apache.org> on 2022/06/14 11:59:00 UTC

[jira] [Created] (SPARK-39467) Count on distinct asterisk not equals to the count with column names provided

Michael Taranov created SPARK-39467:
---------------------------------------

             Summary: Count on distinct asterisk not equals to the count with column names provided
                 Key: SPARK-39467
                 URL: https://issues.apache.org/jira/browse/SPARK-39467
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.1.3
         Environment: Spark 3.1.3 vanilla 
            Reporter: Michael Taranov


Hi everyone,

We came across a case where count distinct with an asterisk produces an incorrect result.

An example is provided below:
{noformat}
scala> val df = Seq(
     |     (1655172,1463032,"PHON","US",null,1),
     |     (1655172,1061329,"DESK","AU",null,3),
     |     (1655172,1334977,"MOBILE","US",null,23),
     |     (1655172,1165470,"PHON","CR",null,12),
     |     (1655172,1021215,"PHON","CA","USD",11)).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]

scala> df.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: integer (nullable = false)
 |-- _3: string (nullable = true)
 |-- _4: string (nullable = true)
 |-- _5: string (nullable = true)
 |-- _6: integer (nullable = false)

scala> df.createOrReplaceTempView("a_table")

scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
+--------+--------------------------------------+----------------------------------------------------------------------------+
|count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
+--------+--------------------------------------+----------------------------------------------------------------------------+
|5       |1                                     |5                                                                           |
+--------+--------------------------------------+----------------------------------------------------------------------------+

{noformat}
We understand that this is somehow related to null values, but in our understanding the asterisk should behave the same as listing all columns explicitly.
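
To make the suspected null-related cause concrete, here is a minimal sketch in plain Scala (no Spark dependency). It assumes that count(DISTINCT c1, ..., cn) skips any row in which at least one argument is null, while a struct built from the same columns is itself non-null and is therefore always counted. The object name CountDistinctModel and its members are hypothetical and only model these two behaviors; this is not Spark's implementation.

```scala
object CountDistinctModel {
  // Rows from the example above; None stands for SQL NULL in column _5.
  type Row = (Int, Int, String, String, Option[String], Int)

  val rows: Seq[Row] = Seq(
    (1655172, 1463032, "PHON", "US", None, 1),
    (1655172, 1061329, "DESK", "AU", None, 3),
    (1655172, 1334977, "MOBILE", "US", None, 23),
    (1655172, 1165470, "PHON", "CR", None, 12),
    (1655172, 1021215, "PHON", "CA", Some("USD"), 11)
  )

  // Assumed model of count(DISTINCT _1, ..., _6):
  // a row is dropped before counting if any argument is NULL.
  // Only _5 can be NULL in this data set, so it is the only column checked.
  val multiArgDistinct: Int = rows.filter(_._5.isDefined).distinct.size

  // Assumed model of count(DISTINCT named_struct(...)):
  // the struct value is non-null even when one of its fields is NULL,
  // so every row takes part in the distinct count.
  val structDistinct: Int = rows.distinct.size

  def main(args: Array[String]): Unit =
    println(s"multi-argument: $multiArgDistinct, struct: $structDistinct")
}
```

Under these assumptions the model yields 1 for the multi-argument form and 5 for the struct form, which matches the second and third columns of the output above: only the last row has a non-null _5.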

If there is documentation about this, it would be nice to read.

Any help would be appreciated. 

Michael


