You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Igor Berman (Jira)" <ji...@apache.org> on 2022/06/19 06:32:00 UTC
[jira] [Comment Edited] (SPARK-39467) Count on distinct asterisk not equals to the count with column names provided

    [ https://issues.apache.org/jira/browse/SPARK-39467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556033#comment-17556033 ] 

Igor Berman edited comment on SPARK-39467 at 6/19/22 6:31 AM:
--------------------------------------------------------------

[~hyukjin.kwon] 
Hi, please take a look at a difference between count(distinct(*)) and count(distinct(_1, _2, _3, _4, _5, _6))
I don't believe it's connected to what you posted

from spark sql [documentation|https://spark.apache.org/docs/3.3.0/sql-ref-functions-builtin.html#aggregate-functions] it's not well defined how (if at all) count(distinct(*)) should behave
{code}
count(DISTINCT expr[, expr...])	Returns the number of rows for which the supplied expression(s) are unique and non-null.
{code}

just for the reference:
mssql, vertica, mysql just prohibit this usage
imo, it should be prohibited in sparksql as well


was (Author: igor.berman):
[~hyukjin.kwon] 
Hi, please take a look at a difference between count(distinct(*)) and count(distinct(_1, _2, _3, _4, _5, _6))
I don't believe it's connected to what you posted

from spark sql [documentation|https://spark.apache.org/docs/3.3.0/sql-ref-functions-builtin.html#aggregate-functions] it's not well defined how (if at all) count(distinct(*)) should behave
{code}
count(DISTINCT expr[, expr...])	Returns the number of rows for which the supplied expression(s) are unique and non-null.
{code}

just for the reference:
myssql, vertica, mysql just prohibit this usage
imo, it should be prohibited in sparksql as well

> Count on distinct asterisk not equals to the count with column names provided
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-39467
>                 URL: https://issues.apache.org/jira/browse/SPARK-39467
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core, SQL
>    Affects Versions: 3.1.3
>         Environment: Spark 3.1.3 vanilla 
>            Reporter: Michael Taranov
>            Priority: Minor
>
> Hi everyone,
> We came across a case when count distinct with asterisk produce incorrect result comparing to count distinct when all columns provided.
> Example provide below:
> {noformat}
> scala> val df = Seq(
>      |     (1655172,1463032,"PHON","US",null,1),
>      |     (1655172,1061329,"DESK","AU",null,3),
>      |     (1655172,1334977,"MOBILE","US",null,23),
>      |     (1655172,1165470,"PHON","CR",null,12),
>      |     (1655172,1021215,"PHON","CA","USD",11)).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]
> scala> df.printSchema
> root
>  |-- _1: integer (nullable = false)
>  |-- _2: integer (nullable = false)
>  |-- _3: string (nullable = true)
>  |-- _4: string (nullable = true)
>  |-- _5: string (nullable = true)
>  |-- _6: integer (nullable = false)
> scala> df.createOrReplaceTempView("a_table")
> scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> |count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> |5       |1                                     |5                                                                           |
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> {noformat}
> We understand that this is somehow related to null values but in our understanding asterisk should mimic same behavior as all columns provided.
> If there is any documentation about this It would be nice to read.
> Any help would be appreciated. 
> Michael



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org