You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nikolay Sokolov (Jira)" <ji...@apache.org> on 2021/05/24 17:22:00 UTC
[jira] [Updated] (SPARK-35504) count distinct asterisk

     [ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikolay Sokolov updated SPARK-35504:
------------------------------------
    Description: 
Hi everyone,

I hope you're well!

 

Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.

I still can't figure out on my own if this is a bug or a feature and I would like to share what I found.

 

I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries.

 

So, I have two identical Spark SQL queries from an algorithmic point of view that return different results.

 

The first query:
{code:sql}
select count(distinct *) unique_amt from storage_datamart.olympiads
; -- Rows: 13437678
{code}
 

The second query:
{code:sql}
select count(*) from (select distinct * from storage_datamart.olympiads)
; -- Rows: 36901430
{code}
 

The result of the two queries is different. (But it must be the same, right!?)
{code:sql}
select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads
 union all
select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads)
;
{code}
 

The result of the above query is the following:
{code:java}
The first query    13437678
The second query   36901430
{code}
 
 I can easily calculate the unique number of rows in the table:
{code:sql}
select count(*) from (
  select student_id, olympiad_id, tour, grade
    from storage_datamart.olympiads
   group by student_id, olympiad_id, tour, grade
  having count(*) = 1
)
; -- Rows: 36901365
{code}
 

The table DDL is the following:
{code:sql}
CREATE TABLE `storage_datamart`.`olympiads` (
  `ptn_date` DATE,
  `student_id` BIGINT,
  `olympiad_id` STRING,
  `grade` BIGINT,
  `grade_type` STRING,
  `tour` STRING,
  `created_at` TIMESTAMP,
  `created_at_local` TIMESTAMP,
  `olympiad_num` BIGINT,
  `olympiad_name` STRING,
  `subject` STRING,
  `started_at` TIMESTAMP,
  `ended_at` TIMESTAMP,
  `region_id` BIGINT,
  `region_name` STRING,
  `municipality_name` STRING,
  `school_id` BIGINT,
  `school_name` STRING,
  `school_status` BOOLEAN,
  `oly_n_common` INT,
  `num_day` INT,
  `award_type` STRING,
  `new_student_legacy` INT,
  `segment` STRING,
  `total_start` TIMESTAMP,
  `total_end` TIMESTAMP,
  `year_learn` STRING,
  `parent_id` BIGINT,
  `teacher_id` BIGINT,
  `parallel` BIGINT,
  `olympiad_type` STRING)
USING parquet
LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet'
;
{code}
 

Could you please tell me why in the first Spark SQL query counting the unique number of rows using the construction `count(distinct *)` does not count correctly and why the result of the two Spark SQL queries is different??

Thanks in advance.

 

p.s. I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]:
{quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null.
{quote}
 

  was:
Hi everyone,

I hope you're well!

 

Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.

I still can't figure out on my own if this is a bug or a feature and I would like to share what I found.

 

I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries.

 

So, I have two identical Spark SQL queries from an algorithmic point of view that return different results.

 

The first query:
{code:sql}
select count(distinct *) unique_amt from storage_datamart.olympiads
; -- Rows: 13437678
{code}
 

The second query:
{code:sql}
select count(*) from (select distinct * from storage_datamart.olympiads)
; -- Rows: 36901430
{code}
 

The result of the two queries is different. (But it must be the same, right!?)
{code:sql}
select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads
 union all
select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads)
;
{code}
 

The of the above query is the following:
{code:java}
The first query    13437678
The second query   36901430
{code}
 
 I can easily calculate the unique number of rows in the table:
{code:sql}
select count(*) from (
  select student_id, olympiad_id, tour, grade
    from storage_datamart.olympiads
   group by student_id, olympiad_id, tour, grade
  having count(*) = 1
)
; -- Rows: 36901365
{code}
 

The table DDL is the following:
{code:sql}
CREATE TABLE `storage_datamart`.`olympiads` (
  `ptn_date` DATE,
  `student_id` BIGINT,
  `olympiad_id` STRING,
  `grade` BIGINT,
  `grade_type` STRING,
  `tour` STRING,
  `created_at` TIMESTAMP,
  `created_at_local` TIMESTAMP,
  `olympiad_num` BIGINT,
  `olympiad_name` STRING,
  `subject` STRING,
  `started_at` TIMESTAMP,
  `ended_at` TIMESTAMP,
  `region_id` BIGINT,
  `region_name` STRING,
  `municipality_name` STRING,
  `school_id` BIGINT,
  `school_name` STRING,
  `school_status` BOOLEAN,
  `oly_n_common` INT,
  `num_day` INT,
  `award_type` STRING,
  `new_student_legacy` INT,
  `segment` STRING,
  `total_start` TIMESTAMP,
  `total_end` TIMESTAMP,
  `year_learn` STRING,
  `parent_id` BIGINT,
  `teacher_id` BIGINT,
  `parallel` BIGINT,
  `olympiad_type` STRING)
USING parquet
LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet'
;
{code}
 

Could you please tell me why in the first Spark SQL query counting the unique number of rows using the construction `count(distinct *)` does not count correctly and why the result of the two Spark SQL queries is different??

Thanks in advance.

 

p.s. I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]:
{quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null.
{quote}
 


> count distinct asterisk 
> ------------------------
>
>                 Key: SPARK-35504
>                 URL: https://issues.apache.org/jira/browse/SPARK-35504
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:	Ubuntu
> Description:	Ubuntu 18.04.4 LTS
> Release:	18.04
> Codename:	bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>       /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           1
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               85
> Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:            7
> CPU MHz:             3602.011
> BogoMIPS:            6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            1024K
> L3 cache:            36608K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>            Reporter: Nikolay Sokolov
>            Priority: Minor
>
> Hi everyone,
> I hope you're well!
>  
> Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.
> I still can't figure out on my own if this is a bug or a feature and I would like to share what I found.
>  
> I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries.
>  
> So, I have two identical Spark SQL queries from an algorithmic point of view that return different results.
>  
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 13437678
> {code}
>  
> The second query:
> {code:sql}
> select count(*) from (select distinct * from storage_datamart.olympiads)
> ; -- Rows: 36901430
> {code}
>  
> The result of the two queries is different. (But it must be the same, right!?)
> {code:sql}
> select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads
>  union all
> select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads)
> ;
> {code}
>  
> The result of the above query is the following:
> {code:java}
> The first query    13437678
> The second query   36901430
> {code}
>  
>  I can easily calculate the unique number of rows in the table:
> {code:sql}
> select count(*) from (
>   select student_id, olympiad_id, tour, grade
>     from storage_datamart.olympiads
>    group by student_id, olympiad_id, tour, grade
>   having count(*) = 1
> )
> ; -- Rows: 36901365
> {code}
>  
> The table DDL is the following:
> {code:sql}
> CREATE TABLE `storage_datamart`.`olympiads` (
>   `ptn_date` DATE,
>   `student_id` BIGINT,
>   `olympiad_id` STRING,
>   `grade` BIGINT,
>   `grade_type` STRING,
>   `tour` STRING,
>   `created_at` TIMESTAMP,
>   `created_at_local` TIMESTAMP,
>   `olympiad_num` BIGINT,
>   `olympiad_name` STRING,
>   `subject` STRING,
>   `started_at` TIMESTAMP,
>   `ended_at` TIMESTAMP,
>   `region_id` BIGINT,
>   `region_name` STRING,
>   `municipality_name` STRING,
>   `school_id` BIGINT,
>   `school_name` STRING,
>   `school_status` BOOLEAN,
>   `oly_n_common` INT,
>   `num_day` INT,
>   `award_type` STRING,
>   `new_student_legacy` INT,
>   `segment` STRING,
>   `total_start` TIMESTAMP,
>   `total_end` TIMESTAMP,
>   `year_learn` STRING,
>   `parent_id` BIGINT,
>   `teacher_id` BIGINT,
>   `parallel` BIGINT,
>   `olympiad_type` STRING)
> USING parquet
> LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet'
> ;
> {code}
>  
> Could you please tell me why in the first Spark SQL query counting the unique number of rows using the construction `count(distinct *)` does not count correctly and why the result of the two Spark SQL queries is different??
> Thanks in advance.
>  
> p.s. I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]:
> {quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null.
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org