You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nikolay Sokolov (Jira)" <ji...@apache.org> on 2021/05/24 17:20:00 UTC
[jira] [Created] (SPARK-35504) count distinct asterisk
Nikolay Sokolov created SPARK-35504:
---------------------------------------
Summary: count distinct asterisk
Key: SPARK-35504
URL: https://issues.apache.org/jira/browse/SPARK-35504
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Environment: {code:java}
uname -a
Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
{code}
{code:java}
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
{code}
{code:java}
/opt/spark/bin/spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.0
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user ubuntu on 2020-06-06T13:05:28Z
Revision 3fdfce3120f307147244e5eaf46d61419a723d50
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
{code}
{code:java}
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3602.011
BogoMIPS: 6000.01
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
{code}
Reporter: Nikolay Sokolov
Hi everyone,
I hope you're well!
Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.
I still can't figure out on my own if this is a bug or a feature and I would like to share what I found.
I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries.
So, I have two identical Spark SQL queries from an algorithmic point of view that return different results.
The first query:
{code:sql}
select count(distinct *) unique_amt from storage_datamart.olympiads
; -- Rows: 13437678
{code}
The second query:
{code:sql}
select count(*) from (select distinct * from storage_datamart.olympiads)
; -- Rows: 36901430
{code}
The result of the two queries is different. (But it must be the same, right!?)
{code:sql}
select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads
union all
select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads)
;
{code}
The of the above query is the following:
{code:java}
The first query 13437678
The second query 36901430
{code}
I can easily calculate the unique number of rows in the table:
{code:sql}
select count(*) from (
select student_id, olympiad_id, tour, grade
from storage_datamart.olympiads
group by student_id, olympiad_id, tour, grade
having count(*) = 1
)
; -- Rows: 36901365
{code}
The table DDL is the following:
{code:sql}
CREATE TABLE `storage_datamart`.`olympiads` (
`ptn_date` DATE,
`student_id` BIGINT,
`olympiad_id` STRING,
`grade` BIGINT,
`grade_type` STRING,
`tour` STRING,
`created_at` TIMESTAMP,
`created_at_local` TIMESTAMP,
`olympiad_num` BIGINT,
`olympiad_name` STRING,
`subject` STRING,
`started_at` TIMESTAMP,
`ended_at` TIMESTAMP,
`region_id` BIGINT,
`region_name` STRING,
`municipality_name` STRING,
`school_id` BIGINT,
`school_name` STRING,
`school_status` BOOLEAN,
`oly_n_common` INT,
`num_day` INT,
`award_type` STRING,
`new_student_legacy` INT,
`segment` STRING,
`total_start` TIMESTAMP,
`total_end` TIMESTAMP,
`year_learn` STRING,
`parent_id` BIGINT,
`teacher_id` BIGINT,
`parallel` BIGINT,
`olympiad_type` STRING)
USING parquet
LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet'
;
{code}
Could you please tell me why in the first Spark SQL query counting the unique number of rows using the construction `count(distinct *)` does not count correctly and why the result of the two Spark SQL queries is different??
Thanks in advance.
p.s. I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]:
{quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null.
{quote}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org