You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Edward Visel (Jira)" <ji...@apache.org> on 2022/06/10 17:01:00 UTC
[jira] [Created] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups
Edward Visel created ARROW-16808:
------------------------------------
Summary: [C++] count_distinct aggregates incorrectly across row groups
Key: ARROW-16808
URL: https://issues.apache.org/jira/browse/ARROW-16808
Project: Apache Arrow
Issue Type: Bug
Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE
Memory:
Allocator jemalloc
Current 37.25 Kb
Max 925.42 Kb
Runtime:
SIMD Level none
Detected SIMD Level none
Build:
C++ Library Version 9.0.0-SNAPSHOT
C++ Compiler AppleClang
C++ Compiler Version 13.1.6.13160021
Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9
Reporter: Edward Visel
Fix For: 9.0.0, 8.0.1
When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)
path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
ds <- arrow::open_dataset(path)
ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#> sex n
#> <chr> <int>
#> 1 male 60
#> 2 none 6
#> 3 female 16
#> 4 hermaphroditic 1
#> 5 <NA> 4
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 17
# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 5
{code}
If the file is stored as a single row group, results are correct. When grouped, results are correct.
I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
pa.__version__
#> 8.0.0
starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#> "male",
#> "none",
#> "female",
#> "hermaphroditic",
#> null
#> ]
{code}
This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)