Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2022/06/27 02:38:00 UTC
[jira] [Commented] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558955#comment-17558955 ]
Yibo Cai commented on ARROW-16807:
----------------------------------
It looks like the current {{count_distinct}} doesn't handle chunked arrays. It simply sums the distinct counts of the individual chunks.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/aggregate_basic.cc#L159
That is wrong whenever values are duplicated across chunks.
E.g., for two chunks "1,2,3" and "1,2,3", the current count_distinct kernel outputs 3+3=6 instead of 3.
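To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is not the Arrow kernel code: chunks are modelled as plain lists, the function names {{naive_count_distinct}} and {{merged_count_distinct}} are hypothetical, and null handling is not modelled.

```python
from typing import Hashable, Iterable, Set


def naive_count_distinct(chunks: Iterable[Iterable[Hashable]]) -> int:
    """Model of the behaviour described above: count distinct values
    inside each chunk independently, then add the per-chunk counts."""
    return sum(len(set(chunk)) for chunk in chunks)


def merged_count_distinct(chunks: Iterable[Iterable[Hashable]]) -> int:
    """Keep one set of seen values across all chunks, so a value that
    appears in several chunks is only counted once."""
    seen: Set[Hashable] = set()
    for chunk in chunks:
        seen.update(chunk)
    return len(seen)


if __name__ == "__main__":
    chunks = [[1, 2, 3], [1, 2, 3]]
    print(naive_count_distinct(chunks))   # 6
    print(merged_count_distinct(chunks))  # 3
```

The two functions agree only when no value appears in more than one chunk, which would fit the report that a single row group gives the correct result.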
> [C++] count_distinct aggregates incorrectly across row groups
> -------------------------------------------------------------
>
> Key: ARROW-16807
> URL: https://issues.apache.org/jira/browse/ARROW-16807
> Project: Apache Arrow
> Issue Type: Bug
> Environment: > arrow::arrow_info()
> Arrow package version: 8.0.0.9000
> Capabilities:
>
> dataset TRUE
> substrait FALSE
> parquet TRUE
> json TRUE
> s3 TRUE
> utf8proc TRUE
> re2 TRUE
> snappy TRUE
> gzip TRUE
> brotli TRUE
> zstd TRUE
> lz4 TRUE
> lz4_frame TRUE
> lzo FALSE
> bz2 TRUE
> jemalloc TRUE
> mimalloc FALSE
> Memory:
>
> Allocator jemalloc
> Current 37.25 Kb
> Max 925.42 Kb
> Runtime:
>
> SIMD Level none
> Detected SIMD Level none
> Build:
>
> C++ Library Version 9.0.0-SNAPSHOT
> C++ Compiler AppleClang
> C++ Compiler Version 13.1.6.13160021
> Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9
> Reporter: Edward Visel
> Priority: Blocker
> Fix For: 9.0.0, 8.0.1
>
>
> When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
> {code:r}
> library(dplyr, warn.conflicts = FALSE)
> path <- tempfile(fileext = '.parquet')
> arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
> ds <- arrow::open_dataset(path)
> ds %>% count(sex) %>% collect()
> #> # A tibble: 5 × 2
> #> sex n
> #> <chr> <int>
> #> 1 male 60
> #> 2 none 6
> #> 3 female 16
> #> 4 hermaphroditic 1
> #> 5 <NA> 4
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 19
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 17
> # correct
> ds %>% collect() %>% summarise(n = n_distinct(sex))
> #> # A tibble: 1 × 1
> #> n
> #> <int>
> #> 1 5
> {code}
> If the file is stored as a single row group, results are correct. When grouped, results are correct.
> I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> pa.__version__
> #> 8.0.0
> starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
> pa.compute.count_distinct(starwars.column('sex')).as_py()
> #> 15
> pa.compute.unique(starwars.column('sex'))
> #> [
> #> "male",
> #> "none",
> #> "female",
> #> "hermaphroditic",
> #> null
> #> ]
> {code}
> This is likely the same problem as in this StackOverflow question, which reads from ORC files: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
--
This message was sent by Atlassian Jira
(v8.20.7#820007)