Posted to jira@arrow.apache.org by "Jeroen van Straten (Jira)" <ji...@apache.org> on 2022/07/05 11:55:00 UTC

[jira] [Commented] (ARROW-16904) [C++] min/max not deterministic if Parquet files have multiple row groups

    [ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562627#comment-17562627 ] 

Jeroen van Straten commented on ARROW-16904:
--------------------------------------------

I probably fixed this as part of [https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16700] / [https://github.com/apache/arrow/pull/13509]. min/max did not work correctly when multiple Consume calls were chained on the same ScalarAggregator instance: only the last call affected the state. I'm not deep enough into Acero, though, to know under what circumstances it follows this pattern (which was broken and is not tested) and under what circumstances it calls Consume only once per instance and then Merges the instances (which works correctly and is tested).
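
To make the two call patterns concrete, here is a minimal sketch. ToyMinAggregator, ChainedMin, and MergedMin are hypothetical names invented for illustration; this is not Arrow's actual ScalarAggregator code, and the overwrite in the buggy branch is only an assumed stand-in for "only the last call would affect the state".

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical toy aggregator, NOT Arrow's real kernel implementation.
struct ToyMinAggregator {
  int64_t state = std::numeric_limits<int64_t>::max();
  bool buggy;  // true: mimic the broken behaviour described above

  explicit ToyMinAggregator(bool buggy_consume) : buggy(buggy_consume) {}

  // Fold one batch into the running state.
  void Consume(const std::vector<int64_t>& batch) {
    if (batch.empty()) return;
    const int64_t local = *std::min_element(batch.begin(), batch.end());
    if (buggy) {
      state = local;                   // overwrites: only the last call counts
    } else {
      state = std::min(state, local);  // accumulates across calls
    }
  }

  // Combine the state of another instance into this one.
  void Merge(const ToyMinAggregator& other) {
    state = std::min(state, other.state);
  }
};

// Pattern 1: several Consume calls chained on a single instance.
int64_t ChainedMin(const std::vector<std::vector<int64_t>>& batches, bool buggy) {
  ToyMinAggregator agg(buggy);
  for (const auto& batch : batches) agg.Consume(batch);
  return agg.state;
}

// Pattern 2: one instance per batch, Consume once, then Merge the instances.
int64_t MergedMin(const std::vector<std::vector<int64_t>>& batches, bool buggy) {
  ToyMinAggregator total(buggy);
  for (const auto& batch : batches) {
    ToyMinAggregator part(buggy);
    part.Consume(batch);
    total.Merge(part);
  }
  return total.state;
}
```

With the batches {5, 6}, {1, 2}, {7, 8}, the buggy chained pattern returns 7 (the minimum of the last batch only), while the merge pattern returns 1 even with the buggy Consume, because each instance sees exactly one batch.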

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -------------------------------------------------------------------------
>
>                 Key: ARROW-16904
>                 URL: https://issues.apache.org/jira/browse/ARROW-16904
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0
>         Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 20.04.4 LTS
> Release:        20.04
> Codename:       focal
>            Reporter: Robert On
>            Assignee: Aldrin M
>            Priority: Blocker
>             Fix For: 9.0.0
>
>
> The following code produces non-deterministic results when computing the minimum value of sequences of 1e5 and 1e6 integers.
> {code:r}
> library(dplyr)  # attaches the %>% pipe used below
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum (1) all 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum (1) only 65 out of 100 times. The remaining runs return 131073, 262145, and 393217 (each one more than a multiple of 131072) 25, 8, and 2 times, respectively.
> {code:r}
> .
>   1 
> 100 
> .
>      1 131073 262145 393217 
>     65     25      8      2 
> {code}
>  
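
The wrong values reported above are consistent with the data being processed in chunks of 131072 (2^17) rows. That chunk size is an assumption inferred from the reported numbers, not something stated in the issue. A small sketch of the arithmetic, using the hypothetical helper ChunkMinima:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: when the sequence 1..n is split into consecutive
// chunks of `chunk_rows` rows, the minimum of each chunk is its first value.
std::vector<int64_t> ChunkMinima(int64_t n, int64_t chunk_rows) {
  std::vector<int64_t> minima;
  for (int64_t start = 1; start <= n; start += chunk_rows) {
    minima.push_back(start);
  }
  return minima;
}
```

Under that assumption, ChunkMinima(1000000, 131072) starts with 1, 131073, 262145, 393217, which matches the values in the table, while ChunkMinima(100000, 131072) contains only 1, which would explain why the 1e5 case always returns the right answer.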


