Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/01/26 15:47:00 UTC
[jira] [Commented] (ARROW-15312) [R][C++] filtering a dataset with is.na() misses some rows

    [ https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482584#comment-17482584 ] 

Nicola Crane commented on ARROW-15312:
--------------------------------------

Here are the results with both code and output:

{code:r}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

ds_path = "test-arrow-na"

df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))

df %>%
    arrow::write_dataset(ds_path)

# OK: Collect then filter: returns row 3, as expected
arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
#> # A tibble: 1 × 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     3    NA    NA

# ERROR: Filter then collect (on y) returns a tibble with no rows
arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
#> # A tibble: 0 × 3
#> # … with 3 variables: x <int>, y <int>, z <int>

# OK: Filter then collect (on z) returns row 3, as expected
arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect() 
#> # A tibble: 1 × 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     3    NA    NA

# This is as expected
arrow::read_parquet("test-arrow-na/part-0.parquet", as_data_frame = FALSE) %>%
  filter(is.na(y)) %>%
  collect()
#> # A tibble: 1 × 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     3    NA    NA

{code}

> [R][C++] filtering a dataset with is.na() misses some rows
> ----------------------------------------------------------
>
>                 Key: ARROW-15312
>                 URL: https://issues.apache.org/jira/browse/ARROW-15312
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: R 4.1.2 on Windows
> arrow 6.0.1
> dplyr 1.0.7
>            Reporter: Pierre Gramme
>            Priority: Major
>
> Hi!
> I just found an issue when querying an Arrow dataset with dplyr and filtering on is.na(...).
> It seems to be linked to columns that contain only one distinct value plus some NAs.
> Can you also reproduce the following?
>  
> {code:r}
>   library(arrow)
>   library(dplyr)
>   
>   ds_path = "test-arrow-na"
>   df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
>   
>   df %>% arrow::write_dataset(ds_path)
>   
>   # OK: Collect then filter: returns row 3, as expected
>   arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
>   # ERROR: Filter then collect (on y) returns a tibble with no rows
>   arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
>   
>   # OK: Filter then collect (on z) returns row 3, as expected
>   arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
> {code}
>  
> Thanks
> Pierre


