Posted to jira@arrow.apache.org by "Pierre Gramme (Jira)" <ji...@apache.org> on 2022/01/12 16:25:00 UTC
[jira] [Updated] (ARROW-15312) [R] filtering a dataset with is.na() misses some rows

     [ https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Gramme updated ARROW-15312:
----------------------------------
    Description: 
Hi!

I just found an issue when querying an Arrow dataset with dplyr and filtering on is.na(...): some matching rows are not returned.

It seems linked to columns that contain only one distinct non-NA value plus some NAs.

Can you also reproduce the following?
{quote}{{  library(arrow)}}
{{  library(dplyr)}}
{{  }}
{{  ds_path = "test-arrow-na"}}
{{  df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))}}
{{  }}
{{  df %>% arrow::write_dataset(ds_path)}}
{{  }}
{{  # OK: Collect then filter: returns row 3, as expected}}
{{  arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))}}
{{  }}
{{  # ERROR: Filter then collect (on y) returns a tibble with no rows}}
{{  arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()}}
{{  }}
{{  # OK: Filter then collect (on z) returns row 3, as expected}}
{{  arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()}}{quote}
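
The same check can be written more compactly by comparing row counts. This is only a sketch: the use of tempfile() and the names expected, actual and same_rows are illustrative assumptions and not part of the original report, and whether the comparison comes out TRUE or FALSE depends on the arrow version installed.

```r
library(arrow)
library(dplyr)

# Write the example data to a throwaway dataset directory (illustrative path)
ds_path <- tempfile("test-arrow-na")
df <- tibble(x = 1:3, y = c(0L, 0L, NA_integer_), z = c(0L, 1L, NA_integer_))
write_dataset(df, ds_path)

# Reference result: filter the in-memory tibble with plain dplyr
expected <- df %>% filter(is.na(y))

# Result under test: filter the Arrow dataset first, then collect
actual <- open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()

# TRUE when both paths return the same number of rows
same_rows <- nrow(actual) == nrow(expected)
print(same_rows)
```

Replacing y with z in both filters gives the corresponding check for the column that behaves as expected.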
 

Thanks

Pierre

  was:
Hi !

I just found an issue when querying an Arrow dataset with dplyr, filtering on is.na(...)

It seems linked to columns containing only one distinct value and some NA's.

Can you also reproduce the following?
{quote}  library(arrow)
  library(dplyr)
  
  ds_path = "test-arrow-na"
  df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
  
  df %>% arrow::write_dataset(ds_path)
  
  # OK: Collect then filter: returns row 3, as expected
  arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))

  # ERROR: Filter then collect (on y) returns a tibble with no row
  arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
  
  # OK: Filter then collect (on z) returns row 3, as expected
  arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
{quote}
 

Thanks

Pierre


> [R] filtering a dataset with is.na() misses some rows
> -----------------------------------------------------
>
>                 Key: ARROW-15312
>                 URL: https://issues.apache.org/jira/browse/ARROW-15312
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: R 4.1.2 on Windows
> arrow 6.0.1
> dplyr 1.0.7
>            Reporter: Pierre Gramme
>            Priority: Major
>
> Hi!
> I just found an issue when querying an Arrow dataset with dplyr and filtering on is.na(...): some matching rows are not returned.
> It seems linked to columns that contain only one distinct non-NA value plus some NAs.
> Can you also reproduce the following?
> {quote}{{  library(arrow)}}
> {{  library(dplyr)}}
> {{  }}
> {{  ds_path = "test-arrow-na"}}
> {{  df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))}}
> {{  }}
> {{  df %>% arrow::write_dataset(ds_path)}}
> {{  }}
> {{  # OK: Collect then filter: returns row 3, as expected}}
> {{  arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))}}
> {{  }}
> {{  # ERROR: Filter then collect (on y) returns a tibble with no rows}}
> {{  arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()}}
> {{  }}
> {{  # OK: Filter then collect (on z) returns row 3, as expected}}
> {{  arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()}}{quote}
>  
> Thanks
> Pierre


