You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/09/27 11:52:20 UTC

[GitHub] [arrow] thisisnic opened a new issue, #37902: [R] Support joining using `NA` as join key

thisisnic opened a new issue, #37902:
URL: https://github.com/apache/arrow/issues/37902

   ### Describe the enhancement requested
   
   As described in https://github.com/apache/arrow/issues/14907#issuecomment-1735674700.  Reprex copied below.
   
   ```r
   library(dplyr)
   library(arrow)
   
   tbl1 <- tibble::tibble(
     a = 1:3,
     b = c("a", "b", NA),
     d = c(letters[4:6])
   )
   
   tbl2 <- tibble::tibble(
     b = c("b", NA),
     c = c("a should be 2", "a should be 3")
   )
   
   
   # Left join tibbles, NAs matched
   left_join(tbl2, tbl1)
   #> Joining with `by = join_by(b)`
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   
   
   # Left join arrow table & tibble, NAs NOT matched
   left_join(as_arrow_table(tbl2), tbl1) %>% collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3    NA <NA>
   
   
   # Left join arrow table & arrow table, NAs NOT matched
   left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3    NA <NA>
   ```
   
   The problem here is the question of "can we join on `NA` values in arrow?"
   
   Not right now!  But, here are a couple of workarounds.  The first uses extra code, and the second passes the data to duckdb and back.
   
   
   ``` r
   library(arrow)
   library(dplyr)
   
   tbl1 <- tibble::tibble(
     a = 1:3,
     b = c("a", "b", NA),
     d = c(letters[4:6])
   )
   
   tbl2 <- tibble::tibble(
     b = c("b", NA),
     c = c("a should be 2", "a should be 3")
   )
   
   as_arrow_table(tbl2) |>
     # replace NAs in tbl2 with alternative value
     mutate(b = ifelse(is.na(b), "temp_value", b)) |>
     left_join(
       as_arrow_table(tbl1) |>
         # replace NAs in tbl1 with alternative value
         mutate(b = ifelse(is.na(b), "temp_value", b))
     ) |>
     # replace alternative value in results with NA
     mutate(b = ifelse(b == "temp_value", NA, b)) |>
     collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   ```
   
   Or with DuckDB:
   
   ```
   tbl1_duckdb <- as_arrow_table(tbl1) |>
     to_duckdb()
   
   as_arrow_table(tbl2) |>
     to_duckdb() |>
     left_join(tbl1_duckdb, na_matches = "na") |>
     collect()
   #> Joining with `by = join_by(b)`
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   ```
   
   Note that we have to pass in `na_matches = "na"` explicitly in the example there as the default value when working with duckdb/dbplyr is "never" - basically reflecting that in SQL we can't join on NULL (NA) values.
   
   I also had a look to try to work out whether we can implement this in Arrow or not.  dbplyr implements a function `sql_expr_matches()` which is what allows matching on NAs, and here's a snippet from it:
   
   ```
   sql_expr_matches.DBIConnection <- function(con, x, y, ...) {
     glue_sql2(
       con,
       "CASE WHEN ({x} = {y}) OR ({x} IS NULL AND {y} IS NULL) ",
       "THEN 0 ",
       "ELSE 1 ",
       "END = 0"
     )
   }
   ```
   
   The C++ changes in #11579 *may* allow us to implement something like this in arrow - the unit tests in that PR certainly make it look feasible, though non-trivial.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org