Posted to issues@arrow.apache.org by "Oliver Reiter (Jira)" <ji...@apache.org> on 2022/06/23 20:06:00 UTC

[jira] [Created] (ARROW-16897) [R] Full join on Arrow objects is incorrect

Oliver Reiter created ARROW-16897:
-------------------------------------

             Summary: [R] Full join on Arrow objects is incorrect
                 Key: ARROW-16897
                 URL: https://issues.apache.org/jira/browse/ARROW-16897
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 8.0.0
         Environment: Linux
            Reporter: Oliver Reiter


Hello,

I am trying to do a full join on an Arrow dataset. It returns the correct number of rows, but not the correct result: the resulting data.frame is just padded with rows of NA.
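
A quick way to check for this symptom is to count the rows whose join keys came back as NA. The following is only a minimal sketch: the helper name `count_na_keys` is hypothetical, and `toy` is a hand-made stand-in that mimics the symptom, not actual Arrow output.

```r
## Count the rows in which at least one join key is NA.
## (hypothetical helper, not part of arrow, dplyr or data.table)
count_na_keys <- function(df, keys) {
  ## as.data.frame() so the same code works for data.frame and data.table
  key_na <- is.na(as.data.frame(df)[, keys, drop = FALSE])
  sum(rowSums(key_na) > 0)
}

## Toy stand-in for a join result (made up, NOT real Arrow output)
toy <- data.frame(group = c("g1", NA, "g2", NA),
                  year  = c(2000L, NA, 2001L, NA),
                  value = c(0.1, NA, 0.2, NA))

n_bad <- count_na_keys(toy, c("group", "year"))
n_bad
```

For a correct full join on `group` and `year`, I would expect this count to be zero as long as neither input contains NA keys.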

My use case: I want to include the 'full' year range for every group:
{code:java}
library(data.table)
library(arrow)
library(dplyr)

year_range <- 2000:2019
group_n <- 100
N <- 1000 ## number of observations; the full result should have 100 groups * 20 years = 2000 rows

dt <- data.table(value = rnorm(N),
                 group = rep(paste0("g", 1:group_n), length.out = N))
## there are only observations for some years in every group
dt[, year := sample(year_range, size = N / group_n), by = .(group)]
dt[group == "g1", ]

## this would be the 'full' data.table
## one row per group and year, with both columns built to the full length
group_years <- data.table(group = rep(unique(dt$group), each = length(year_range)),
                          year = rep(year_range, times = group_n))
group_years[group == "g1", ]

write_dataset(dt, path = "parquet_db")
db <- open_dataset(sources = "parquet_db")

## full_join using data.table -> expected result
db_full <- merge(dt, group_years,
                 by = c("group", "year"),
                 all = TRUE)
setorder(db_full, group, year)
db_full[group == "g1", ]

## try to do the full_join with arrow -> incorrect result
db_full_arrow <- db |>
  full_join(group_years, by = c("group", "year")) |>
  collect() |>
  setDT()
setorder(db_full_arrow, group, year)
db_full_arrow[group == "g1", ]

## or: convert data.table to arrow_table beforehand -> incorrect result
group_years_arrow <- group_years |>
  as_arrow_table()
db_full_arrow <- db |>
  full_join(group_years_arrow, by = c("group", "year")) |>
  collect() |>
  setDT()
setorder(db_full_arrow, group, year)
db_full_arrow[group == "g1", ]
{code}
The [documentation|https://arrow.apache.org/docs/r/] says equality joins are supported, which I assume should also hold for `full_join`?
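
For reference, here is a minimal sketch of the result I would expect from a full join, using base R `merge()` on toy data. The data frames `obs` and `grid` are made up for illustration and are much smaller than the example above.

```r
## Toy data (hypothetical): observations for only some group/year pairs
obs <- data.frame(group = c("g1", "g1", "g2"),
                  year  = c(2000L, 2002L, 2001L),
                  value = c(0.5, -1.2, 0.3))

## Every group/year combination that should appear in the result
grid <- expand.grid(group = c("g1", "g2"),
                    year  = 2000:2002,
                    stringsAsFactors = FALSE)

## Full join: keep every key combination from both sides
full <- merge(obs, grid, by = c("group", "year"), all = TRUE)
full <- full[order(full$group, full$year), ]

nrow(full)              # 6: 2 groups * 3 years
anyNA(full$group)       # FALSE: keys are never NA
anyNA(full$year)        # FALSE
sum(is.na(full$value))  # 3: only the unmatched rows have NA values
```

The key columns are filled for every row, and only `value` is NA for the group/year pairs that have no observation.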

Thanks for your time and work!

 

Oliver



--
This message was sent by Atlassian Jira
(v8.20.7#820007)