Posted to issues@arrow.apache.org by "Oliver Reiter (Jira)" <ji...@apache.org> on 2022/06/23 20:06:00 UTC
[jira] [Created] (ARROW-16897) [R] Full join on Arrow objects is incorrect
Oliver Reiter created ARROW-16897:
-------------------------------------
Summary: [R] Full join on Arrow objects is incorrect
Key: ARROW-16897
URL: https://issues.apache.org/jira/browse/ARROW-16897
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 8.0.0
Environment: Linux
Reporter: Oliver Reiter
Hello,
I am trying to do a full join on a dataset. It returns the correct number of observations, but the result itself is incorrect: the resulting data.frame is just filled up with NA rows.
My use case: I want to include the 'full' year range for every factor value:
{code:java}
library(data.table)
library(arrow)
library(dplyr)
year_range <- 2000:2019
group_n <- 100
N <- 1000 ## number of observations; the full result should have 100 groups * 20 years = 2000 rows
dt <- data.table(value = rnorm(N),
                 group = rep(paste0("g", 1:group_n), length.out = N))
## there are only observations for some years in every group
dt[, year := sample(year_range, size = N / group_n), by = .(group)]
dt[group == "g1", ]
## this would be the 'full' data.table
group_years <- data.table(group = rep(unique(dt$group), each = length(year_range)),
                          year = rep(year_range, times = group_n))
group_years[group == "g1", ]
write_dataset(dt, path = "parquet_db")
db <- open_dataset(sources = "parquet_db")
## full_join using data.table -> expected result
db_full <- merge(dt, group_years,
                 by = c("group", "year"),
                 all = TRUE)
setorder(db_full, group, year)
db_full[group == "g1", ]
## try to do the full_join with arrow -> incorrect result
db_full_arrow <- db |>
  full_join(group_years, by = c("group", "year")) |>
  collect() |>
  setDT()
setorder(db_full_arrow, group, year)
db_full_arrow[group == "g1", ]
## or: convert data.table to arrow_table beforehand -> incorrect result
group_years_arrow <- group_years |>
  as_arrow_table()
db_full_arrow <- db |>
  full_join(group_years_arrow, by = c("group", "year")) |>
  collect() |>
  setDT()
setorder(db_full_arrow, group, year)
db_full_arrow[group == "g1", ]{code}
The [documentation|https://arrow.apache.org/docs/r/] says that equality joins are supported, so I would expect this to hold for `full_join` as well.
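To make the expected semantics concrete, here is a minimal base-R sketch with made-up data. The names `obs`, `grid` and `expected` are purely illustrative and not part of the reproduction above. It only shows what a full join on `group` and `year` should return: one row per key combination, with the key columns always filled and `value` set to NA only where no observation exists.

```r
## Tiny illustration of full-join semantics using base R only.
## All data and object names below are assumptions made for this sketch.
obs <- data.frame(group = c("g1", "g1", "g2"),
                  year  = c(2000L, 2002L, 2001L),
                  value = c(0.5, -1.2, 0.3),
                  stringsAsFactors = FALSE)

## every group/year combination that should appear in the result
grid <- expand.grid(group = c("g1", "g2"),
                    year  = 2000:2002,
                    stringsAsFactors = FALSE)

## full outer join on the two key columns
expected <- merge(obs, grid, by = c("group", "year"), all = TRUE)
expected <- expected[order(expected$group, expected$year), ]

## the key columns are never NA; only 'value' is NA where no observation exists
stopifnot(nrow(expected) == nrow(grid),
          !anyNA(expected$group),
          !anyNA(expected$year),
          sum(!is.na(expected$value)) == nrow(obs))
print(expected)
```

The same kind of `stopifnot()` check could be applied to `db_full_arrow` from the reproduction above to see how far it deviates from the `merge()` result.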
Thanks for your time and work!
Oliver