Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/10/06 16:48:00 UTC

[jira] [Closed] (ARROW-16897) [R][C++] Full join on Arrow objects is incorrect

     [ https://issues.apache.org/jira/browse/ARROW-16897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones closed ARROW-16897.
------------------------------
    Resolution: Duplicate

> [R][C++] Full join on Arrow objects is incorrect
> ------------------------------------------------
>
>                 Key: ARROW-16897
>                 URL: https://issues.apache.org/jira/browse/ARROW-16897
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 8.0.0, 9.0.0
>         Environment: Linux
>            Reporter: Oliver Reiter
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: joins, query-engine
>             Fix For: 10.0.0
>
>
> Hello,
> I am trying to do a full join on a dataset. It produces the correct number of observations, but not the correct result (the resulting data.frame is just filled up with NA-rows).
> My use case: I want to include the 'full' year range for every factor value:
> {code:java}
> library(data.table)
> library(arrow)
> library(dplyr)
> year_range <- 2000:2019
> group_n <- 100
> N <- 1000 ## the resulting data should have 100 groups * 20 years
> dt <- data.table(value = rnorm(N),
>                  group = rep(paste0("g", 1:group_n), length.out = N))
> ## there are only observations for some years in every group
> dt[, year := sample(year_range, size = N / group_n), by = .(group)]
> dt[group == "g1", ]
> ## this would be the 'full' data.table
> group_years <- data.table(group = rep(unique(dt$group), each = length(year_range)),
>                           year = rep(year_range, times = group_n))
> group_years[group == "g1", ]
> write_dataset(dt, path = "parquet_db")
> db <- open_dataset(sources = "parquet_db")
> ## full_join using data.table -> expected result
> db_full <- merge(dt, group_years,
>                  by = c("group", "year"),
>                  all = TRUE)
> setorder(db_full, group, year)
> db_full[group == "g1", ]
> ## try to do the full_join with arrow -> incorrect result
> db_full_arrow <- db |>
>   full_join(group_years, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]
> ## or: convert data.table to arrow_table beforehand -> incorrect result
> group_years_arrow <- group_years |>
>   as_arrow_table()
> db_full_arrow <- db |>
>   full_join(group_years_arrow, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]{code}
> The [documentation|https://arrow.apache.org/docs/r/] says equality joins are supported, which should also hold for `full_join`, I guess?
> Thanks for your time and work!
>  
> Oliver


