You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/06/30 04:00:00 UTC
[jira] [Commented] (ARROW-16897) [R][C++] Full join on Arrow objects is incorrect

    [ https://issues.apache.org/jira/browse/ARROW-16897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560809#comment-17560809 ] 

Weston Pace commented on ARROW-16897:
-------------------------------------

I believe I have found the root cause.  I'll give some more detail.  In Acero a join node has a left output and a right output.  Key columns can be output on both the left and the right side.  In an inner join this just leads to duplicate data.  In an outer join it's a bit more subtle.

For example, consider the inner join of the above (years 2000 - 2005 but only 2005 has a match in both):
||left.group||left.year||right.group||right.year||value||
|g1|2005|g1|2005|-1.008|

In an inner join you only get rows that have matches in both tables and so the two key columns will always be identical.  However, in an outer join, things are different.  Here is the same outer join for the above five years:
||left.group||left.year||right.group||right.year||value||
|NA|NA|g1|2000|NA|
|NA|NA|g1|2001|NA|
|NA|NA|g1|2002|NA|
|NA|NA|g1|2003|NA|
|NA|NA|g1|2004|NA|
|g1|2005|g1|2005|-1.008|

Note that it would be possible to get values in the left keys with NA in the right keys if the right table was missing data (in this particular example the right table has the full set of keys and so it is never missing rows).

Acero is actually giving you more information that you would normally get out of a standard SQL server.  In a standard SQL server you would never know whether a row simple did not have a match or the row matched but all the payload columns were null.  And, if a row didn't have a match and all the payload columns were null you wouldn't know which side had they key columns.  Still, this information isn't generally useful and one could portray this as calling a bug a feature :)

R currently deals with this duplication of key columns by always excluding the key columns from the right output.  So, given the above, R throws away {{right.group}} and {{left.group}} and you end up with:

 
||left.group||left.year||value||
|NA|NA|NA|
|NA|NA|NA|
|NA|NA|NA|
|NA|NA|NA|
|NA|NA|NA|
|g1|2005|-1.008|

I'm not sure if Acero's join node is dropping the rows with all nulls or if that is happening later on in some further node but these aren't very useful rows anyways.  If I reverse R's logic so that left_output is "let fields minus keys" and right_output is "right fields" then I get the correct result.

However, the correct answer is not to always drop the keys from the right side or to always drop the keys from the left side but instead to always coalesce the key columns.

 

Fortunately, we have a PR that adds this as an option (append_consolidated_key) to the hash join node options and will take care of the coalescing for you: [https://github.com/michalursa/arrow/pull/7/files#diff-985b684376a17de896e07a2ff2362934dbf8fa2483edeaeff6910907f6f96733]

 

Unfortunately, this PR was built on top of the improved hash-join node which hasn't yet merged.

> [R][C++] Full join on Arrow objects is incorrect
> ------------------------------------------------
>
>                 Key: ARROW-16897
>                 URL: https://issues.apache.org/jira/browse/ARROW-16897
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 8.0.0
>         Environment: Linux
>            Reporter: Oliver Reiter
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: joins, query-engine
>             Fix For: 9.0.0
>
>
> Hello,
> I am trying to do a full join on a dataset. It produces the correct number of observations, but not the correct result (the resulting data.frame is just filled up with NA-rows).
> My use case: I want to include the 'full' year range for every factor value:
> {code:java}
> library(data.table)
> library(arrow)
> library(dplyr)
> year_range <- 2000:2019
> group_n <- 100
> N <- 1000 ## the resulting data should have 100 groups * 20 years
> dt <- data.table(value = rnorm(N),
>                  group = rep(paste0("g", 1:group_n), length.out = N))
> ## there are only observations for some years in every group
> dt[, year := sample(year_range, size = N / group_n), by = .(group)]
> dt[group == "g1", ]
> ## this would be the 'full' data.table
> group_years <- data.table(group = rep(unique(dt$group), each = 20),
>                           year = rep(year_range, times = 10))
> group_years[group == "g1", ]
> write_dataset(dt, path = "parquet_db")
> db <- open_dataset(sources = "parquet_db")
> ## full_join using data.table -> expected result
> db_full <- merge(dt, group_years,
>                  by = c("group", "year"),
>                  all = TRUE)
> setorder(db_full, group, year)
> db_full[group == "g1", ]
> ## try to do the full_join with arrow -> incorrect result
> db_full_arrow <- db |>
>   full_join(group_years, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]
> ## or: convert data.table to arrow_table beforehand -> incorrect result
> group_years_arrow <- group_years |>
>   as_arrow_table()
> db_full_arrow <- db |>
>   full_join(group_years_arrow, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]{code}
> The [documentation|https://arrow.apache.org/docs/r/] says equality joins are supported, which should hold also for `full_join` I guess?
> Thanks for your time and work!
>  
> Oliver



--
This message was sent by Atlassian Jira
(v8.20.10#820010)