Posted to issues@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/06/25 17:33:00 UTC
[jira] [Created] (ARROW-13189) [R] Should we be handling row-level
metadata at all?
Jonathan Keane created ARROW-13189:
--------------------------------------
Summary: [R] Should we be handling row-level metadata at all?
Key: ARROW-13189
URL: https://issues.apache.org/jira/browse/ARROW-13189
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 4.0.1, 4.0.0, 3.0.0
Reporter: Jonathan Keane
To support things like sf columns, we added code that handles row-level metadata (https://github.com/apache/arrow/pull/8549 and https://github.com/apache/arrow/pull/9182).
These work fine with a single Table or a single Parquet file, but with a dataset (even without filtering!) they can produce surprising (and wrong) results (see reprex below).
Work is already underway in ARROW-12542 to make it easier to convert the row-element-level attributes to a struct and store it in the column, but that's still a bit off. Even once that's done, should we disable this entirely? Should we stop, or ignore and warn, because row-level metadata isn't applied with datasets (since there's no way for us to get the ordering right)? Something else?
{code:r}
library(arrow)
df <- tibble::tibble(
  part = rep(1:2, 13),
  let = letters
)
df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  value <- "nothing"
  attributes(value) <- list(letter = df[[i, "let"]])
  value
})
df_from_tab <- as.data.frame(Table$create(df))
# this should be (and is) "b"
attributes(df_from_tab[df_from_tab$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "b"
# the dfs are the same
waldo::compare(df, df_from_tab)
#> ✓ No differences
# now via dataset
dir <- "ds-dir"
write_dataset(df, path = dir, partitioning = "part")
ds <- open_dataset(dir)
df_from_ds <- dplyr::collect(ds)
# this should be (and is not) "b"
attributes(df_from_ds[df_from_ds$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "n"
# Even controlling for order, the dfs are not the same
waldo::compare(dplyr::arrange(df, let), dplyr::arrange(df_from_ds, let))
#> `names(old)`: "part" "let" "embedded_attr"
#> `names(new)`: "let" "embedded_attr" "part"
#>
#> `attr(old$embedded_attr[[2]], 'letter')`: "b"
#> `attr(new$embedded_attr[[2]], 'letter')`: "n"
#>
#> `attr(old$embedded_attr[[3]], 'letter')`: "c"
#> `attr(new$embedded_attr[[3]], 'letter')`: "b"
#>
#> `attr(old$embedded_attr[[4]], 'letter')`: "d"
#> `attr(new$embedded_attr[[4]], 'letter')`: "o"
#>
#> `attr(old$embedded_attr[[5]], 'letter')`: "e"
#> `attr(new$embedded_attr[[5]], 'letter')`: "c"
#>
#> `attr(old$embedded_attr[[6]], 'letter')`: "f"
#> `attr(new$embedded_attr[[6]], 'letter')`: "p"
#>
#> `attr(old$embedded_attr[[7]], 'letter')`: "g"
#> `attr(new$embedded_attr[[7]], 'letter')`: "d"
#>
#> `attr(old$embedded_attr[[8]], 'letter')`: "h"
#> `attr(new$embedded_attr[[8]], 'letter')`: "q"
#>
#> `attr(old$embedded_attr[[9]], 'letter')`: "i"
#> `attr(new$embedded_attr[[9]], 'letter')`: "e"
#>
#> `attr(old$embedded_attr[[10]], 'letter')`: "j"
#> `attr(new$embedded_attr[[10]], 'letter')`: "r"
#>
#> And 15 more differences ...
{code}
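For illustration only, here is a rough sketch of the "store it in the column" idea mentioned above, using base R and no arrow calls. It is not the ARROW-12542 implementation; the column names {{value}} and {{letter_attr}} and the reordering step are assumptions made up for this example. The point is that once the attribute lives in an ordinary column, it travels with its row however the rows are reordered.
{code:r}
# Hypothetical sketch: same shape of data as the reprex, base R only
df <- data.frame(part = rep(1:2, 13), let = letters)
df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  value <- "nothing"
  attributes(value) <- list(letter = df$let[[i]])
  value
})

# Flatten: plain values in one column, the per-element attribute in another
# (`value` and `letter_attr` are made-up names for this sketch)
flat <- data.frame(
  part = df$part,
  let = df$let,
  value = vapply(df$embedded_attr, function(x) as.vector(x), character(1)),
  letter_attr = vapply(df$embedded_attr, function(x) attr(x, "letter"), character(1))
)

# Stand-in for the reordering a partitioned dataset introduces:
# all part == 1 rows first, then all part == 2 rows
reordered <- flat[order(flat$part), ]

# Rebuild the list column from the two plain columns
reordered$embedded_attr <- lapply(seq_len(nrow(reordered)), function(i) {
  structure(reordered$value[[i]], letter = reordered$letter_attr[[i]])
})

# The attribute still matches its row after reordering
b_row <- which(reordered$let == "b")
stopifnot(identical(attr(reordered$embedded_attr[[b_row]], "letter"), "b"))
{code}
This costs an extra column per attribute and only handles attributes that fit in an atomic column, so it is a workaround sketch rather than a proposal for how the package should behave.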