You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "nealrichardson (via GitHub)" <gi...@apache.org> on 2023/04/06 18:49:26 UTC

[GitHub] [arrow] nealrichardson commented on pull request #34825: GH-34775: [R] arrow_table: as.data.frame() sometimes returns a tbl and sometimes a data.frame

nealrichardson commented on PR #34825:
URL: https://github.com/apache/arrow/pull/34825#issuecomment-1499479934

   Apologies for being late to the discussion here, and forgive me if I'm missing something, but it seems like the odd behavior @thisisnic reported is a more narrow problem of how the auto-splicing works in `arrow_table() / Table$create()`. That was added in #4704, and there was a TODO issue made about improving the C++ code around it (#22186), though I don't think this particular issue was what they had in mind. This seems worth fixing.
   
   But I'm not sure there's a more general problem to solve; or rather, I'm not sure there's a better solution than the one we have that meets our requirements. It seems like the constraints we're solving for, ranked by priority, are:
   
   1. `as.data.frame(ArrowObject)` should return an object where `is.data.frame()` is `TRUE` and can be used by any other R functions. No Arrow methods required to work with it.
   2. Fidelity: when a user saves an R data object to Arrow or Parquet, they should get the same R object back when they load it, including metadata.
   3. Internal coherence: the same kind of operation in various places should yield the same kind of result.
   4. `tbl`s are useful because they print nicer than vanilla `data.frame`s.
   5. Because we want Arrow to have broad adoption and appeal, we want to avoid adding hard dependencies.
   
   Our current approach is: when converting a Table to a data.frame, we [set the class](https://github.com/apache/arrow/blob/main/r/src/array_to_vector.cpp#L1386) as [`c("tbl_df", "tbl", "data.frame")`](https://github.com/apache/arrow/blob/main/r/src/symbols.cpp#L60). instead of just "data.frame" But [if there is R metadata](https://github.com/apache/arrow/blob/main/r/R/metadata.R#L99) saved that overwrites the class, we use that so that we can faithfully round-trip R objects. Leaving aside the autosplice issue, I think that's all that's happening. 
   
   We're relying on how S3 dispatch works, such that if you don't have `tibble` loaded and print a `c("tbl_df", "tbl", "data.frame")`, it just prints like a data.frame, so we can both (a) return tbls and (b) not depend on `tibble` just fine. So we can get `#4` and `#5`, but because of `#2`, we sacrifice a small amount of `#3`: we get the weaker `inherits(x, "data.frame")` requirement and not `identical(class(x), "data.frame")`. I'm of the opinion that this is a reasonable tradeoff since tibbles are just subclasses of data.frame--you always get back a data.frame. We could push back the other way, but I think we have to give up some `#4` as a result. 
   
   To be clear, I think you/we could justify any number of tradeoffs along these dimensions, I just wanted to sketch out what I think they are.
   
   A few other specific thoughts:
   
   * I'm not sure you can make `as.data.frame()` always return `class == "data.frame"` without either sacrificing roundtrip fidelity or inventing a new function that means "take this Arrow Table with metadata and make an R object". The latter doesn't sound like something we want to do.
   * The `read_xxx()` functions can't really be strict about `class` without losing roundtrip fidelity. For me, that's enough to say `inherits(x, "data.frame")` is sufficient and we don't have to always return identical class. 
   * Seems like `dplyr::collect()` should always return tibble, regardless of other changes you make (I believe it already does, given how the ExecPlan evaluation works).
   
   Additional historical context, in case it's useful in understanding how we got here:
   
   * https://github.com/apache/arrow/pull/4454
   * https://github.com/apache/arrow/pull/5399
   * https://github.com/apache/arrow/issues/22715
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org