You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/03 15:47:07 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #10191: [R] [WIP] Use InMemoryDataset for Table/RecordBatch in dplyr code

bkietz commented on a change in pull request #10191:
URL: https://github.com/apache/arrow/pull/10191#discussion_r625161974



##########
File path: r/tests/testthat/test-dplyr-mutate.R
##########
@@ -344,20 +344,21 @@ test_that("print a mutated table", {
       select(int) %>%
       mutate(twice = int * 2) %>%
       print(),
-'Table (query)
+'InMemoryDataset (query)
 int: int32
 twice: expr
 
 See $.data for the source Arrow object',
   fixed = TRUE)
 
   # Handling non-expressions/edge cases
+  skip("InMemoryDataset$Project() doesn't accept array (or could it?)")
   expect_output(
     Table$create(tbl) %>%
       select(int) %>%
       mutate(again = 1:10) %>%

Review comment:
       This isn't supported because (for most datasets) we don't know the number of rows before scanning, so it's unclear if augmenting the table with an explicit column is valid. It's *possible* to add support for projections like this to Expression but it'd mean they were no longer strictly scalar (since they'd only be valid in the case where the result of the scan happened to have the expected number of rows). This would complicate scanning drastically, since we'd have to accommodate array literals in each scanned batch
   
   I'd expect that augmenting with an explicit array is only done with small, cheap scans? If so, one way to work around this would be to auto-`collect()` the preceding steps and handle the explicit array with `Table$AddColumn`. In any case it might be worthwhile to maintain an `is_scan_of_in_memory_data` flag of the dplyr query as a hint of whether auto `collect()`ion is reasonable
   
   @westonpace what do you think?

##########
File path: r/tests/testthat/test-dplyr.R
##########
@@ -501,6 +503,7 @@ test_that("explicit type conversions with as.*()", {
 })
 
 test_that("as.factor()/dictionary_encode()", {
+  skip("ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}")

Review comment:
       Adding support for dictionary encoding is possible but it will be subtle since it will be the first Expression with mutable state to share between threads. I've added https://issues.apache.org/jira/browse/ARROW-12632 to track this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org