Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/12/17 16:00:00 UTC
[jira] [Commented] (ARROW-14266) [R] Use WriteNode to write queries
[ https://issues.apache.org/jira/browse/ARROW-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461530#comment-17461530 ]
Dewey Dunnington commented on ARROW-14266:
------------------------------------------
I'd be happy to take a look at this, but I need a bit more background on what changes you envision and (approximately) which parts of the code they would touch.
Some example code with a simple join and aggregation + write_dataset:
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(a = letters, b = 1:26)
df2 <- data.frame(b = 1:5, c = LETTERS[1:5])

# Temporary output locations, one per query
tf1 <- tempfile()
tf2 <- tempfile()

# Query with a join, written straight to a dataset
record_batch(df2) %>%
  left_join(df1) %>%
  write_dataset(tf1)

open_dataset(tf1) %>%
  collect()
#>   b c a
#> 1 1 A a
#> 2 2 B b
#> 3 3 C c
#> 4 4 D d
#> 5 5 E e

# Query with an aggregation, written straight to a dataset
record_batch(df1) %>%
  summarise(col = mean(b)) %>%
  write_dataset(tf2)

open_dataset(tf2) %>%
  collect()
#> # A tibble: 1 × 1
#>     col
#>   <dbl>
#> 1  13.5
{code}
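For reference, a rough sketch of the evaluate-then-write pattern that the issue description below refers to. It only uses the public compute() and write_dataset() functions to illustrate the two steps and is an assumption about the behavior, not the actual internals of write_dataset(); tf3 is a made-up output path for this sketch.
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(a = letters, b = 1:26)
tf3 <- tempfile()  # hypothetical output path, used only in this sketch

# Build a lazy query containing an aggregation; nothing is evaluated yet.
query <- record_batch(df1) %>%
  summarise(col = mean(b))

# Step 1: evaluate the query. The whole result is now held in memory.
result <- compute(query)

# Step 2: write the already-materialized result out as a dataset.
write_dataset(result, tf3)

open_dataset(tf3) %>%
  collect()
{code}
With a WriteNode, the hope would presumably be that step 1 is no longer needed as a separate in-memory stage.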
> [R] Use WriteNode to write queries
> ----------------------------------
>
> Key: ARROW-14266
> URL: https://issues.apache.org/jira/browse/ARROW-14266
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Priority: Major
> Labels: query-engine
> Fix For: 7.0.0
>
>
> Following ARROW-13542. Any query that has a join or an aggregation currently has to first evaluate the query and hold it in memory before creating a Scanner to write it. We could improve that by using a WriteNode inside write_dataset() (and maybe that improves the other cases too, or at least allows us to delete some code).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)