Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2021/12/17 16:00:00 UTC

[jira] [Commented] (ARROW-14266) [R] Use WriteNode to write queries

    [ https://issues.apache.org/jira/browse/ARROW-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461530#comment-17461530 ] 

Dewey Dunnington commented on ARROW-14266:
------------------------------------------

I'd be happy to take a look at this, but I need a bit more background on which changes you envision and (approximately) which parts of the code they would touch.

Some example code with a simple join and an aggregation, each followed by write_dataset():

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(a = letters, b = 1:26)
df2 <- data.frame(b = 1:5, c = LETTERS[1:5])

tf1 <- tempfile()
tf2 <- tempfile()

# Query with a join, written straight to a dataset
record_batch(df2) %>%
  left_join(df1) %>%
  write_dataset(tf1)

open_dataset(tf1) %>% 
  collect()
#>   b c a
#> 1 1 A a
#> 2 2 B b
#> 3 3 C c
#> 4 4 D d
#> 5 5 E e


# Query with an aggregation, written straight to a dataset
record_batch(df1) %>%
  summarise(col = mean(b)) %>%
  write_dataset(tf2)

open_dataset(tf2) %>% 
  collect()
#> # A tibble: 1 × 1
#>     col
#>   <dbl>
#> 1  13.5
{code}


> [R] Use WriteNode to write queries
> ----------------------------------
>
>                 Key: ARROW-14266
>                 URL: https://issues.apache.org/jira/browse/ARROW-14266
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>              Labels: query-engine
>             Fix For: 7.0.0
>
>
> Following ARROW-13542. Any query that has a join or an aggregation currently has to first evaluate the query and hold it in memory before creating a Scanner to write it. We could improve that by using a WriteNode inside write_dataset() (and maybe that improves the other cases too, or at least allows us to delete some code). 
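
The two-step pattern the issue description refers to (evaluate the query into memory, then write the result) can be sketched by hand as below. This is only an illustration of the shape of that behaviour using an explicit compute() call; it is an assumption for the sake of the example, not the actual internals of write_dataset(), and the names query, result and out_dir are made up here.

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(a = letters, b = 1:26)
out_dir <- tempfile()  # hypothetical output location

# Build a lazy query containing an aggregation; nothing is evaluated yet.
query <- record_batch(df1) %>%
  summarise(col = mean(b))

# Step 1: evaluate the whole query and hold the result in memory as a Table.
result <- compute(query)

# Step 2: write the already-materialized Table out as a dataset.
write_dataset(result, out_dir)

# Read it back to check what was written.
open_dataset(out_dir) %>%
  collect()
{code}

With a WriteNode, as the issue proposes, step 1 would presumably no longer need to materialize the full result before the write starts.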


