You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2021/02/17 19:56:00 UTC

[jira] [Commented] (ARROW-11679) [R] Optimal arrow queries for benchmarking

    [ https://issues.apache.org/jira/browse/ARROW-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286117#comment-17286117 ] 

Wes McKinney commented on ARROW-11679:
--------------------------------------

I would suggest waiting until these queries can be evaluated natively against Arrow data otherwise you're largely looking at dplyr benchmarks. We're working on implementing the algorithms so it could be anywhere from 3-6 months from now, maybe sooner. 

> [R] Optimal arrow queries for benchmarking
> ------------------------------------------
>
>                 Key: ARROW-11679
>                 URL: https://issues.apache.org/jira/browse/ARROW-11679
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Benchmarking, R
>            Reporter: Jan Gorecki
>            Priority: Major
>
> Hi
> Hello,
> We are running a continuous benchmarking project (https://h2oai.github.io/db-benchmark). In recent days we added Arrow project.
> It uses R's dplyr and ArrowTable as backend. Queries have been written based on arrow R package documentation.
> There are 10 grouping queries:
> ```
> # q1: sum v1 by id1
> AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE))
> # q2: sum v1 by id1:id2
> AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE))
> # q3: sum v1 mean v3 by id3
> AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE))
> # q4: mean v1:v3 by id4
> AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% summarise_at(.funs=\"mean\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
> # q5: sum v1:v3 by id6
> AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect () %>% summarise_at(.funs=\"sum\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
> # q6: median v3 sd v3 by id4 id5
> AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE))
> # q7: max v1 - min v2 by id3
> AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% summarise(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE))
> # q8: largest two v3 by id6
> AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) %>% compute()
> # q9: regression v1 v2 by id2 id4
> AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% summarise(r2=cor(v1, v2, use=\"na.or.complete\")^2)
> # q10: sum v3 count by id1:id6
> AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n())
> ```
> Full benchmark script can be found at https://github.com/h2oai/db-benchmark/blob/master/arrow/groupby-arrow.R
> ----
> As per my understanding, all above queries (maybe excluding query 8) will not utilize any arrow computation, as of now. It is because those operations are not yet implemented in arrow, and they are falling back to dplyr implementation.
> According to Neal's presentation I watched recently, code written now will over time get improved by improvements in arrow implementation. Continuous benchmark I am working on upgrades software automatically, therefore I would like to use the fact to write code now, and have it faster in future, as arrow implementation progresses. I believe the mentioned queries will not satisfy that, because of `collect()` call in the middle. AFAIU it needs a `compute()` call at the end instead (like now in query 8).
> Is there a way to write this code to be optimal now, and also optimal in future. Similarly as presented by Neal in his presentation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)