You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/05/15 00:30:00 UTC
[jira] [Updated] (ARROW-12688) [R] Use DuckDB to query an Arrow
Dataset
[ https://issues.apache.org/jira/browse/ARROW-12688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-12688:
------------------------------------
Description:
DuckDB can read data from an Arrow C-interface stream. Once we can provide that struct from R, presumably DuckDB could query on that stream.
A first step is just connecting the pieces. A second step would be to handle parts of the DuckDB query and push down filtering/projection to Arrow.
We need a function something like this:
{code}
#' Run a DuckDB query on Arrow data
#'
#' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or
#' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
#' @return A `duckdb::duckdb_connection`
to_duckdb <- function(.data) {
# ARROW-12687: [C++][Python][Dataset] Convert Scanner into a RecordBatchReader
reader <- Scanner$create(.data)$ToRecordBatchReader()
# ARROW-12689: [R] Implement ArrowArrayStream C interface
stream_ptr <- allocate_arrow_array_stream()
on.exit(delete_arrow_array_stream(stream_ptr))
ExportRecordBatchReader(x, stream_ptr)
# TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
duckdb::duck_connection_from_arrow_stream(stream_ptr)
}
{code}
Assuming this existed, we could do something like (a variation of https://arrow.apache.org/docs/r/articles/dataset.html):
{code}
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
ds %>%
filter(total_amount > 100, year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = 100 * tip_amount / total_amount) %>%
to_duckdb() %>%
group_by(passenger_count) %>%
summarise(
median_tip_pct = median(tip_pct),
n = n()
)
{code}
and duckdb would do the aggregation while the data reading, predicate pushdown, filtering, and projection would happen in Arrow. Or you could do {{dbGetQuery(ds, "SOME SQL")}} and that would evaluate on arrow data.
was:
DuckDB can read data from an Arrow C-interface stream. Once we can provide that struct from R, presumably DuckDB could query on that stream.
A first step is just connecting the pieces. A second step would be to handle parts of the DuckDB query and push down filtering/projection to Arrow.
> [R] Use DuckDB to query an Arrow Dataset
> ----------------------------------------
>
> Key: ARROW-12688
> URL: https://issues.apache.org/jira/browse/ARROW-12688
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
> Priority: Major
>
> DuckDB can read data from an Arrow C-interface stream. Once we can provide that struct from R, presumably DuckDB could query on that stream.
> A first step is just connecting the pieces. A second step would be to handle parts of the DuckDB query and push down filtering/projection to Arrow.
> We need a function something like this:
> {code}
> #' Run a DuckDB query on Arrow data
> #'
> #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or
> #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
> #' @return A `duckdb::duckdb_connection`
> to_duckdb <- function(.data) {
> # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a RecordBatchReader
> reader <- Scanner$create(.data)$ToRecordBatchReader()
> # ARROW-12689: [R] Implement ArrowArrayStream C interface
> stream_ptr <- allocate_arrow_array_stream()
> on.exit(delete_arrow_array_stream(stream_ptr))
> ExportRecordBatchReader(x, stream_ptr)
> # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
> duckdb::duck_connection_from_arrow_stream(stream_ptr)
> }
> {code}
> Assuming this existed, we could do something like (a variation of https://arrow.apache.org/docs/r/articles/dataset.html):
> {code}
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> ds %>%
> filter(total_amount > 100, year == 2015) %>%
> select(tip_amount, total_amount, passenger_count) %>%
> mutate(tip_pct = 100 * tip_amount / total_amount) %>%
> to_duckdb() %>%
> group_by(passenger_count) %>%
> summarise(
> median_tip_pct = median(tip_pct),
> n = n()
> )
> {code}
> and duckdb would do the aggregation while the data reading, predicate pushdown, filtering, and projection would happen in Arrow. Or you could do {{dbGetQuery(ds, "SOME SQL")}} and that would evaluate on arrow data.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)