Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/01/31 13:16:00 UTC
[jira] [Commented] (ARROW-15489) [R] Expand RecordBatchReader use-ability

    [ https://issues.apache.org/jira/browse/ARROW-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484667#comment-17484667 ] 

Dewey Dunnington commented on ARROW-15489:
------------------------------------------

Adding a few notes here:

- RecordBatchReaders are really useful because they can be exported via the C API and can stream bigger-than-memory data sets. One vision for a future DBI abstraction is that database results could be streamed as RecordBatchReaders (which would mean better support for big database results and for column types that are lossy when they go through R). That would all have to go through the C API, in which case the RecordBatchReader will see a lot more use than it currently does.
- The {{ArrowTabular}} abstract class ( https://github.com/apache/arrow/blob/7eba11595c9753f18ac901eb3187f414a19a871c/r/R/arrow-tabular.R ) is mostly about selecting columns and filtering rows. Neither of those is a good fit for a RecordBatchReader, whose contents are stateful (i.e., touching anything except {{$schema}} changes the contents of the object). Probably the only overlap is printing the schema.
- We don't currently export the {{RecordBatchReader}} class and probably should as part of this ticket.
- The way DuckDB does its registration is via a function that, when called, produces a {{RecordBatchReader}}, which becomes the basis for the registered View. In narrow, I prototyped this as an {{as_narrow_array_stream()}} S3 generic that can take a function as input. Whether this is officially called a {{LazyArrowTabular}} or implemented as something simpler, it's an important concept.
- Python has a `RecordBatchReader.from_batches(schema, iterator_that_produces_record_batches)` method that we should implement too (useful for testing).

> [R] Expand RecordBatchReader use-ability 
> -----------------------------------------
>
>                 Key: ARROW-15489
>                 URL: https://issues.apache.org/jira/browse/ARROW-15489
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Jonathan Keane
>            Priority: Major
>
> In ARROW-14745 we thought about having {{to_arrow()}} return a RecordBatchReader only. Though this would work, it's not quite as friendly as wrapping the RecordBatchReader, since {{arrow_dplyr_query}}s have a (slightly) nicer print method.
> We should add more methods and a print method that makes it clearer what a RecordBatchReader is and what it might be useful for (e.g., continuing a dplyr query).
> Is it possible to make up a name/class that encompasses all of the Arrow tabular-like things and wrap all of these up in it (for UX purposes only, really)? We have ArrowTabular now; maybe we lean into that more (alongside a LazyArrowTabular, like dbplyr has?).


