Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/03/25 17:56:00 UTC
[jira] [Updated] (ARROW-10882) [Python][Dataset] Writing dataset from python iterator of record batches
[ https://issues.apache.org/jira/browse/ARROW-10882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-10882:
-----------------------------------
Labels: dataset pull-request-available (was: dataset)
> [Python][Dataset] Writing dataset from python iterator of record batches
> ------------------------------------------------------------------------
>
> Key: ARROW-10882
> URL: https://issues.apache.org/jira/browse/ARROW-10882
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: David Li
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> At the moment, from python you can write a dataset with {{ds.write_dataset}} for example starting from a *list* of record batches.
> But this currently needs to be an actual list (or gets converted to one), so an iterator or generator gets fully consumed (potentially bringing all the record batches into memory) before writing starts.
> We should also be able to use the python iterator itself to back a {{RecordBatchIterator}}-like object, that can be consumed while writing the batches.
> We already have a {{arrow::py::PyRecordBatchReader}} that might be useful here.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)