Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/12/09 16:45:00 UTC
[jira] [Comment Edited] (ARROW-4283) [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
[ https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645386#comment-17645386 ]
Weston Pace edited comment on ARROW-4283 at 12/9/22 4:44 PM:
-------------------------------------------------------------
Things have changed a bit since 2019. The {{RecordBatchFileReader}} has an asynchronous API now. It's currently exposed as a whole-file-reading {{AsyncGenerator}} (an iterator function that returns a promise each time you call it) via {{RecordBatchFileReader::OpenAsync}} and {{RecordBatchFileReader::GetRecordBatchGenerator}}. Under the hood, though, there are {{ReadFooterAsync}} and {{ReadRecordBatchAsync}} methods that could be exposed should more direct control be desired.
Adapting this pattern to the streaming reader should be pretty straightforward. These methods all return {{arrow::Future}}. As far as I know, no one has done the necessary work to plumb {{arrow::Future}} into a Python async API (e.g. {{asyncio}}).
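To make the generator shape concrete, here is a rough Python sketch of "a function that returns a promise each time you call it", adapted so it can be consumed with {{async for}}. Everything in it is illustrative and not part of pyarrow: {{make_batch_generator}}, {{BatchStream}} and {{collect}} are hypothetical names, strings stand in for record batches, and using {{None}} as the end-of-stream marker is an assumption of the sketch.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical Python analogue of an Arrow-style async generator:
# a plain callable that returns a new awaitable on every call.
# Resolving to None marks end-of-stream (an assumption of this sketch).
AsyncGen = Callable[[], Awaitable[Optional[str]]]


def make_batch_generator(batches: List[str]) -> AsyncGen:
    """Build a generator over in-memory stand-ins for record batches."""
    remaining = iter(batches)

    async def read_next() -> Optional[str]:
        await asyncio.sleep(0)  # stand-in for a non-blocking read
        return next(remaining, None)

    return read_next


class BatchStream:
    """Adapt the callable to Python's async-iterator protocol."""

    def __init__(self, generator: AsyncGen) -> None:
        self._generator = generator

    def __aiter__(self) -> "BatchStream":
        return self

    async def __anext__(self) -> str:
        batch = await self._generator()
        if batch is None:
            raise StopAsyncIteration
        return batch


async def collect(batches: List[str]) -> List[str]:
    """Drain the stream with `async for` and return the batches in order."""
    stream = BatchStream(make_batch_generator(batches))
    return [batch async for batch in stream]
```

With something like this, the consumer side would just be {{async for batch in BatchStream(gen)}}; the open question is where the awaitables come from, since the real ones would have to be backed by {{arrow::Future}}.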
Asynchronous methods in Arrow typically work by offloading the blocking I/O calls to a global I/O thread pool (which can have more threads than there are cores and should generally be sized appropriately for the I/O device). This keeps the CPU threads free and non-blocking. To hook this into {{asyncio}} you would probably want to call {{arrow::Future::AddCallback}} and then, in that callback, schedule a task on some kind of Python executor. In that Python executor task you will want to mark some kind of {{asyncio}} future complete, which will presumably run any needed callbacks.
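The callback hand-off described above can be sketched in pure Python. This is only an analogy under stated assumptions, not Arrow code: a {{ThreadPoolExecutor}} stands in for Arrow's I/O thread pool, {{concurrent.futures.Future.add_done_callback}} stands in for {{arrow::Future::AddCallback}}, the event loop plays the role of the "Python executor", and {{blocking_read}} / {{read_async}} are hypothetical names.

```python
import asyncio
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for Arrow's global I/O thread pool (assumption of this sketch).
IO_POOL = ThreadPoolExecutor(max_workers=8)


def blocking_read(index: int) -> str:
    """Hypothetical blocking I/O call; a real one would read a record batch."""
    return f"batch-{index}"


def read_async(index: int) -> "asyncio.Future[str]":
    """Start a read on the I/O pool and return an asyncio future for it.

    Must be called from within a running event loop.
    """
    loop = asyncio.get_running_loop()
    result: "asyncio.Future[str]" = loop.create_future()

    def complete(done: Future) -> None:
        # Runs on the event loop thread, so touching `result` is safe.
        if result.cancelled():
            return
        error = done.exception()
        if error is not None:
            result.set_exception(error)
        else:
            result.set_result(done.result())

    def on_done(done: Future) -> None:
        # Usually runs on an I/O pool thread (much like an arrow::Future
        # callback would), so hand completion over to the event loop
        # instead of completing the asyncio future from this thread.
        loop.call_soon_threadsafe(complete, done)

    IO_POOL.submit(blocking_read, index).add_done_callback(on_done)
    return result


async def main() -> list:
    """Issue three reads concurrently and gather results in request order."""
    return await asyncio.gather(*(read_async(i) for i in range(3)))
```

For {{concurrent.futures}} the standard library already provides this bridge as {{asyncio.wrap_future}}; the point of spelling it out is that an {{arrow::Future}} binding would presumably need the same thread-safe hand-off, written against the C++ callback instead.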
> [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
> ----------------------------------------------------------------
>
> Key: ARROW-4283
> URL: https://issues.apache.org/jira/browse/ARROW-4283
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Paul Taylor
> Priority: Minor
>
> Filing this issue after a discussion today with [~xhochy] about how to implement streaming pyarrow http services. I had attempted to use both Flask and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s streaming interfaces because they seemed familiar, but no dice. I have no idea how hard this would be to add -- supporting all the asynciterable primitives in JS was non-trivial.