Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2022/05/31 23:13:00 UTC

[jira] [Commented] (BEAM-14540) Native implementation for serialized Rows to/from Arrow

    [ https://issues.apache.org/jira/browse/BEAM-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544641#comment-17544641 ] 

Brian Hulette commented on BEAM-14540:
--------------------------------------

To implement this, we need a native "coder" that can convert between a batch of rows encoded with beam:coder:row:v1 and an Arrow RecordBatch (or Table) object.
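
To make the shape of that conversion concrete, here is a minimal pure-Python sketch (not the proposed native implementation) of the columnar pivot, using only pyarrow and assuming the rows have already been decoded from beam:coder:row:v1 into Python dicts:

{code:python}
import pyarrow as pa

def rows_to_record_batch(rows, schema):
    # Pivot row-oriented dicts into one Arrow array per column; `schema` is a
    # pyarrow.Schema (a real coder would derive it from the Beam schema).
    arrays = [
        pa.array([row[field.name] for row in rows], type=field.type)
        for field in schema
    ]
    return pa.RecordBatch.from_arrays(arrays, schema=schema)

def record_batch_to_rows(batch):
    # Explode a columnar RecordBatch back into row-oriented dicts.
    columns = [column.to_pylist() for column in batch.columns]
    return [dict(zip(batch.schema.names, values)) for values in zip(*columns)]
{code}

A native coder would fuse the beam:coder:row:v1 encode/decode step with this pivot, so the intermediate dicts never need to be materialized.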

We would also need to add logic to detect when this conversion can be used and to insert it; [~robertwb] may have some pointers there.

> Native implementation for serialized Rows to/from Arrow
> -------------------------------------------------------
>
>                 Key: BEAM-14540
>                 URL: https://issues.apache.org/jira/browse/BEAM-14540
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Priority: P2
>
> With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage users to develop pipelines that process Arrow data within the Python context. However, we don't propose any cross-SDK changes, so when Arrow data is processed in the SDK, it must be converted to/from Rows for transmission over the Fn API. The ideal Python execution therefore looks like this:
> 1. Read row-oriented data over the Fn API and deserialize it with SchemaCoder
> 2. Buffer the rows and construct an Arrow RecordBatch/Table object
> 3. Perform the user computation
> 4. Explode the output RecordBatch/Table into rows
> 5. Serialize the rows with SchemaCoder and write them out over the Fn API
> We can improve performance for this type of flow by providing a native (cythonized) implementation of steps (1, 2) and (4, 5); a rough sketch of those steps follows.
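> As a non-authoritative illustration, here is a pure-Python sketch of steps (1, 2) and (4, 5), assuming row_coder is a coder implementing beam:coder:row:v1 (e.g. apache_beam.coders.RowCoder, whose decoded rows are taken here to be NamedTuple-like) and row_type is the matching NamedTuple class:
> {code:python}
> import pyarrow as pa
>
> def decode_and_batch(encoded_rows, row_coder, schema):
>     # Steps 1 and 2: decode each beam:coder:row:v1 payload, then buffer the
>     # rows into a columnar Arrow Table. `schema` is a pyarrow.Schema.
>     rows = [row_coder.decode(payload) for payload in encoded_rows]
>     return pa.Table.from_pylist([row._asdict() for row in rows], schema=schema)
>
> def explode_and_encode(table, row_coder, row_type):
>     # Steps 4 and 5: explode the output Table back into rows and re-encode
>     # them for transmission over the Fn API.
>     for row in table.to_pylist():
>         yield row_coder.encode(row_type(**row))
> {code}
> A cythonized implementation would do the same work without materializing the intermediate Python dicts and row objects.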


