Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/11/14 16:11:01 UTC

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

    [ https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251621#comment-16251621 ] 

Wes McKinney commented on ARROW-1808:
-------------------------------------

[~kou] I am going to start looking at this soon, since it may cause a little bit of disruption in the glib bindings. This is a moderately disruptive API change, but long-term it will be for the best. The idea is that the current {{arrow::RecordBatch}} is a "simple in-memory record batch". But the object-boxing requirements to produce a vector of {{std::shared_ptr<arrow::ArrayData>}} can be quite expensive for large record batches. 

Instead, we could make {{arrow::RecordBatch}} an abstract interface with virtual functions for column access, with the current incarnation of RecordBatch as a subclass. We could then also create an {{arrow::IpcRecordBatch}} that does late materialization of the {{arrow::Array}} objects. So if you have 1000 columns, you do not pay the cost of creating array objects for all of them if you only end up accessing a few columns in some analytics algorithm.
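
To make the idea concrete, here is a rough sketch of what such an interface split might look like. It does not use the real Arrow headers: {{Array}} is a stand-in struct, and {{SimpleRecordBatch}}, {{LazyRecordBatch}} and {{materialized()}} are hypothetical names chosen for illustration only, not a proposed API. The lazy variant also ignores thread safety.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for arrow::Array (assumption: the real class wraps typed buffers).
struct Array {
  std::string name;
  std::size_t length;
};

// Abstract interface: column access is virtual, so each subclass decides
// when the Array objects are actually created.
class RecordBatch {
 public:
  virtual ~RecordBatch() = default;
  virtual int num_columns() const = 0;
  virtual std::shared_ptr<Array> column(int i) const = 0;
};

// Eager variant: every Array object exists up front, which is roughly
// what the current in-memory record batch does.
class SimpleRecordBatch : public RecordBatch {
 public:
  explicit SimpleRecordBatch(std::vector<std::shared_ptr<Array>> columns)
      : columns_(std::move(columns)) {}

  int num_columns() const override {
    return static_cast<int>(columns_.size());
  }
  std::shared_ptr<Array> column(int i) const override { return columns_[i]; }

 private:
  std::vector<std::shared_ptr<Array>> columns_;
};

// Lazy variant: a column is boxed into an Array only on first access and
// then cached. Not thread-safe; a real implementation would need locking.
class LazyRecordBatch : public RecordBatch {
 public:
  LazyRecordBatch(std::vector<std::string> names, std::size_t length)
      : names_(std::move(names)), length_(length), cache_(names_.size()) {}

  int num_columns() const override {
    return static_cast<int>(names_.size());
  }
  std::shared_ptr<Array> column(int i) const override {
    if (!cache_[i]) {
      cache_[i] = std::make_shared<Array>(Array{names_[i], length_});
      ++materialized_;
    }
    return cache_[i];
  }

  // Hypothetical helper: how many columns have been boxed so far.
  int materialized() const { return materialized_; }

 private:
  std::vector<std::string> names_;  // declared before cache_, which uses its size
  std::size_t length_;
  mutable std::vector<std::shared_ptr<Array>> cache_;
  mutable int materialized_ = 0;
};
```

With a 1000-column {{LazyRecordBatch}}, code that only calls {{column(3)}} ends up creating a single {{Array}} object, while callers written against the {{RecordBatch}} interface do not need to know which variant they hold.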

> [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1808
>                 URL: https://issues.apache.org/jira/browse/ARROW-1808
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>             Fix For: 0.8.0
>
>
> This should be looked at soon to avoid having to define a different virtual interface for record batches. There are places where we are using the record batch constructor directly, and in some third-party code (like MapD), so this might be good to get done for 0.8.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)