Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/02/17 15:52:00 UTC

[jira] [Commented] (ARROW-645) [Format] Mitigating the cost of random access in "wide" record batches

    [ https://issues.apache.org/jira/browse/ARROW-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285901#comment-17285901 ] 

Antoine Pitrou commented on ARROW-645:
--------------------------------------

It's not obvious to me that we want to add that kind of complication to the IPC format. It threatens to turn the IPC format into a Parquet-like spec with niche optional features that just fragment the ecosystem and make it difficult to predict whether two endpoints will be compatible with each other.

Do people actually have such mega-wide schemas? What is the use case for having 1e6 fields in a schema?

> [Format] Mitigating the cost of random access in "wide" record batches
> ----------------------------------------------------------------------
>
>                 Key: ARROW-645
>                 URL: https://issues.apache.org/jira/browse/ARROW-645
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 4.0.0
>
>
> In very large schemas, because of the way we flatten the field and buffer metadata in the RecordBatch message:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L271
> the cost to reconstruct / load a single array from a RecordBatch can be arbitrarily high.
> As an example, let's consider a schema:
> {code}
> f0: int32
> f1: string
> ...  omitting 999,996 more fields of the same alternating pattern
> f999998: int32
> f999999: string
> {code}
> Here, a record batch has 1 million fields and 2.5 million buffers in total (each int32 field has 2 buffers, validity and data; each string field has 3: validity, offsets, and data). The problem is that to select a single field out of a record batch, we have to inspect the types of all fields leading up to the field of interest, just to know how many {{FieldNode}} and {{Buffer}} metadata values precede that field's metadata in the serialized message.
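> To make the cost concrete, here is a rough sketch (plain Python pseudo-code, not actual Arrow code) of what selecting one column entails with the current flattened layout:
> {code}
> # Rough sketch (not actual Arrow code) of the bookkeeping needed today to
> # locate one column's metadata in a flattened RecordBatch. Buffer counts
> # are for the example types: int32 -> 2 (validity, data), string -> 3
> # (validity, offsets, data). Nested types, which contribute several
> # FieldNodes, are ignored here for simplicity.
> BUFFERS_PER_TYPE = {"int32": 2, "string": 3}
>
> def locate_field(schema_types, target_index):
>     """Return (field_node_index, first_buffer_index) for the target column.
>
>     We have to walk every field *before* the target to count how many
>     FieldNode and Buffer entries precede it: O(width of the schema).
>     """
>     buffer_index = 0
>     for i in range(target_index):
>         buffer_index += BUFFERS_PER_TYPE[schema_types[i]]
>     # Flat fields contribute one FieldNode each, so the FieldNode index
>     # is simply the field index.
>     return target_index, buffer_index
>
> # f1 (a string) sits at FieldNode 1; f0 (an int32) used buffer slots 0-1,
> # so f1's buffers start at slot 2.
> print(locate_field(["int32", "string"] * 500000, 1))   # (1, 2)
> {code}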
> Solving this is a little bit tricky. One way would be to add optional "field position" and "buffer position" attributes to the {{Field}} table:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L188
> So here, we would know that for the {{f1}} field, the field index is 1 and the buffer index is 2. Because a string has 3 buffers associated with it, we would know to select buffers in slots 2, 3, 4 to reconstruct the vector container. 
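> For illustration, a similar sketch (again plain Python pseudo-code, assuming the same flat example schema) of how those precomputed positions would be used; selecting a column becomes O(1) instead of a walk over the whole schema:
> {code}
> # Rough sketch: if every Field carried precomputed "field position" and
> # "buffer position" values, written once when the schema is serialized,
> # selecting one column would no longer require walking the schema.
> BUFFERS_PER_TYPE = {"int32": 2, "string": 3}
>
> def annotate(schema_types):
>     """Compute (field_position, buffer_position) per field, once."""
>     positions = []
>     buffer_index = 0
>     for i, typ in enumerate(schema_types):
>         positions.append((i, buffer_index))
>         buffer_index += BUFFERS_PER_TYPE[typ]
>     return positions
>
> def select_buffers(positions, schema_types, target_index):
>     """O(1) lookup: the field itself says where its buffers start."""
>     _, buffer_pos = positions[target_index]
>     nbuffers = BUFFERS_PER_TYPE[schema_types[target_index]]
>     return list(range(buffer_pos, buffer_pos + nbuffers))
>
> types = ["int32", "string"] * 500000
> positions = annotate(types)
> # f1 is a string, so its 3 buffers are in slots 2, 3 and 4:
> print(select_buffers(positions, types, 1))   # [2, 3, 4]
> {code}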
> Let me know if the problem is not clear, or if you have other ideas for solutions.


