Posted to dev@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/11/22 02:09:58 UTC

[jira] [Comment Edited] (ARROW-384) Align Java and C++ RecordBatch data and metadata layout

    [ https://issues.apache.org/jira/browse/ARROW-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685423#comment-15685423 ] 

Wes McKinney edited comment on ARROW-384 at 11/22/16 2:08 AM:
--------------------------------------------------------------

This seems reasonable, and saves you from always requiring the metadata size. 

If we look at the flatbuffers IDL for the RecordBatch and Buffer metadata (https://github.com/apache/arrow/blob/master/format/Message.fbs#L210), it says the offset is "The relative offset into the shared memory page where the bytes for this buffer starts". This is somewhat rigid because it means that file-like record batches are not easily relocatable -- by that definition, the offset would need to be the position relative to the start of the file, not relative to the start of the record batch. This is what the C++ implementation is doing now. 

Here's my idea: add an enum flag to RecordBatch that indicates whether the buffer offsets are absolute (relative to the start of the file or shared memory block) or relative to a contiguous blob of bytes (what the Java file implementation is doing now). The latter is not necessarily good for shared memory because it presumes the buffers are contiguous, but it does make record batches relocatable when they are contiguous (e.g. in a file-like setting).
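
As a rough sketch of what such a flag could mean for a reader (an illustration only, not the actual Arrow API): the names OffsetMode, resolve_offset, read_buffer and body_start below are hypothetical and do not appear in Message.fbs. The only difference between the two modes is whether the start of the record batch body is added to the stored offset.

```python
from enum import Enum


class OffsetMode(Enum):
    """Hypothetical flag; not a field in Message.fbs."""
    ABSOLUTE = 0  # offset is measured from the start of the file / memory block
    RELATIVE = 1  # offset is measured from the start of the record batch body


def resolve_offset(mode, buffer_offset, body_start):
    """Return the buffer position measured from the start of the file or block."""
    if mode is OffsetMode.ABSOLUTE:
        return buffer_offset
    return body_start + buffer_offset


def read_buffer(data, mode, buffer_offset, length, body_start):
    """Slice one buffer out of `data` (any bytes-like object)."""
    start = resolve_offset(mode, buffer_offset, body_start)
    return data[start:start + length]
```

With relative offsets, the same (offset, length) pair stays valid when the batch body is copied to a different position, as long as the reader knows where the body starts. With absolute offsets, every stored offset would have to be rewritten after such a move.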

Relocatable record batch metadata / fully relative offsets is also better for RPC / socket-based exchange (which is effectively the same as sending a segment of the current "file format"), so that's another argument for adding that as an option.

I don't think I can make an argument that either absolute (needed for general shared memory IPC) or relative (better for file / RPC) offsets should be the only option available to the exclusion of the other. 


was (Author: wesmckinn):
This seems reasonable, and saves you from always requiring the metadata size. 

If we look at the flatbuffers IDL for the RecordBatch and Buffer metadata (https://github.com/apache/arrow/blob/master/format/Message.fbs#L210), it says the offset is "The relative offset into the shared memory page where the bytes for this buffer starts". This is somewhat rigid because it means that file-like record batches are not easily relocatable -- by that definition, the offset would need to be the position in the file relative to the start, not the start of the record batch. This is what the C++ implementation is doing now. 

Here's my idea: add an enum flag to RecordBatch that indicates whether the buffer offsets are absolute (relative to the start of the file or shared memory block) or relative to a contiguous blob of bytes (what the Java file implementation is doing now). The latter is not good necessarily for shared memory because it presumes contiguousness, but it also makes record batches relocatable when they are (e.g. in a file-like setting).

Relocatable record batch metadata / fully relative offsets is also better for RPC / socket-based exchange (which is effectively the same as sending a segment of the current "file format"), so that's another argument for adding that as an option.

I don't think I can make an argument that absolute (needed for general shared memory IPC) or relative (better for file / RPC) offsets as the only option. 

> Align Java and C++ RecordBatch data and metadata layout
> -------------------------------------------------------
>
>                 Key: ARROW-384
>                 URL: https://issues.apache.org/jira/browse/ARROW-384
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Julien Le Dem
>
> layout on C++ side:
> {noformat}
> <buffers> <metadata> <metadata size: int32>
> {noformat}
> and on the java side:
> {noformat}
> <metadata> <buffers>
> {noformat}
> In the file format the footer has a Block info that contains the metadata length.
> https://github.com/apache/arrow/blob/f082b17323354dc2af31f39c15c58b995ba08360/format/File.fbs#L36
> See:
> https://github.com/apache/arrow/pull/211#issuecomment-262080545
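
To illustrate the difference between the two layouts in the issue description, here is a minimal sketch of how a reader might locate the metadata in each one. The function names are hypothetical, and little-endian encoding of the trailing int32 is an assumption, since the description does not state the byte order.

```python
import struct


def split_trailing_size_layout(blob):
    """Layout: <buffers> <metadata> <metadata size: int32>.

    The metadata size is read from the last 4 bytes (little-endian assumed),
    so the blob is self-describing. Returns (buffers, metadata).
    """
    (meta_size,) = struct.unpack("<i", blob[-4:])
    meta_start = len(blob) - 4 - meta_size
    return blob[:meta_start], blob[meta_start:-4]


def split_metadata_first_layout(blob, meta_size):
    """Layout: <metadata> <buffers>.

    The metadata size is not stored in the blob, so the caller has to supply it,
    e.g. from the Block info in the file footer. Returns (buffers, metadata).
    """
    return blob[meta_size:], blob[:meta_size]
```

In the first layout a reader has to seek to the end of the batch before it can find the metadata, while in the second it can read the metadata first but needs the length from somewhere else.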



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)