Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2021/02/13 03:56:56 UTC

Array offset in IPC

Hi,

I am going through the Rust implementation of the IPC format, and I am trying
to understand how we share ArrayData offsets.

Specifically, our C data interface supports the notion of an offset,
measured in slots, that denotes how many slots ahead of the buffer pointers
we read from. This enables us to share buffers between arrays, as we can
create a new array out of an existing array by slicing, and have its offset
increased.
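
As a rough illustration of slicing by offset, here is a minimal sketch. The
`Int32Slice` type and its fields are hypothetical stand-ins invented for this
example; they are not the Rust implementation's actual ArrayData. Slicing
shares the buffer and only changes the offset and length:

```rust
use std::sync::Arc;

/// Hypothetical, simplified array of i32 values (illustration only).
#[derive(Clone, Debug)]
struct Int32Slice {
    values: Arc<Vec<i32>>, // shared buffer, never copied when slicing
    offset: usize,         // slots to skip from the start of `values`
    len: usize,            // number of logical slots in this array
}

impl Int32Slice {
    fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { values: Arc::new(values), offset: 0, len }
    }

    /// Returns a view of `len` slots starting `offset` slots into `self`.
    /// Only the offset and length change; the buffer is shared.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            values: Arc::clone(&self.values),
            offset: self.offset + offset,
            len,
        }
    }

    /// Reads logical slot `i`, i.e. physical slot `offset + i`.
    fn get(&self, i: usize) -> i32 {
        assert!(i < self.len, "index out of bounds");
        self.values[self.offset + i]
    }
}
```

Slicing a slice accumulates offsets, so every view still points at the same
underlying allocation.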

However, I can't find this notion in our IPC format. We could argue that
the IPC format does not support it, and that we should just slice our buffers
before encoding them into the message. That works for any type whose
bit_width is a multiple of 8, but I can't see how it works for bitmap
buffers: when we have an offset of 3 (< 8) slots and a buffer with a validity
bitmap, we can't slice that buffer by 3 bits. If we do not share the offset
and the consumer assumes an offset of 0, the consumer reads the buffer
starting at bit 0, the 3 is lost in communication, and the validity bitmap
is reconstructed incorrectly.

One solution is to assume an offset of zero when reading from IPC. But as far as I
understand, in that case, producers must themselves only share bitmap
buffers that are aligned at "8 bit boundaries". For example, an array with
offset 3, len 12 and a (shared) validity buffer with

01101010, 01101010

can't just write the above to the message; it must write the "new" below:

new: (010){01101}, 0000[1101]
old: {01101}010, 0[1101](010)  # 12 + 3 = 15, unbracketed bits are ignored

i.e. skip the first 3 bits from the first byte and shift all bits
accordingly.
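
A minimal sketch of that shift, assuming Arrow's least-significant-bit
numbering within each byte. The function names `get_bit` and `realign_bitmap`
are made up for this illustration and are not part of any Arrow API:

```rust
/// Returns bit `i` of `bytes`, counting from the least significant bit
/// of each byte.
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

/// Copies `len` bits starting at bit `offset` into a new bitmap whose
/// first slot sits at bit 0. Padding bits in the last byte are zero.
fn realign_bitmap(bytes: &[u8], offset: usize, len: usize) -> Vec<u8> {
    assert!(offset + len <= bytes.len() * 8, "bitmap too short");
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if get_bit(bytes, offset + i) {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

With the buffer from the example, `realign_bitmap(&[0b0110_1010, 0b0110_1010], 3, 12)`
gives the two bytes `0b0100_1101, 0b0000_1101`, which match the "new" line
above. A bit-by-bit loop is the simplest way to show the idea; a real writer
would likely shift whole bytes or words instead.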

Is this reasoning correct? Is this the intention?

Best,
Jorge

Re: Array offset in IPC

Posted by Antoine Pitrou <an...@python.org>.
Hi Jorge,

On 13/02/2021 at 04:56, Jorge Cardoso Leitão wrote:
> 
> One solution is to assume an offset of zero when reading from IPC. But as far as I
> understand, in that case, producers must themselves only share bitmap
> buffers that are aligned at "8 bit boundaries". For example, an array with
> offset 3, len 12 and a (shared) validity buffer with
> 
> 01101010, 01101010
> 
> can't just write the above to the message; it must write the "new" below:
> 
> new: (010){01101}, 0000[1101]
> old: {01101}010, 0[1101](010)  # 12 + 3 = 15, unbracketed bits are ignored
> 
> i.e. skip the first 3 bits from the first byte and shift all bits
> accordingly.
> 
> Is this reasoning correct? Is this the intention?

This is right.  You can see the implementation in the C++ IPC writer
here, where non-byte-aligned bitmaps are copied to a temporary
buffer:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L84-L99

(note that this code is slightly suboptimal: it could avoid copying when
the offset is a multiple of 8)

This must be done for the data of boolean arrays as well:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L301-L307
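
For illustration only, here is a rough Rust sketch of the same idea with the
byte-aligned fast path added. It is not a port of the C++ code linked above;
`bitmap_for_ipc` is a hypothetical name, and least-significant-bit numbering
within each byte is assumed:

```rust
use std::borrow::Cow;

/// Returns the bytes to write for `len` bits starting at bit `offset`.
/// Borrows from `bytes` when the offset is a multiple of 8, and copies
/// into a shifted temporary buffer otherwise.
fn bitmap_for_ipc(bytes: &[u8], offset: usize, len: usize) -> Cow<'_, [u8]> {
    assert!(offset + len <= bytes.len() * 8, "bitmap too short");
    let n_bytes = (len + 7) / 8;
    if offset % 8 == 0 {
        // Byte-aligned: a plain byte slice already starts at the first slot.
        // Note: bits past `len` in the last byte are not cleared here.
        let start = offset / 8;
        return Cow::Borrowed(&bytes[start..start + n_bytes]);
    }
    // Not byte-aligned: copy bit by bit so that slot 0 lands on bit 0.
    let mut out = vec![0u8; n_bytes];
    for i in 0..len {
        let src = offset + i;
        if (bytes[src / 8] >> (src % 8)) & 1 == 1 {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    Cow::Owned(out)
}
```

Returning a `Cow` lets the caller treat both cases uniformly while only
paying for an allocation when a shift is actually needed.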

Regards

Antoine.

Re: Array offset in IPC

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jorge,
This is correct. To my knowledge, offsets are not modelled in IPC.  There was
a lot of debate on whether to include them in the C data interface.

Cheers,
Micah
