Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2022/08/01 16:55:18 UTC

[QUESTION] How is mmap implemented for 8bit padded files?

Hi,

I am trying to follow the C++ implementation with respect to mmap IPC files
and reading them zero-copy, in the context of reproducing it in Rust.

My understanding from reading the source code is that we essentially:
* identify the memory regions (offset and length) of each of the buffers,
via IPC's flatbuffer "Node".
* cast the uint8 pointer to the corresponding type based on the datatype
(e.g. f32 for float32)

I am struggling to understand how we ensure that the pointer is aligned
[2,3] to the type (e.g. f32) so that the uint8 pointer can be safely cast
to it.
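
To make the concern concrete, here is a minimal Rust sketch of the second
step above with the alignment check made explicit. The function name
try_view_as_f32 is hypothetical and is not taken from any Arrow
implementation; it only illustrates the condition under which the cast
would be sound in Rust.

```rust
use std::mem::{align_of, size_of};

/// Reinterpret `bytes` as `&[f32]` without copying.
/// Returns `None` when the region is not aligned or sized for `f32`.
/// (Hypothetical helper, for illustration only.)
fn try_view_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    let addr = bytes.as_ptr() as usize;
    if addr % align_of::<f32>() != 0 || bytes.len() % size_of::<f32>() != 0 {
        return None;
    }
    // SAFETY: the pointer is aligned for f32 (checked above), it covers
    // `bytes.len()` initialized bytes, and every bit pattern is a valid f32.
    Some(unsafe {
        std::slice::from_raw_parts(
            bytes.as_ptr() as *const f32,
            bytes.len() / size_of::<f32>(),
        )
    })
}
```

With an 8-bit padded file, the start address of a buffer can land on any
byte, so a check of this kind would fail for some buffers.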

In other words, I would expect mmap to work when:
* the files' bit padding is 64 bits
* the target type is <= 64 bits
However,
* we have types with more than 64 bits (int128 and int256)
* a file can be 8-bit aligned

The background is that Rust requires pointers to be aligned to the type for
safe casting (reading through an unaligned pointer is UB), and the above
naturally poses a challenge when reading i128, i256 and 8-bit padded files.

Best,
Jorge

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc
[2] https://en.wikipedia.org/wiki/Data_structure_alignment
[3] https://stackoverflow.com/a/4322950/931303

Re: [QUESTION] How is mmap implemented for 8bit padded files?

Posted by Antoine Pitrou <an...@python.org>.
On 03/08/2022 at 18:29, Jorge Cardoso Leitão wrote:
> Hi Antoine,
> 
> Thanks a lot for your answer.
> 
> So, if I understand correctly (I may not have), we do not impose
> restrictions on the alignment of the data when we get the pointer; only
> when we read from it.
> Doesn't this require checking for alignment at runtime?

Only if you do things that are alignment-sensitive.

That said, while unaligned data is formally allowed AFAIK, it probably
occurs rarely, so potential issues (if any) have probably not surfaced.
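
For what it's worth, a runtime check of this kind can be cheap. The Rust
sketch below is only an illustration (f32_values is a hypothetical name,
not an Arrow API): it borrows the bytes zero-copy when they happen to be
aligned and falls back to a copy otherwise. Trailing bytes that do not
form a whole f32 are ignored.

```rust
use std::borrow::Cow;
use std::mem::{align_of, size_of};

/// View `bytes` as f32 values: zero-copy when aligned, copied otherwise.
/// (Hypothetical helper, for illustration only.)
fn f32_values<'a>(bytes: &'a [u8]) -> Cow<'a, [f32]> {
    let width = size_of::<f32>();
    let addr = bytes.as_ptr() as usize;
    if addr % align_of::<f32>() == 0 {
        // SAFETY: aligned for f32 (checked above), `len / width` whole
        // values fit in `bytes`, and every bit pattern is a valid f32.
        Cow::Borrowed(unsafe {
            std::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / width)
        })
    } else {
        // Unaligned: decode each 4-byte chunk into an owned vector.
        Cow::Owned(
            bytes
                .chunks_exact(width)
                .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

The check is one modulo per buffer rather than per value, so the cost is
paid once when the buffer is opened.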

Best regards

Antoine.



Re: [QUESTION] How is mmap implemented for 8bit padded files?

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi Antoine,

Thanks a lot for your answer.

So, if I understand correctly (I may not have), we do not impose
restrictions on the alignment of the data when we get the pointer; only
when we read from it.
Doesn't this require checking for alignment at runtime?

Best,
Jorge




Re: [QUESTION] How is mmap implemented for 8bit padded files?

Posted by Antoine Pitrou <an...@python.org>.
Hi Jorge,

So there are two aspects to the answer:

- ideally, the C++ implementation also works on non-aligned data (though
this is poorly tested, if at all)

- when mmap'ing a file, you should get a page-aligned address
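
To spell out what page alignment buys: if the mapping base is a multiple
of the page size, a buffer's alignment depends only on its offset within
the file. A small Rust sketch of that arithmetic follows. PAGE_SIZE and
buffer_is_aligned are assumptions made for the example; the real page
size should be queried from the OS.

```rust
/// Assumed page size for the example; real code should ask the OS.
const PAGE_SIZE: usize = 4096;

/// Whether a buffer `offset` bytes into a page-aligned mapping is aligned
/// to `align` bytes. (Hypothetical helper, for illustration only.)
fn buffer_is_aligned(mapping_base: usize, offset: usize, align: usize) -> bool {
    assert!(mapping_base % PAGE_SIZE == 0, "mapping base must be page-aligned");
    assert!(align.is_power_of_two() && align <= PAGE_SIZE);
    // The base is a multiple of PAGE_SIZE, hence of `align`, so the address
    // `mapping_base + offset` is aligned exactly when `offset` is.
    offset % align == 0
}
```

For example, a buffer at offset 64 is fine for an 8-byte type, while one
at offset 8 is not aligned for a type that needs 16-byte alignment.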

As for int128 and int256, these usually don't exist at the hardware 
level anyway, so implementing those reads as a combination of 64-bit 
reads shouldn't hurt performance-wise.
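
As an illustration of the "combination of 64-bit reads" idea, here is a
minimal Rust sketch that assembles an i128 from two 64-bit halves. It
assumes the 16 bytes are stored little-endian, and read_i128_le is a
hypothetical name. It takes plain bytes, so it places no alignment
requirement on the source.

```rust
/// Build an i128 from 16 little-endian bytes using two 64-bit reads.
/// (Hypothetical helper; assumes little-endian storage.)
fn read_i128_le(bytes: &[u8; 16]) -> i128 {
    let mut lo = [0u8; 8];
    let mut hi = [0u8; 8];
    lo.copy_from_slice(&bytes[..8]);
    hi.copy_from_slice(&bytes[8..]);
    // Low half first, then the high half shifted into the upper 64 bits.
    let lo = u64::from_le_bytes(lo) as u128;
    let hi = u64::from_le_bytes(hi) as u128;
    ((hi << 64) | lo) as i128
}
```

The result should match what i128::from_le_bytes returns for the same
input.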

More generally, I don't know about Rust but in C++ unaligned access 
would be made UB-safe by using the memcpy trick, which is correctly 
optimized by production compilers:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L55-L69
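
For comparison, the closest Rust counterpart to the memcpy trick that I am
aware of is std::ptr::read_unaligned, which copies the bytes into a local
value instead of dereferencing a typed pointer. The sketch below is only an
illustration and load_unaligned_f32 is a hypothetical name, not something
from the Arrow code base.

```rust
use std::mem::size_of;
use std::ptr;

/// Load the `index`-th f32 from `bytes`, whatever the alignment of `bytes`.
/// Returns `None` if the value would fall outside the slice.
/// (Hypothetical helper, for illustration only.)
fn load_unaligned_f32(bytes: &[u8], index: usize) -> Option<f32> {
    let start = index.checked_mul(size_of::<f32>())?;
    let end = start.checked_add(size_of::<f32>())?;
    let src = bytes.get(start..end)?;
    // SAFETY: `src` covers exactly size_of::<f32>() readable bytes,
    // read_unaligned has no alignment requirement, and every bit pattern
    // is a valid f32.
    Some(unsafe { ptr::read_unaligned(src.as_ptr() as *const f32) })
}
```

A fully safe variant would use f32::from_ne_bytes on a 4-byte array
instead of the unsafe read.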

Regards

Antoine.

