You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by qu...@apache.org on 2022/11/03 16:19:38 UTC

[arrow-julia] branch main updated: Ensure Julia types have alignment respected (#357)

This is an automated email from the ASF dual-hosted git repository.

quinnj pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-julia.git


The following commit(s) were added to refs/heads/main by this push:
     new 6d0ac49  Ensure Julia types have alignment respected (#357)
6d0ac49 is described below

commit 6d0ac4946f062414e2b60aa3d67c2875bb2e7958
Author: Jacob Quinn <qu...@gmail.com>
AuthorDate: Thu Nov 3 10:19:32 2022 -0600

    Ensure Julia types have alignment respected (#357)
    
    Fixes https://github.com/apache/arrow-julia/issues/345.
    
    Alternative to https://github.com/apache/arrow-julia/pull/350.
    
    Whew, a bit of a rabbit hole here. It turns out this issue doesn't have
    anything to do w/ the _arrow_ alignment when writing, but actually goes
    back to a core Julia issue, specifically the inconsistency reported in
    https://github.com/JuliaLang/julia/issues/42326. Essentially, the issue
    is that different platforms (intel vs. arm64) are requiring different
    alignments for types like `UInt128` (8 for intel, 16 for arm64).
    
    And it turns out this isn't even a Julia issue, but all the way back to LLVM.
    
    So why does this crop up in Arrow.jl? Because when you try to serialize/deserialize
    a `Arrow.Primitive{Int128}` array, we're going to write it out in the proper arrow
    format, but when you read it back in, we've been using the zero-copy technique of:
    
    ```julia
    unsafe_wrap(Array, Ptr{Int128}(arrow_ptr), len)
    ```
    in order to give the user a Julia `Vector{Int128}` to the underlying arrow data.
    
    _BUT_, in Julia when you allocate any `Vector` with an eltype size of less than
    4, then only 8 bytes of alignment are specified. So the fact that most users
    will pass arrow data as a file (which is mmapped as `Vector{UInt8}`), a byte
    vector directly (`Vector{UInt8}`), or an IO (which we read as `Vector{UInt8}`),
    means these vectors are only 8-byte aligned. This then throws the fatal error reported
    in the original #350 issue about the pointer not being 16-byte aligned.
    
    So one option to consider is allowing users to pass in a 16-byte aligned arrow "source"
    and then that would just work, right? Well, except Julia doesn't expose any way of
    "upcasting" an array's alignment, so it's purely based off the array eltype. Which in
    turn is a struggle because the current Arrow.jl architecture assumes the source will be
    exactly a `Vector{UInt8}` everywhere. So essentially it requires some heruclean effort
    to try and pass your own aligned array; I think the only option I can think of is if you
    did something like:
      * Mmap an arrow file into Vector{UInt8}
      * Allocate your own Vector{UInt32} and copy arrow data into it
      * Use unsafe_wrap to make a Vector{UInt8} of the Vector{UInt32} data
      * But you _MUST_ keep a reference to the original Vector{UInt32} array
        otherwise, your new Vector{UInt8} gets corrupted
      * Pass the new Vector{UInt8} into Arrow.Table to read
    
    Anyway, not easy to say the least.
    
    The proposed solution here is as follows:
      * We check `sizeof(T)` to see if it's > 16 bytes, which we're using as a proxy
        for the alignment, regardless of platform (core devs are leaning towards the 8-byte
        alignment on intel actually being a bug for 16-byte primitives and I agree)
      * We use sizeof since the `jl_datatype_t` alignment field isn't part of the public Julia
        API and thus subject to change (and in fact I think it did just change location in Julia#master)
      * It's a good enough proxy for our purposes anyway
      * If the original arrow pointer _isn't_ 16-byte aligned, then we'll allocate a new `Vector{T}`,
        which _will_ be aligned, then copy the arrow data directly into it via pointers. Simple, easy,
        just one extra allocation/copy.
    
    If Julia _does_ get the ability in the future to specify a custom larger-than-eltype-required alignemnt
    for arrays, then we could potentially do that ourselves when reading, but it's a little tricky because
    we really only need to do that if there are 16-byte primitives we'll be deserializing and we don't know that
    until we read a schema message. So *shrug*.
---
 src/table.jl | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/table.jl b/src/table.jl
index 799b922..c4a2e3d 100644
--- a/src/table.jl
+++ b/src/table.jl
@@ -512,7 +512,22 @@ end
 function reinterp(::Type{T}, batch, buf, compression) where {T}
     ptr = pointer(batch.bytes, batch.pos + buf.offset)
     if compression === nothing
-        return batch.bytes, unsafe_wrap(Array, convert(Ptr{T}, ptr), div(buf.length, sizeof(T)))
+        # it would be technically more correct to check that T.layout->alignment > 8
+        # but the datatype alignment isn't officially exported, so we're using
+        # primitive types w/ sizeof(T) >= 16 as a proxy for types that need 16-byte alignment
+        if sizeof(T) >= 16 && (UInt(ptr) & 15) != 0
+            # https://github.com/apache/arrow-julia/issues/345
+            # https://github.com/JuliaLang/julia/issues/42326
+            # need to ensure that the data/pointers are aligned to 16 bytes
+            # so we can't use unsafe_wrap here, but do an extra allocation
+            # to avoid the allocation, user needs to ensure input buffer is
+            # 16-byte aligned (somehow, it's not super straightforward how to ensure that)
+            A = Vector{T}(undef, div(buf.length, sizeof(T)))
+            unsafe_copyto!(Ptr{UInt8}(pointer(A)), ptr, buf.length)
+            return batch.bytes, A
+        else
+            return batch.bytes, unsafe_wrap(Array, convert(Ptr{T}, ptr), div(buf.length, sizeof(T)))
+        end
     else
         # compressed
         len, decodedbytes = uncompress(ptr, buf, compression)