Posted to issues@arrow.apache.org by "Moelf (via GitHub)" <gi...@apache.org> on 2023/04/02 23:39:33 UTC

[GitHub] [arrow-julia] Moelf opened a new issue, #411: `Vector{UInt8}` mis-represented when writing to disk

Moelf opened a new issue, #411:
URL: https://github.com/apache/arrow-julia/issues/411

   ```julia
   julia> using Arrow, DataFrames
   
   julia> df = DataFrame(; x = [[0x01, 0x02], UInt8[], [0x03]])
   3×1 DataFrame
    Row │ x
        │ Array…
   ─────┼───────────────────
      1 │ UInt8[0x01, 0x02]
      2 │ UInt8[]
      3 │ UInt8[0x03]
   
   julia> Arrow.write("/tmp/julia.feather", df)
   "/tmp/julia.feather"
   ```
   
   ```python
   In [1]: import pyarrow.feather
   
   In [3]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
   Out[3]:
   <pyarrow.lib.ChunkedArray object at 0x7fb2994c86d0>
   [
     [
       0102,
       ,
       03
     ]
   ]
   ```
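A note on reading the output above: the entries `0102`, an empty entry, and `03` look like hex renderings of raw byte strings rather than lists of integers, which would be consistent with the column having been written as Arrow `binary`. Here is a minimal sketch in plain Python (no pyarrow involved; the names `rows`, `hex_rows` and `int_rows` are only illustrative) showing that hex-encoding the three rows reproduces those strings:

```python
# The three rows written from Julia, expressed as Python byte strings.
rows = [bytes([0x01, 0x02]), bytes(), bytes([0x03])]

# Hex-encode each row, which is how a raw binary value is commonly displayed.
hex_rows = [row.hex() for row in rows]
print(hex_rows)

# The same rows interpreted as lists of unsigned 8-bit integers,
# which is what the Julia side intended to store.
int_rows = [list(row) for row in rows]
print(int_rows)
```

This only illustrates the two readings of the same bytes; it does not inspect the file itself.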


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on issue #411: `Vector{UInt8}` mis-represented in metadata when writing to disk

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on issue #411:
URL: https://github.com/apache/arrow-julia/issues/411#issuecomment-1500679705

   https://github.com/apache/arrow-julia/blob/c469151d4ff261b50c59bf98101f068fa577fca4/src/arraytypes/list.jl#L194
   
   This seems to be the cause. One step earlier, `ToList()` converts both into a flat `Vector{UInt8}`, so the two cases cannot be told apart if you only look at the variable `flat`.
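To illustrate the point about `flat`: in the Arrow columnar layout, a variable-length binary column and a list-of-uint8 column are both stored as an offsets buffer plus one contiguous run of bytes. The sketch below is plain Python, not arrow-julia's code; `to_offsets_and_flat` is a hypothetical helper, and validity bitmaps are left out for brevity. It suggests why the flattened data alone cannot tell the two apart:

```python
def to_offsets_and_flat(rows):
    """Flatten variable-length rows into (offsets, flat bytes), Arrow-style."""
    offsets = [0]
    flat = bytearray()
    for row in rows:
        flat.extend(row)           # accepts bytes or an iterable of ints in 0..255
        offsets.append(len(flat))  # end position of this row in the flat buffer
    return offsets, bytes(flat)

as_binary = [b"\x01\x02", b"", b"\x03"]  # rows as byte strings
as_list_u8 = [[1, 2], [], [3]]           # rows as lists of UInt8 values

binary_buffers = to_offsets_and_flat(as_binary)
list_buffers = to_offsets_and_flat(as_list_u8)

print(binary_buffers)
print(list_buffers)
print(binary_buffers == list_buffers)
```

Under these assumptions the two inputs produce identical buffers, so the distinction has to be carried by the logical type recorded for the column, not by the flattened data.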




[GitHub] [arrow-julia] Moelf commented on issue #411: `Vector{UInt8}` mis-represented in metadata when writing to disk

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on issue #411:
URL: https://github.com/apache/arrow-julia/issues/411#issuecomment-1500670163

   I did some digging
   ```diff
   diff --git a/src/arraytypes/arraytypes.jl b/src/arraytypes/arraytypes.jl
   index f3cee5d..a338004 100644
   --- a/src/arraytypes/arraytypes.jl
   +++ b/src/arraytypes/arraytypes.jl
   @@ -34,7 +34,9 @@ Base.deleteat!(x::T, inds) where {T <: ArrowVector} = throw(ArgumentError("`$T`
    function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
        @debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(values(kw))"
        @debugv 3 x
   +    @show typeof(x)
        A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
   +    @show typeof(A)
        if compression isa LZ4FrameCompressor
            A = compress(Meta.CompressionTypes.LZ4_FRAME, compression, A)
        elseif compression isa Vector{LZ4FrameCompressor}
   ```
   ```julia
   julia> data = (; x = [[0x01, 0x02], UInt8[], [0x03]], y = [[0, 1], Int[], [2,3]])
   (x = Vector{UInt8}[[0x01, 0x02], [], [0x03]], y = [[0, 1], Int64[], [2, 3]])
   
   julia> Arrow.write("/tmp/bug411.feather", data)
   typeof(x) = Vector{Vector{UInt8}}
   typeof(A) = Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}}
   typeof(x) = Vector{Vector{Int64}}
   typeof(A) = Arrow.List{Vector{Int64}, Int32, Arrow.Primitive{Int64, Arrow.ToList{Int64, false, Vector{Int64}, Int32}}}
   "/tmp/bug411.feather"
   ```
   
   The question is why the `UInt8` column's child is built as a bare `ToList` while the `Int64` column's child is wrapped in `Arrow.Primitive`, when both seem to be valid primitive types: https://arrow.apache.org/docs/python/generated/pyarrow.uint8.html#pyarrow.uint8




[GitHub] [arrow-julia] quinnj closed issue #411: `Vector{UInt8}` mis-represented when writing to disk

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj closed issue #411: `Vector{UInt8}` mis-represented when writing to disk
URL: https://github.com/apache/arrow-julia/issues/411




[GitHub] [arrow-julia] Moelf commented on issue #411: `Vector{UInt8}` mis-represented in metadata when writing to disk

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on issue #411:
URL: https://github.com/apache/arrow-julia/issues/411#issuecomment-1500690716

   we also hit this part:
   https://github.com/apache/arrow-julia/blob/c469151d4ff261b50c59bf98101f068fa577fca4/src/eltypes.jl#L405-L407
   
   All in all, this looks like a deliberate choice, which I think is wrong given pyarrow's behavior and the uses of `Vector{UInt8}` that are not byte strings.




[GitHub] [arrow-julia] quinnj commented on issue #411: `Vector{UInt8}` mis-represented when writing to disk

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on issue #411:
URL: https://github.com/apache/arrow-julia/issues/411#issuecomment-1553618585

   I think it's a reasonable request to not treat `Vector{UInt8}` as the `Binary` arrow type and only have `CodeUnits` be treated that way.




[GitHub] [arrow-julia] Moelf commented on issue #411: `Vector{UInt8}` mis-represented when writing to disk

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on issue #411:
URL: https://github.com/apache/arrow-julia/issues/411#issuecomment-1496684663

   to show that `pyarrow` does something different and consistent:
   ```python
   In [8]: import pyarrow.feather, numpy as np, pandas as pd
   
   In [9]: df = pd.DataFrame({"x": [[np.uint8(0)], [np.uint8(1), np.uint8(2)]]})
   
   In [11]: pyarrow.feather.write_feather(df, "/tmp/pyarrow.feather", compression="uncompressed")
   
   In [12]: pyarrow.feather.read_table("/tmp/pyarrow.feather")["x"]
   Out[12]:
   <pyarrow.lib.ChunkedArray object at 0x7f80e3f93ec0>
   [
     [
       [
         0
       ],
       [
         1,
         2
       ]
     ]
   ]
   ```
   
   read it back from Julia
   ```julia
   julia> Arrow.Table("/tmp/pyarrow.feather").x
   2-element Arrow.List{Union{Missing, Vector{Union{Missing, UInt8}}}, Int32, Arrow.Primitive{Union{Missing, UInt8}, Vector{UInt8}}}:
    Union{Missing, UInt8}[0x00]
    Union{Missing, UInt8}[0x01, 0x02]
   ```

