You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Moelf (via GitHub)" <gi...@apache.org> on 2023/04/07 23:44:34 UTC

[GitHub] [arrow-julia] Moelf opened a new pull request, #419: fix Vector{UInt8} writing

Moelf opened a new pull request, #419:
URL: https://github.com/apache/arrow-julia/pull/419

   fix #411
   
   Setup:
   ```julia
   julia> using Arrow
   
   julia> df = (; x = [[0x01, 0x02], UInt8[], [0x03]])
   
   julia> Arrow.write("/tmp/julia.feather", df)
   "/tmp/julia.feather"
   ```
   
   Before:
   ```python
   
   In [3]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
   Out[3]:
   <pyarrow.lib.ChunkedArray object at 0x7fb2994c86d0>
   [
     [
       0102,
       ,
       03
     ]
   ]
   ```
   
   After:
   ```python
   In [39]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
   Out[39]:
   <pyarrow.lib.ChunkedArray object at 0x7f4568b376a0>
   [
     [
       [
         1,
         2
       ],
       [],
       [
         3
       ]
     ]
   ]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] baumgold commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546790826

   I see.  Probably we should split this up into 2 smaller PRs: one for ArrowTypes and one for Arrow.  Assuming we make a new release of ArrowTypes then in the follow-on Arrow PR we can bump the compat of ArrowTypes to the newly released version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on a diff in pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on code in PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#discussion_r1164964813


##########
src/eltypes.jl:
##########
@@ -393,23 +393,21 @@ end
 
 # arrowtype will call fieldoffset recursively for children
 function arrowtype(b, x::List{T, O, A}) where {T, O, A}
-    if eltype(A) == UInt8
-        if T <: AbstractString || T <: Union{AbstractString, Missing}
-            if O == Int32
-                Meta.utf8Start(b)
-                return Meta.Utf8, Meta.utf8End(b), nothing
-            else # if O == Int64
-                Meta.largUtf8Start(b)
-                return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
-            end
-        else # if Vector{UInt8}
-            if O == Int32
-                Meta.binaryStart(b)
-                return Meta.Binary, Meta.binaryEnd(b), nothing
-            else # if O == Int64
-                Meta.largeBinaryStart(b)
-                return Meta.LargeBinary, Meta.largeBinaryEnd(b), nothing
-            end
+    if eltype(A) == UInt8 && T <: AbstractString || T <: Union{AbstractString, Missing}
+        if O == Int32
+            Meta.utf8Start(b)
+            return Meta.Utf8, Meta.utf8End(b), nothing
+        else # if O == Int64
+            Meta.largUtf8Start(b)
+            return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
+        end
+    elseif eltype(A) == UInt8 && ArrowTypes.isstringtype(T)

Review Comment:
   the only additional type cathed here is `Base.CodeUnits`.
   
   Ideally, we want the first (above) used for long string and this one for short-string, but that's not a well defined concept, so for now we just say, "use CodeUnits if you want output byte-string in Arrow"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] quinnj commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1553761868

   Ok, PR up: https://github.com/apache/arrow-julia/pull/439. Sorry to be so MIA lately; I've been tied up in some heavy other projects and it's been too hard to context-switch back here. A lot of that work (webstack-related) has wrapped up (mostly), I'm hoping to have more time to help review/fix stuff here.
   
   This PR was definitely in the right direction @Moelf; thanks for the contribution. It was a great starting point. Couple of specific points:
   * I'd rather not pun the `ArrowTypes.isstringtype` function for non-ListKind types; it keeps that a little cleaner IMO
   * We can make Base.CodeUnits the main translation between the Arrow Binary type
   * We can add some compat shims in Arrow to avoid the awkward ArrowTypes/Arrow mis-compat


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546887888

   So what people think, I should do a separate PR to add convince function??


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] quinnj commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1553671653

   @Moelf, do you mind if I make an alternative PR to fix the original issue here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] ericphanson commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "ericphanson (via GitHub)" <gi...@apache.org>.
ericphanson commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546859945

   > isn't the whole point of monorepo so that when we have 2 packages we don't have to make rapid release to fix tests for the other package?
   
   I’m not sure all the reasons for using this setup in this case but that definitely isn’t always a goal. E.g. https://discourse.julialang.org/t/how-beacon-packages-julia-code-in-a-monorepo/90822. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on a diff in pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on code in PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#discussion_r1198275423


##########
src/eltypes.jl:
##########
@@ -393,23 +393,21 @@ end
 
 # arrowtype will call fieldoffset recursively for children
 function arrowtype(b, x::List{T, O, A}) where {T, O, A}
-    if eltype(A) == UInt8
-        if T <: AbstractString || T <: Union{AbstractString, Missing}
-            if O == Int32
-                Meta.utf8Start(b)
-                return Meta.Utf8, Meta.utf8End(b), nothing
-            else # if O == Int64
-                Meta.largUtf8Start(b)
-                return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
-            end
-        else # if Vector{UInt8}
-            if O == Int32
-                Meta.binaryStart(b)
-                return Meta.Binary, Meta.binaryEnd(b), nothing
-            else # if O == Int64
-                Meta.largeBinaryStart(b)
-                return Meta.LargeBinary, Meta.largeBinaryEnd(b), nothing
-            end
+    if eltype(A) == UInt8 && T <: AbstractString || T <: Union{AbstractString, Missing}
+        if O == Int32
+            Meta.utf8Start(b)
+            return Meta.Utf8, Meta.utf8End(b), nothing
+        else # if O == Int64
+            Meta.largUtf8Start(b)
+            return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
+        end
+    elseif eltype(A) == UInt8 && ArrowTypes.isstringtype(T)

Review Comment:
   sorry, "caught" (I probably tried to type catched, isn't a word), as in caught by the condition



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] quinnj commented on a diff in pull request #419: fix `Vector{UInt8}` writing

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on code in PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#discussion_r1198274330


##########
src/eltypes.jl:
##########
@@ -393,23 +393,21 @@ end
 
 # arrowtype will call fieldoffset recursively for children
 function arrowtype(b, x::List{T, O, A}) where {T, O, A}
-    if eltype(A) == UInt8
-        if T <: AbstractString || T <: Union{AbstractString, Missing}
-            if O == Int32
-                Meta.utf8Start(b)
-                return Meta.Utf8, Meta.utf8End(b), nothing
-            else # if O == Int64
-                Meta.largUtf8Start(b)
-                return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
-            end
-        else # if Vector{UInt8}
-            if O == Int32
-                Meta.binaryStart(b)
-                return Meta.Binary, Meta.binaryEnd(b), nothing
-            else # if O == Int64
-                Meta.largeBinaryStart(b)
-                return Meta.LargeBinary, Meta.largeBinaryEnd(b), nothing
-            end
+    if eltype(A) == UInt8 && T <: AbstractString || T <: Union{AbstractString, Missing}
+        if O == Int32
+            Meta.utf8Start(b)
+            return Meta.Utf8, Meta.utf8End(b), nothing
+        else # if O == Int64
+            Meta.largUtf8Start(b)
+            return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
+        end
+    elseif eltype(A) == UInt8 && ArrowTypes.isstringtype(T)

Review Comment:
   What does "cathed" mean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1553678257

   go ahead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1542647754

   @quinnj 👀  gentle bump?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1548023108

   just to be clear, in that case we will be doing:
   1. make a PR to add a convenient function ("will be used in next PR trust me")
   2. release a new ArrowTypes.jl for that, and potentially making a bunch of users re-precompile their stuff for no reason
   3. see tests in this PR turn all green?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] quinnj closed pull request #419: fix `Vector{UInt8}` writing

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj closed pull request #419: fix `Vector{UInt8}` writing
URL: https://github.com/apache/arrow-julia/pull/419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546805883

   what?.........


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1500732469

   test wouldn't fully pass, could use some help and also this is technically breaking but I believe this is the correct thing to do.
   
   If user wants byte-string, they should use `b""` and not `Vector{UInt8}` @quinnj 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1506309702

   The reason why this PR doesn't need to touch test is because our implementation was self-consistent in a hard way -- it ignored an entire possible type composition in arrow:
   ![image](https://user-images.githubusercontent.com/5306213/231649413-14f1135b-d9c9-4314-9c64-65befcb0cdb8.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1503696228

   CI waiting for #424


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] simsurace commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "simsurace (via GitHub)" <gi...@apache.org>.
simsurace commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1547215131

   If this is what it takes to keep the ball moving, I would suggest doing so, but I am not a maintainer, just an observer interested to see this package continue getting improved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] simsurace commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "simsurace (via GitHub)" <gi...@apache.org>.
simsurace commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546338769

   @ericphanson or @baumgold would you be willing/able to review this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546663216

   @baumgold no they are not
   
   The tests are failing because this changes both code base, you should look at this test:
   
   https://github.com/apache/arrow-julia/actions/runs/4685294460/jobs/8302259625?pr=419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] baumgold commented on pull request #419: fix `Vector{UInt8}` writing

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#issuecomment-1546659025

   > @ericphanson or @baumgold would you be willing/able to review this?
   
   Looks like some of the tests are failing.  Would you mind investigating and fixing?  Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] Moelf commented on a diff in pull request #419: fix `Vector{UInt8}` writing

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on code in PR #419:
URL: https://github.com/apache/arrow-julia/pull/419#discussion_r1198275423


##########
src/eltypes.jl:
##########
@@ -393,23 +393,21 @@ end
 
 # arrowtype will call fieldoffset recursively for children
 function arrowtype(b, x::List{T, O, A}) where {T, O, A}
-    if eltype(A) == UInt8
-        if T <: AbstractString || T <: Union{AbstractString, Missing}
-            if O == Int32
-                Meta.utf8Start(b)
-                return Meta.Utf8, Meta.utf8End(b), nothing
-            else # if O == Int64
-                Meta.largUtf8Start(b)
-                return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
-            end
-        else # if Vector{UInt8}
-            if O == Int32
-                Meta.binaryStart(b)
-                return Meta.Binary, Meta.binaryEnd(b), nothing
-            else # if O == Int64
-                Meta.largeBinaryStart(b)
-                return Meta.LargeBinary, Meta.largeBinaryEnd(b), nothing
-            end
+    if eltype(A) == UInt8 && T <: AbstractString || T <: Union{AbstractString, Missing}
+        if O == Int32
+            Meta.utf8Start(b)
+            return Meta.Utf8, Meta.utf8End(b), nothing
+        else # if O == Int64
+            Meta.largUtf8Start(b)
+            return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
+        end
+    elseif eltype(A) == UInt8 && ArrowTypes.isstringtype(T)

Review Comment:
   sorry, "caught", as in caught by the condition



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org