You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "baumgold (via GitHub)" <gi...@apache.org> on 2023/03/15 02:31:40 UTC

[GitHub] [arrow-julia] baumgold opened a new pull request, #405: do not instantiate new Vector when calling fromarrow

baumgold opened a new pull request, #405:
URL: https://github.com/apache/arrow-julia/pull/405

   Currently when there's a table containing a column of type variable-size list, whenever the user accesses an entry in the list a new Vector is instantiated causing allocations and performance degradation.  This PR proposes to prevent instantiating the Vector and instead return to the user a zero-copy view of the data (SubArray) that still abides by the AbstractVector interface.
   
   For example, instead of getting a newly-allocated `Vector{Int64}`:
   
   ```julia
   julia> using Arrow, Tables
   
   julia> x = (a=[1,2,3], b=[[1,1,1], [2,2], [3,3,3,3]]);
   
   julia> table = Arrow.Table(Arrow.tobuffer(x))
   Arrow.Table with 3 rows, 2 columns, and schema:
   :a  Int64
   :b  Vector{Int64} (alias for Array{Int64, 1})
   
   julia> p = iterate(Tables.rows(table))
   (Tables.ColumnsRow{Tables.CopiedColumns{Arrow.Table}}: (a = 1, b = [1, 1, 1]), 2)
   
   julia> p[1].b
   3-element Vector{Int64}:
   1
   1
   1
   ```
   
   we'll instead get a `SubArray`:
   
   ```julia
   julia> using Arrow, Tables
   
   julia> x = (a=[1,2,3], b=[[1,1,1], [2,2], [3,3,3,3]]);
   
   julia> table = Arrow.Table(Arrow.tobuffer(x))
   Arrow.Table with 3 rows, 2 columns, and schema:
   :a  Int64
   :b  Vector{Int64} (alias for Array{Int64, 1})
   
   julia> p = iterate(Tables.rows(table))
   (Tables.ColumnsRow{Tables.CopiedColumns{Arrow.Table}}: (a = 1, b = [1, 1, 1]), 2)
   
   julia> p[1].b
   3-element view(::Arrow.Primitive{Int64, Vector{Int64}}, 1:3) with eltype Int64:
   1
   1
   1
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] baumgold commented on pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1568907469

   > Maybe there's something we can do in the list.jl code or eltypes.jl to say that the variable sized list type returns SubArrays?
   
   I spent some time trying to implement that but couldn't find a clean way to do so.  For `juliaeltype` to return SubArray instead of Vector it would need to know the "parent" type of the SubArray ([parent-type is a type-parameter](https://github.com/JuliaLang/julia/blob/v1.9.0/base/subarray.jl#L14-L24)).  However the "parent" type is deeply embedded in return-type of the [build function](https://github.com/apache/arrow-julia/blob/v2.5.2/src/table.jl#L583).  We could refactor the build functions to separate type calculation from type instantiation so that the calculation part could be reused in the `juliaeltype` function, but that seems like a lot of work so I didn't peruse it yet.  Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] codecov-commenter commented on pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1469229741

   ## [Codecov](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#405](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a0876cf) into [main](https://codecov.io/gh/apache/arrow-julia/commit/aaa4cf878a2f116944d910b84aa4747fa3112ecb?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (aaa4cf8) will **increase** coverage by `0.02%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##             main     #405      +/-   ##
   ==========================================
   + Coverage   86.88%   86.91%   +0.02%     
   ==========================================
     Files          26       26              
     Lines        3257     3256       -1     
   ==========================================
     Hits         2830     2830              
   + Misses        427      426       -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [src/ArrowTypes/src/ArrowTypes.jl](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3JjL0Fycm93VHlwZXMvc3JjL0Fycm93VHlwZXMuamw=) | `86.47% <0.00%> (+0.50%)` | :arrow_up: |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] baumgold closed pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold closed pull request #405: do not instantiate new Vector when calling fromarrow
URL: https://github.com/apache/arrow-julia/pull/405


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] quinnj commented on pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1490699371

   Sorry for the slow response. Yeah.......I do like this. I think I remember considering doing this too back when I originally implemented this, but I was worried about something, but can't remember now.
   
   I like this change, but I'm worried about it being breaking.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] baumgold commented on pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1485280092

   @quinnj - Any thoughts on this proposed change?  Many of the datasets that I work with have schemas containing variable-sized lists, which are currently quite slow to load due to this allocation issue.  I'd appreciate some feedback on whether this is the right approach to solve the problem.  Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] quinnj commented on pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1571324812

   Here's a somewhat simple approach: https://github.com/apache/arrow-julia/pull/446. It's not quite passing things yet; I need to check if the append test error is legit. I think the others are just updates where we're checking the schema strictly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] baumgold commented on pull request #405: do not instantiate new Vector when calling fromarrow

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1573120969

   Closing in place off #446


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org