You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "baumgold (via GitHub)" <gi...@apache.org> on 2023/03/15 02:31:40 UTC
[GitHub] [arrow-julia] baumgold opened a new pull request, #405: do not instantiate new Vector when calling fromarrow
baumgold opened a new pull request, #405:
URL: https://github.com/apache/arrow-julia/pull/405
Currently when there's a table containing a column of type variable-size list, whenever the user accesses an entry in the list a new Vector is instantiated causing allocations and performance degradation. This PR proposes to prevent instantiating the Vector and instead return to the user a zero-copy view of the data (SubArray) that still abides by the AbstractVector interface.
For example, instead of getting a newly-allocated `Vector{Int64}`:
```julia
julia> using Arrow, Tables
julia> x = (a=[1,2,3], b=[[1,1,1], [2,2], [3,3,3,3]]);
julia> table = Arrow.Table(Arrow.tobuffer(x))
Arrow.Table with 3 rows, 2 columns, and schema:
:a Int64
:b Vector{Int64} (alias for Array{Int64, 1})
julia> p = iterate(Tables.rows(table))
(Tables.ColumnsRow{Tables.CopiedColumns{Arrow.Table}}: (a = 1, b = [1, 1, 1]), 2)
julia> p[1].b
3-element Vector{Int64}:
1
1
1
```
we'll instead get a `SubArray`:
```julia
julia> using Arrow, Tables
julia> x = (a=[1,2,3], b=[[1,1,1], [2,2], [3,3,3,3]]);
julia> table = Arrow.Table(Arrow.tobuffer(x))
Arrow.Table with 3 rows, 2 columns, and schema:
:a Int64
:b Vector{Int64} (alias for Array{Int64, 1})
julia> p = iterate(Tables.rows(table))
(Tables.ColumnsRow{Tables.CopiedColumns{Arrow.Table}}: (a = 1, b = [1, 1, 1]), 2)
julia> p[1].b
3-element view(::Arrow.Primitive{Int64, Vector{Int64}}, 1:3) with eltype Int64:
1
1
1
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] baumgold commented on pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1568907469
> Maybe there's something we can do in the list.jl code or eltypes.jl to say that the variable sized list type returns SubArrays?
I spent some time trying to implement that but couldn't find a clean way to do so. For `juliaeltype` to return SubArray instead of Vector it would need to know the "parent" type of the SubArray ([parent-type is a type-parameter](https://github.com/JuliaLang/julia/blob/v1.9.0/base/subarray.jl#L14-L24)). However the "parent" type is deeply embedded in return-type of the [build function](https://github.com/apache/arrow-julia/blob/v2.5.2/src/table.jl#L583). We could refactor the build functions to separate type calculation from type instantiation so that the calculation part could be reused in the `juliaeltype` function, but that seems like a lot of work so I didn't peruse it yet. Thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] codecov-commenter commented on pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1469229741
## [Codecov](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#405](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a0876cf) into [main](https://codecov.io/gh/apache/arrow-julia/commit/aaa4cf878a2f116944d910b84aa4747fa3112ecb?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (aaa4cf8) will **increase** coverage by `0.02%`.
> The diff coverage is `0.00%`.
```diff
@@ Coverage Diff @@
## main #405 +/- ##
==========================================
+ Coverage 86.88% 86.91% +0.02%
==========================================
Files 26 26
Lines 3257 3256 -1
==========================================
Hits 2830 2830
+ Misses 427 426 -1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [src/ArrowTypes/src/ArrowTypes.jl](https://codecov.io/gh/apache/arrow-julia/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3JjL0Fycm93VHlwZXMvc3JjL0Fycm93VHlwZXMuamw=) | `86.47% <0.00%> (+0.50%)` | :arrow_up: |
:mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] baumgold closed pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold closed pull request #405: do not instantiate new Vector when calling fromarrow
URL: https://github.com/apache/arrow-julia/pull/405
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] quinnj commented on pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1490699371
Sorry for the slow response. Yeah.......I do like this. I think I remember considering doing this too back when I originally implemented this, but I was worried about something, but can't remember now.
I like this change, but I'm worried about it being breaking.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] baumgold commented on pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1485280092
@quinnj - Any thoughts on this proposed change? Many of the datasets that I work with have schemas containing variable-sized lists, which are currently quite slow to load due to this allocation issue. I'd appreciate some feedback on whether this is the right approach to solve the problem. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] quinnj commented on pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "quinnj (via GitHub)" <gi...@apache.org>.
quinnj commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1571324812
Here's a somewhat simple approach: https://github.com/apache/arrow-julia/pull/446. It's not quite passing things yet; I need to check if the append test error is legit. I think the others are just updates where we're checking the schema strictly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] baumgold commented on pull request #405: do not instantiate new Vector when calling fromarrow
Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #405:
URL: https://github.com/apache/arrow-julia/pull/405#issuecomment-1573120969
Closing in place off #446
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org