Posted to github@arrow.apache.org by "JoaoAparicio (via GitHub)" <gi...@apache.org> on 2023/04/11 01:32:05 UTC
[GitHub] [arrow-julia] JoaoAparicio opened a new pull request, #422: Pre-allocate buffer
JoaoAparicio opened a new pull request, #422:
URL: https://github.com/apache/arrow-julia/pull/422
If we let transcode do its own allocation, it allocates a small vector, starts filling it, resizes the vector, fills it some more, resizes the vector again, and so on.
Instead, this commit pre-allocates a vector of the correct size and passes it to transcode().
Inspired by https://github.com/apache/arrow-julia/pull/399
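To illustrate the difference between the two allocation patterns described above, here is a minimal sketch using only Julia's Base library. It is not the code from this PR and does not call transcode(); `grow_and_fill` and `prealloc_and_fill` are hypothetical helpers, and the sketch assumes the final output size is known before the copy starts.

```julia
# Illustrative only: `grow_and_fill` and `prealloc_and_fill` are hypothetical
# helpers, not functions from Arrow.jl or TranscodingStreams.jl.

# Grow-as-you-go: start with a small buffer and double it whenever it fills up.
function grow_and_fill(src::Vector{UInt8}; chunk::Int = 16)
    out = Vector{UInt8}(undef, chunk)
    written = 0
    resizes = 0
    for pos in 1:chunk:length(src)
        n = min(pos + chunk - 1, length(src)) - pos + 1
        # Resize (possibly reallocating and copying) until the chunk fits.
        while written + n > length(out)
            resize!(out, 2 * length(out))
            resizes += 1
        end
        copyto!(out, written + 1, src, pos, n)
        written += n
    end
    resize!(out, written)  # trim unused capacity
    return out, resizes
end

# Pre-allocated: the size is known up front, so allocate once and never resize.
function prealloc_and_fill(src::Vector{UInt8})
    out = Vector{UInt8}(undef, length(src))
    copyto!(out, src)
    return out
end

src = rand(UInt8, 1_000)
grown, resizes = grow_and_fill(src)
fixed = prealloc_and_fill(src)
```

Both functions produce the same bytes; the first pays for repeated resizing along the way, which is the cost a correctly sized, pre-allocated output buffer is meant to avoid.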
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-julia] JoaoAparicio commented on pull request #422: Pre-allocate buffer
Posted by "JoaoAparicio (via GitHub)" <gi...@apache.org>.
JoaoAparicio commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502560810
Waiting on release of
https://github.com/JuliaIO/TranscodingStreams.jl/pull/136
[GitHub] [arrow-julia] Moelf commented on pull request #422: Pre-allocate buffer
Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502635310
```
stream = Arrow.Stream(path)
for tbl in stream
```
The problem with this approach is that it decompresses every branch. Consider examples such as:
- https://github.com/apache/arrow-julia/issues/417
Decompressing every branch would be much slower.
[GitHub] [arrow-julia] Moelf commented on pull request #422: Pre-allocate buffer
Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502609845
any thoughts on:
- https://github.com/apache/arrow-julia/issues/340
?
[GitHub] [arrow-julia] JoaoAparicio commented on pull request #422: Pre-allocate buffer
Posted by "JoaoAparicio (via GitHub)" <gi...@apache.org>.
JoaoAparicio commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502630284
I have some thoughts. One solution to "my dataset is larger than memory" is partitioning. If your dataset is partitioned in such a way that each partition fits in memory, you can iterate it with
```
stream = Arrow.Stream(path)
for tbl in stream
...
end
```
You can do this right now without requiring any additional features from this package.
In contrast, what is discussed in #340 (that is: don't decompress if you don't have to) is a different approach, but it doesn't exist yet.
Currently I have some commits that add multi-threaded decompression at the buffer level. I will try to upstream what I have so far. The difficulty is that these commits touch a lot of code, so this won't happen overnight. I imagine a couple of weeks? On top of that, it should be straightforward to implement what is discussed in #340.
[GitHub] [arrow-julia] ericphanson commented on pull request #422: Pre-allocate buffer
Posted by "ericphanson (via GitHub)" <gi...@apache.org>.
ericphanson commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1503531442
nope, LGTM!
[GitHub] [arrow-julia] baumgold merged pull request #422: Pre-allocate buffer
Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold merged PR #422:
URL: https://github.com/apache/arrow-julia/pull/422
[GitHub] [arrow-julia] baumgold commented on pull request #422: Pre-allocate buffer
Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1503514915
@quinnj / @ericphanson - any comments before we merge?