Posted to github@arrow.apache.org by "JoaoAparicio (via GitHub)" <gi...@apache.org> on 2023/04/11 01:32:05 UTC

[GitHub] [arrow-julia] JoaoAparicio opened a new pull request, #422: Pre-allocate buffer

JoaoAparicio opened a new pull request, #422:
URL: https://github.com/apache/arrow-julia/pull/422

   If we let transcode do its own allocation, it allocates a small vector, starts filling it, resizes the vector, fills it some more, resizes it again, and so on.
   
   Instead, in this commit we pre-allocate a vector of the correct size and pass it to transcode().
   
   Inspired by https://github.com/apache/arrow-julia/pull/399
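   
   A minimal sketch of the two allocation patterns described above, written in plain Julia. The functions `decode_growing` and `decode_preallocated!` are hypothetical stand-ins that only copy bytes in chunks; they are not the TranscodingStreams.jl API and not the code in this PR.
   
   ```julia
   # Hypothetical decoder that does not know the output size up front:
   # it starts with an empty vector and grows it once per chunk.
   function decode_growing(src::Vector{UInt8}; chunk::Int = 4)
       out = Vector{UInt8}(undef, 0)
       pos = 1
       while pos <= length(src)
           n = min(chunk, length(src) - pos + 1)
           oldlen = length(out)
           resize!(out, oldlen + n)              # may reallocate and copy
           copyto!(out, oldlen + 1, src, pos, n)
           pos += n
       end
       return out
   end
   
   # Hypothetical decoder that fills a caller-provided buffer of known size,
   # so no resizing happens while decoding.
   function decode_preallocated!(out::Vector{UInt8}, src::Vector{UInt8}; chunk::Int = 4)
       length(out) == length(src) || throw(ArgumentError("output buffer has the wrong size"))
       pos = 1
       while pos <= length(src)
           n = min(chunk, length(src) - pos + 1)
           copyto!(out, pos, src, pos, n)
           pos += n
       end
       return out
   end
   
   src = collect(UInt8, 1:10)
   grown = decode_growing(src)
   prealloc = decode_preallocated!(Vector{UInt8}(undef, length(src)), src)
   ```
   
   Both calls produce the same bytes; the second avoids the repeated resizing because the final length is known before decoding starts.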


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-julia] JoaoAparicio commented on pull request #422: Pre-allocate buffer

Posted by "JoaoAparicio (via GitHub)" <gi...@apache.org>.
JoaoAparicio commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502560810

   Waiting on release of 
   https://github.com/JuliaIO/TranscodingStreams.jl/pull/136




[GitHub] [arrow-julia] Moelf commented on pull request #422: Pre-allocate buffer

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502635310

   ```
   stream = Arrow.Stream(path)
   for tbl in stream
       ...
   end
   ```
   
   The problem with this approach is that it decompresses every branch. Consider examples such as:
   - https://github.com/apache/arrow-julia/issues/417
   
   Decompressing every branch would be much slower.
   
   




[GitHub] [arrow-julia] Moelf commented on pull request #422: Pre-allocate buffer

Posted by "Moelf (via GitHub)" <gi...@apache.org>.
Moelf commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502609845

   Any thoughts on https://github.com/apache/arrow-julia/issues/340?




[GitHub] [arrow-julia] JoaoAparicio commented on pull request #422: Pre-allocate buffer

Posted by "JoaoAparicio (via GitHub)" <gi...@apache.org>.
JoaoAparicio commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502630284

   I have some thoughts. One solution to "my dataset is larger than memory" is partitioning. If your dataset is partitioned in such a way that each partition fits in memory, you can iterate over it with
   ```
   stream = Arrow.Stream(path)
   for tbl in stream
       ...
   end
   ```
   You can do this right now without requiring any additional features from this package.
   
   In contrast, what is discussed in #340 (don't decompress if you don't have to) is a different approach, but it doesn't exist yet.
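   
   To make the idea concrete, here is a rough sketch of lazy decompression in plain Julia. Everything in it is an assumption for illustration: `LazyColumn`, `materialize!`, and `fake_decode` are hypothetical names, and `fake_decode` just reverses the bytes as a placeholder for a real decompressor. It is not arrow-julia code.
   
   ```julia
   # Hypothetical lazy column: keeps the encoded bytes and only decodes them
   # the first time the data is actually requested.
   mutable struct LazyColumn
       encoded::Vector{UInt8}
       decoded::Union{Nothing,Vector{UInt8}}
   end
   LazyColumn(encoded::Vector{UInt8}) = LazyColumn(encoded, nothing)
   
   # Placeholder for a real decompression routine (assumption).
   fake_decode(bytes::Vector{UInt8}) = reverse(bytes)
   
   function materialize!(col::LazyColumn)
       if col.decoded === nothing
           col.decoded = fake_decode(col.encoded)   # pay the cost only once
       end
       return col.decoded
   end
   
   cols = Dict(
       "a" => LazyColumn(UInt8[1, 2, 3]),
       "b" => LazyColumn(UInt8[4, 5, 6]),
   )
   
   a = materialize!(cols["a"])   # only column "a" is decoded
   ```
   
   Columns that are never accessed, such as "b" here, stay encoded and cost nothing to decode.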
   
   Currently I have some commits that add multi-threaded decompression at the buffer level. I will be trying to upstream what I have so far. The difficulty is that these commits touch a lot of code, so this won't happen overnight. I imagine a couple of weeks? On top of that, it should be straightforward to implement what is discussed in #340.
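   
   As a rough illustration of decompressing at the buffer level with multiple threads, one task can be spawned per buffer, each writing into its own pre-allocated output. This is only a sketch under assumptions: `fake_decode!` and `decode_all` are hypothetical names, `fake_decode!` copies bytes instead of decompressing, and the real commits may be structured quite differently.
   
   ```julia
   # Hypothetical stand-in for decompressing one buffer into a pre-allocated output.
   function fake_decode!(out::Vector{UInt8}, src::Vector{UInt8})
       copyto!(out, src)   # real code would call a decompressor here
       return out
   end
   
   function decode_all(buffers::Vector{Vector{UInt8}})
       # One task per buffer; each task writes only to its own output vector,
       # so the tasks share no mutable state.
       tasks = map(buffers) do src
           Threads.@spawn fake_decode!(Vector{UInt8}(undef, length(src)), src)
       end
       return map(fetch, tasks)   # results come back in buffer order
   end
   
   buffers = [rand(UInt8, 16) for _ in 1:4]
   decoded = decode_all(buffers)
   ```
   
   The tasks only run in parallel when Julia is started with more than one thread (for example `julia -t 4`); with a single thread the code still works but runs the tasks one after another.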




[GitHub] [arrow-julia] ericphanson commented on pull request #422: Pre-allocate buffer

Posted by "ericphanson (via GitHub)" <gi...@apache.org>.
ericphanson commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1503531442

   nope, LGTM!




[GitHub] [arrow-julia] baumgold merged pull request #422: Pre-allocate buffer

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold merged PR #422:
URL: https://github.com/apache/arrow-julia/pull/422




[GitHub] [arrow-julia] baumgold commented on pull request #422: Pre-allocate buffer

Posted by "baumgold (via GitHub)" <gi...@apache.org>.
baumgold commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1503514915

   @quinnj / @ericphanson - any comments before we merge?

