Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/06 14:14:34 UTC

[GitHub] [arrow-julia] Moelf opened a new issue, #340: Feather file with compression and larger than RAM

Moelf opened a new issue, #340:
URL: https://github.com/apache/arrow-julia/issues/340

   Last time I checked, `mmap` breaks down for files with compression. This is understandable because the compressed buffers clearly can't be re-interpreted without inflation.
   
   But the larger the file, the more likely it is to be compressed. Can we decompress only a single "row group" (and only the relevant columns, of course) on the fly yet?




[GitHub] [arrow-julia] Moelf commented on issue #340: Feather file with compression and larger than RAM

Posted by GitBox <gi...@apache.org>.
Moelf commented on issue #340:
URL: https://github.com/apache/arrow-julia/issues/340#issuecomment-1270485419

   This is exactly what we do in UnROOT.jl for a physics-community-only format called `TTree`; their next-gen storage, called `RNTuple`, is basically `Feather`: https://indico.cern.ch/event/1208767/contributions/5083082/attachments/2523220/4340111/PPP_uproot_RNTuple.pdf#page=13
   
   While we will get there eventually, in `UnROOT` I already have the whole machinery, basically:
   - `getindex` -> find the row group containing the requested index ->
       - if it is not in the cache, decompress it and put it in the cache
       - if it is in the cache, locate the slot directly
   
   This way, at most one row group's worth of data ever lives in RAM. In fact, that is the minimum you need in RAM, because you only know the start and end row numbers for an entire row group and have to count inside it.
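   
   A minimal sketch of that lazy, cached per-row-group access (not UnROOT.jl's or Arrow.jl's actual implementation; the field layout and the `readgroup` callback are made up for illustration):
   
   ```julia
   # Hypothetical lazy column: rows are fetched by decompressing one row group
   # at a time, and at most one decompressed row group is kept in the cache.
   struct LazyColumn{T}
       readgroup::Function           # rg index -> Vector{T} (read + inflate one row group)
       offsets::Vector{Int}          # first row index of each row group, 1-based
       cache::Dict{Int,Vector{T}}    # row-group index => decompressed values
   end

   function Base.getindex(col::LazyColumn{T}, i::Int) where {T}
       rg = searchsortedlast(col.offsets, i)   # which row group holds row i
       if !haskey(col.cache, rg)
           empty!(col.cache)                   # evict: keep at most one row group in RAM
           col.cache[rg] = col.readgroup(rg)
       end
       return col.cache[rg][i - col.offsets[rg] + 1]   # local offset inside the row group
   end

   # Toy usage: two "row groups" of 3 rows each; a real `readgroup` would read
   # and decompress one row group's buffers from disk.
   col = LazyColumn{Int}(rg -> collect((3rg - 2):(3rg)), [1, 4], Dict{Int,Vector{Int}}())
   col[5]   # inflates only row group 2, returns 5
   ```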
   
   But yeah, this is a whole subsystem in UnROOT.jl, and it's mission-critical because our data are routinely O(100) GB compressed.




[GitHub] [arrow-julia] quinnj commented on issue #340: Feather file with compression and larger than RAM

Posted by GitBox <gi...@apache.org>.
quinnj commented on issue #340:
URL: https://github.com/apache/arrow-julia/issues/340#issuecomment-1270465872

   Hmmmm... we'll have to see what we can do here. I've had the idea for a while of a Tables.jl-wide feature to support projection/filter pushdown for sources in a generic way. That would translate really well to Arrow and would let us more easily avoid decompressing when not necessary. There's probably more we can do in the short term, though, to avoid materializing when not needed.
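   
   Purely as an illustration of the idea (none of this exists in Tables.jl or Arrow.jl today; `ColumnarSource`, `scan`, and `keepgroup` are made-up names), generic projection/filter pushdown over a chunked columnar source might look roughly like this:
   
   ```julia
   struct ColumnarSource{T<:NamedTuple}
       rowgroups::Vector{T}   # each element holds one row group's columns
   end

   # Only the requested columns of each row group are touched (projection), and a
   # row-group-level predicate can skip groups entirely (filter pushdown), so the
   # buffers of unused columns/groups would never need to be decompressed.
   function scan(src::ColumnarSource, select::Tuple{Vararg{Symbol}}; keepgroup = _ -> true)
       rows = NamedTuple[]
       for rg in src.rowgroups
           keepgroup(rg) || continue                           # skip the whole row group
           cols = NamedTuple{select}(map(s -> rg[s], select))  # column projection
           for i in 1:length(first(cols))
               push!(rows, map(c -> c[i], cols))
           end
       end
       return rows
   end

   # Usage: two row groups; only column :a is ever read.
   src = ColumnarSource([(a = [1, 2], b = ["x", "y"]), (a = [3, 4], b = ["z", "w"])])
   scan(src, (:a,))   # => [(a = 1,), (a = 2,), (a = 3,), (a = 4,)]
   ```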




[GitHub] [arrow-julia] JoaoAparicio commented on issue #340: Feather file with compression and larger than RAM

Posted by "JoaoAparicio (via GitHub)" <gi...@apache.org>.
JoaoAparicio commented on issue #340:
URL: https://github.com/apache/arrow-julia/issues/340#issuecomment-1493485143

   > Hmmmm... we'll have to see what we can do here. I've had the idea for a while of a Tables.jl-wide feature to support projection/filter pushdown for sources in a generic way. That would translate really well to Arrow and would let us more easily avoid decompressing when not necessary. There's probably more we can do in the short term, though, to avoid materializing when not needed.
   
   Ping for comments on #412.
   This isn't the most general filter pushdown, but it does let us avoid unnecessary decompression.

