Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/24 09:56:46 UTC

[GitHub] [arrow] yli1994 opened a new issue, #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

yli1994 opened a new issue, #14726:
URL: https://github.com/apache/arrow/issues/14726

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   Hi,
   
   Does the memory map not work? I read a ~200G Parquet file with `pq.read_table(..., memory_map=True)`, but memory usage still grows to roughly the file size and reading is slow.
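
   For reference, a minimal sketch of the call in question (the file name is a placeholder):
   
   ```python
   import pyarrow.parquet as pq
   
   # memory_map=True maps the file instead of buffering reads, but
   # decoding the compressed Parquet pages still materializes the full
   # table in newly allocated memory.
   table = pq.read_table("data.parquet", memory_map=True)
   ```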
   
   
   ### Component
   
   Parquet


[GitHub] [arrow] yli1994 commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

yli1994 commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1334688842

   > If you want to reduce memory usage when reading a file, you should not read it as an entire table, but as a sequence of batches. See here: https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
   
   Thank you for your reply! I am confused about how Hugging Face's datasets library (which uses pyarrow as its backend and Parquet as its file format) can load data without increasing memory consumption.
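
   If I understand correctly, the likely explanation is the on-disk format: Hugging Face datasets converts data to uncompressed Arrow IPC files and memory-maps those, so the table's buffers are used zero-copy straight from the mapped file, whereas compressed Parquet always has to be decoded into fresh memory. A sketch of the IPC side, under that assumption (the file name is a placeholder):
   
   ```python
   import pyarrow as pa
   
   # An uncompressed Arrow IPC file can be memory-mapped and read
   # zero-copy: the buffers point into the mapped file, and the OS
   # pages data in on demand instead of the process heap growing.
   source = pa.memory_map("data.arrow", "r")
   table = pa.ipc.open_file(source).read_all()
   
   # Arrow's allocator has done almost no work here.
   print(pa.total_allocated_bytes())
   ```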


[GitHub] [arrow] yli1994 commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

yli1994 commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1326961628

   > Parquet files are written with compression turned on by default, which means that the size on disk is usually much smaller (depending on the data, several times smaller!) than the actual in-memory size of the data.
   > 
   > Can you confirm whether the file is written with compression?
   > 
   > cc @jorisvandenbossche
   
   Hi @assignUser,
   
   I wrote the Parquet file with both "snappy" and "zstd" compression, and the sizes are 202G and 158G respectively. What I expected, though, is that the "memory map" reading method should not increase the memory occupied.
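
   For context, a sketch of how those two files would have been written (the table here is a small stand-in):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pa.table({"x": list(range(1000))})  # stand-in for the real 200G table
   
   # Same data written with two codecs; the on-disk sizes differ, but
   # either way read_table has to decompress the pages into newly
   # allocated memory, so memory mapping alone cannot keep usage at
   # the on-disk size.
   pq.write_table(table, "data_snappy.parquet", compression="snappy")
   pq.write_table(table, "data_zstd.parquet", compression="zstd")
   ```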


[GitHub] [arrow] assignUser commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

assignUser commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1326473631

   Parquet files are written with compression turned on by default, which means that the size on disk is usually much smaller (depending on the data, several times smaller!) than the actual in-memory size of the data.
   
   Can you confirm whether the file is written with compression?
   
   cc @jorisvandenbossche 
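
   For what it's worth, one way to check that from the file metadata (the file name is a placeholder):
   
   ```python
   import pyarrow.parquet as pq
   
   # The codec is recorded per column chunk in the Parquet footer;
   # 'UNCOMPRESSED' means no compression was used.
   meta = pq.ParquetFile("data.parquet").metadata
   print(meta.row_group(0).column(0).compression)
   ```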


[GitHub] [arrow] pitrou commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

pitrou commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1332020913

   > Does the memory map not work?
   
   Why do you think it is not working? As long as the memory is not needed by anything else, your operating system simply chooses to keep the file in cache AFAIU.
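
   One way to see the distinction (the file name is a placeholder): Arrow's allocator reports only what it actually allocated, while process-level tools also count the cached pages of the mapped file.
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pq.read_table("data.parquet", memory_map=True)
   
   # Bytes Arrow's memory pool allocated for the decoded table. Tools
   # like `top` report more, because pages of the mapped file that the
   # OS keeps cached are counted against the process too, even though
   # the OS can reclaim them whenever the memory is needed elsewhere.
   print(pa.total_allocated_bytes())
   ```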


[GitHub] [arrow] pitrou commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

pitrou commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1332023174

   If you want to reduce memory usage when reading a file, you should not read it as an entire table, but as a sequence of batches. See here: https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
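
   A minimal sketch of that batch-wise approach (file name and batch size are placeholders):
   
   ```python
   import pyarrow.parquet as pq
   
   # Iterate over the file in fixed-size record batches; peak memory
   # is bounded by the batch size (plus a decompressed row group)
   # rather than by the whole table.
   pf = pq.ParquetFile("data.parquet")
   for batch in pf.iter_batches(batch_size=65536):
       ...  # per-batch processing goes here
   ```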


[GitHub] [arrow] jorisvandenbossche commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

jorisvandenbossche commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1334867906

   > It might be nice to have a higher-level function to read a Parquet file as a stream of batches by the way. 
   
   There are already two options, I think:
   
   - With the parquet module, using `pq.ParquetFile(..).iter_batches()`
   - With the dataset module, using `ds.dataset(...).to_batches()` (a sketch of this one follows below)
   
   The first one should probably be added to the documentation section you linked to, though.
   
   Or are you thinking of something even more high-level? (something like `pq.read_batches` in addition to `pq.read_table`)
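
   A sketch of the dataset option (the path is a placeholder):
   
   ```python
   import pyarrow.dataset as ds
   
   # Stream record batches from a Parquet dataset instead of
   # materializing the whole table; only one batch is held in
   # memory at a time.
   dataset = ds.dataset("data.parquet", format="parquet")
   for batch in dataset.to_batches():
       ...  # per-batch processing goes here
   ```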
   
   


[GitHub] [arrow] yli1994 commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

yli1994 commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1334690052

   > If you want to reduce memory usage when reading a file, you should not read it as an entire table, but as a sequence of batches. See here: https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
   
   Also, LMDB (which uses memory mapping; I am not sure of the difference between LMDB's mmap and Arrow's mmap) can read directly from disk. Thanks a lot for your answers!


[GitHub] [arrow] pitrou commented on issue #14726: pq.read_table("parquet files path", memory_map=True) still consumes large memory space (a 200G file costs 200G of memory and is slow)

pitrou commented on issue #14726:
URL: https://github.com/apache/arrow/issues/14726#issuecomment-1332025764

   It might be nice to have a higher-level function to read a Parquet file as a stream of batches by the way. @jorisvandenbossche 
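
   A sketch of what such a wrapper might look like (`read_batches` here is hypothetical, not an existing pyarrow function):
   
   ```python
   import pyarrow.parquet as pq
   
   def read_batches(path, **kwargs):
       """Hypothetical wrapper: yield record batches from a Parquet
       file without materializing the whole table."""
       yield from pq.ParquetFile(path).iter_batches(**kwargs)
   ```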


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org