
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/09 02:11:31 UTC

[GitHub] [arrow] hengaini2055 opened a new issue #10899: Does feather file identify with pyarrow.memory_map(file)?

hengaini2055 opened a new issue #10899:
URL: https://github.com/apache/arrow/issues/10899


   I understand that pyarrow can read different file formats such as CSV, Parquet, and others. Is the Feather file format the same as CSV or Parquet? According to [Zero memory](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a), a memory-mapped file is used. Is the Feather file format the same as a memory-mapped file?
   I want to use Arrow to develop a BI project, and I am unsure about the following issues:
   
   1. Zero memory! When I use a larger-than-RAM dataset, must the data on disk be a Feather file or a memory-mapped file?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] hengaini2055 commented on issue #10899: Does feather file identify with pyarrow.memory_map(file)?

Posted by GitBox <gi...@apache.org>.

hengaini2055 commented on issue #10899:
URL: https://github.com/apache/arrow/issues/10899#issuecomment-895609491


   @lidavidm Thanks! How can I memory-map a Parquet file? I want to get 'zero copy' reads from a directory of files used as a database (pyarrow). In Microsoft Power BI, we must read the whole dataset into memory and then process it, so memory becomes the bottleneck. Should I store uncompressed Feather files (memory-mapped, named *.arrow) in the directory to gain the 'zero copy' benefits?



[GitHub] [arrow] lidavidm commented on issue #10899: Does feather file identify with pyarrow.memory_map(file)?

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #10899:
URL: https://github.com/apache/arrow/issues/10899#issuecomment-895294381

The Feather (V2) file format, also known as the Arrow IPC file format, is neither CSV nor Parquet, but rather Arrow's format for data on disk. See this [FAQ entry](https://arrow.apache.org/faq/#what-is-the-difference-between-apache-arrow-and-apache-parquet) as well as the one immediately after it.

The article you linked describes using uncompressed Feather/Arrow IPC files. Without compression, the layout of the data on disk is the same as the layout in memory, as defined by the Arrow specification, so you can memory-map the file and use it as-is. Of course, you can still memory-map a Parquet or CSV file - but you will have to decode the data first, which carries overhead. (This may still be manageable, e.g. you could decode and process one row group of a Parquet file at a time, but you won't gain the 'zero copy' benefits.)

For analysis of data, it depends. PyArrow has some [compute functions](https://arrow.apache.org/docs/python/api/compute.html) available and they may be sufficient for your needs. Otherwise, you can convert to pandas or some other format as needed, but of course this increases your memory usage.


[GitHub] [arrow] hengaini2055 closed issue #10899: Does feather file identify with pyarrow.memory_map(file)?

Posted by GitBox <gi...@apache.org>.

hengaini2055 closed issue #10899:
URL: https://github.com/apache/arrow/issues/10899


   



[GitHub] [arrow] lidavidm commented on issue #10899: Does feather file identify with pyarrow.memory_map(file)?

Posted by GitBox <gi...@apache.org>.

lidavidm commented on issue #10899:
URL: https://github.com/apache/arrow/issues/10899#issuecomment-896009602


   So to be clear: Parquet, CSV, and Feather are all different file formats, and memory-mapping is just one way to read a file. 
   
   In the case of uncompressed Feather files *only*, you can get zero-copy reads. Again, this is because the format on disk is in this case the same as the format in memory. So if memory is your bottleneck, it sounds like this is likely your best choice.
   
   Otherwise, memory-mapping may be faster (or slower!) than just reading the file. You cannot and will not get zero-copy from a Parquet file. However, Parquet files can be read incrementally, so you may still be able to make this work - it will just take more effort.



[GitHub] [arrow] hengaini2055 edited a comment on issue #10899: Does feather file identify with pyarrow.memory_map(file)?

Posted by GitBox <gi...@apache.org>.

hengaini2055 edited a comment on issue #10899:
URL: https://github.com/apache/arrow/issues/10899#issuecomment-895618846


   The [datasets library](https://github.com/huggingface/datasets/blob/171f2bba9dd8b92006b13cf076a5bf31d67d3e69/src/datasets/table.py#L42) uses ```pa.memory_map(filename)``` to create a memory-mapped pa.Table. Can the file be a Parquet file, a CSV file, or a *.arrow (Feather) file? As you said, "Of course, you can still memory-map a Parquet or CSV file", so maybe!



[GitHub] [arrow] hengaini2055 commented on issue #10899: Does feather file identify with pyarrow.memory_map(file)?

Posted by GitBox <gi...@apache.org>.

hengaini2055 commented on issue #10899:
URL: https://github.com/apache/arrow/issues/10899#issuecomment-894911025


   2. When I analyse the dataset, must I convert the pa.Table to pandas or Dask?


