Posted to user@arrow.apache.org by Jonathan Yu <jo...@gmail.com> on 2020/10/08 02:33:05 UTC

write_feather, new_file, and compression

Hello there,

I am using Arrow to store data on disk temporarily, so disk space is not a
problem (I understand that Parquet is preferable for more efficient disk
storage). It seems that Arrow's memory mapping/zero copy capabilities would
provide better performance given this use case.

Here are my questions:

1. For new applications, should we prefer the pa.ipc.new_file interface
over write_feather? My understanding from reading [0] is that
pa.feather.write_feather is an API provided for backward compatibility, and
with compression disabled, it seems to produce files of the same size (the
files appear to be identical) as the RecordBatchFileWriter.

2. Does compression affect the need to make copies? I imagine that
compressing the file means that the code to use the file cannot be
zero-copy anymore.

3. When using pandas to analyze the data, is there a way to load the data
using memory mapping, and if so, would this be expected to improve
deserialization performance and memory utilization if multiple processes
are reading the same table data simultaneously? Assume that I'm running on
a modern server-class SSD.

Thank you!

Jonathan

[0] https://arrow.apache.org/faq/#what-about-the-feather-file-format

Re: write_feather, new_file, and compression

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Oct 7, 2020 at 9:33 PM Jonathan Yu <jo...@gmail.com> wrote:
> 1. For new applications, should we prefer the pa.ipc.new_file interface over write_feather? My understanding from reading [0] is that pa.feather.write_feather is an API provided for backward compatibility, and with compression disabled, it seems to produce files of the same size (the files appear to be identical) as the RecordBatchFileWriter.
>

You can use either; neither API is deprecated, nor is there any plan to deprecate them.
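
For example, with compression disabled both code paths produce the same Arrow IPC file format (which is what Feather V2 is). A minimal sketch, with placeholder file names:

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # Feather V2 with compression disabled
    feather.write_feather(table, "data.feather", compression="uncompressed")

    # The same table written through the IPC file API
    # (pa.ipc.new_file returns a RecordBatchFileWriter)
    with pa.ipc.new_file("data.arrow", table.schema) as writer:
        writer.write_table(table)

Either file can then be read back with pa.ipc.open_file or feather.read_table.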

> 2. Does compression affect the need to make copies? I imagine that compressing the file means that the code to use the file cannot be zero-copy anymore.
>

Right, when compression is used, zero copy is by definition not possible.
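
To illustrate the difference (a rough sketch, file names are placeholders): an uncompressed file can be memory-mapped and read without copying the buffers, while a compressed one must be decompressed into newly allocated memory on read.

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"x": list(range(1_000_000))})

    # Compressed: buffers are decompressed into fresh memory
    # when read, so zero copy is not possible.
    feather.write_feather(table, "data_zstd.feather", compression="zstd")

    # Uncompressed: reading through a memory map lets the Arrow
    # buffers point directly at the mapped pages.
    feather.write_feather(table, "data_plain.feather",
                          compression="uncompressed")
    with pa.memory_map("data_plain.feather", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()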

> 3. When using pandas to analyze the data, is there a way to load the data using memory mapping, and if so, would this be expected to improve deserialization performance and memory utilization if multiple processes are reading the same table data simultaneously? Assume that I'm running on a modern server-class SSD.
>

No, pandas doesn't support memory mapping.
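
That said, you can still memory-map the file at the Arrow level and convert afterwards; the conversion to pandas generally materializes a copy, so the zero-copy benefit applies only while the data stays in Arrow. A rough sketch, reusing the placeholder file name from above:

    import pyarrow as pa

    # Multiple processes mapping the same uncompressed IPC file
    # share the underlying pages through the OS page cache.
    with pa.memory_map("data_plain.feather", "r") as source:
        table = pa.ipc.open_file(source).read_all()

    # to_pandas() generally copies the data into process memory;
    # pandas itself has no notion of the memory map.
    df = table.to_pandas()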
