Posted to user@arrow.apache.org by Sam Shleifer <ss...@gmail.com> on 2021/02/18 03:11:31 UTC

[Python] Saving ChunkedArray to disk and reading with flight

*My goal*

I have a list of numpy arrays of uneven length. From the docs, I guess the right format for this is ChunkedArray

I want to save my list to disk in one process, and then start many new processes (a pytorch dataloader) that are able to read chunks from the file with low memory overhead.

The current solution is to flatten the array, keep a list of the lengths/offsets, store the flattened array in `np.memmap`, then have each process slice into the memmap at the right index.
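
For concreteness, a minimal sketch of that approach (file name, dtype, and
sizes are just illustrative):

    import numpy as np

    arrays = [np.random.randn(n).astype(np.float32) for n in (3, 5, 2)]

    # Writer process: flatten and remember the offsets.
    lengths = [len(a) for a in arrays]
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    flat = np.concatenate(arrays)
    mm = np.memmap("data.flat", dtype=np.float32, mode="w+", shape=flat.shape)
    mm[:] = flat
    mm.flush()

    # Reader process: map the file read-only and slice out one entry.
    mm = np.memmap("data.flat", dtype=np.float32, mode="r", shape=(offsets[-1],))
    entry = mm[offsets[1]:offsets[2]]  # the second array, without copying the rest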

It seems that with arrow, we can at least delete the list of lengths/offsets.

*What I have tried:*

padding each entry in the list to a fixed length, and saving pa.Table to pa.NativeFile. Each process reads its own pa.Table. This is slower and less memory efficient than `memmap` by about 15%.
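
Roughly along these lines (a sketch only; the exact column layout and file
name are approximations, not the exact code I ran):

    import numpy as np
    import pyarrow as pa

    arrays = [np.random.randn(n).astype(np.float32) for n in (3, 5, 2)]
    max_len = max(len(a) for a in arrays)
    flat = np.concatenate([np.pad(a, (0, max_len - len(a))) for a in arrays])

    # One fixed-size-list entry per original (padded) array.
    col = pa.FixedSizeListArray.from_arrays(pa.array(flat), max_len)
    table = pa.table({"data": col})

    with pa.OSFile("padded.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)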

*My questions:*

1) Are there any examples online that do this sort of operation? I can't find an example of saving a ChunkedArray to disk, or a Python Flight example, after a few googles.

2) Is it unreasonable to think this will use less memory than np.memmap?

Thanks in advance!

Sam

Re: [Python] Saving ChunkedArray to disk and reading with flight

Posted by Wes McKinney <we...@gmail.com>.
On the "This is slower and less memory efficient than `memmap` by about
15%." -- if you can show us more precisely what code you have written that
will help us advise you. In principle if you are using pyarrow.memory_map
the performance / memory use shouldn't be significantly different
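
For reference, a memory-mapped read of an Arrow IPC file looks roughly like
this (file name is illustrative):

    import pyarrow as pa

    source = pa.memory_map("padded.arrow", "rb")
    table = pa.ipc.open_file(source).read_all()
    # The table's buffers reference the mapped file rather than heap copies,
    # so per-process memory overhead should stay low.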

On Wed, Feb 17, 2021 at 9:57 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Sam,
> Could you elaborate on what advantages you were hoping to gain from
> Arrow?  It seems like the process you describe is probably close to optimal
> (I have limited knowledge of np.memmap). And there could be alternative
> suggestions based on the exact shape of your data and how you want to
> process it.  I added some more comments inline below.
>
> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in  `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with arrow, we can at least delete the list of
>> lengths/offsets.
>
> In Arrow it seems like the natural fit here is to use a ListArray wrapped
> around the numpy arrays. This would add back in the indices/offsets.
>
> padding each entry in the list to a fixed length, and saving pa.Table to
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>
> How are you reading back the file?  Are you using MemoryMappedFile [1]?
>
> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save chunked array to disk, or a python Flight example after a
>> few googles.
>
> ChunkedArrays aren't a first-class citizen in the Arrow File Format
> specification.  Working through Tables, which get converted to RecordBatches
> when saving, is all that is supported.
>
>
> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> I'm not familiar with np.memmap, so I can't really say.
>
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow
>
>
>
> On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <ss...@gmail.com> wrote:
>
>> *My goal*
>> I have a list of numpy arrays of uneven length. From the docs, I guess
>> the right format for this is ChunkedArray
>> I want to save my list to disk in one process, and then start many new
>> processes (a pytorch dataloader) that are able to read chunks from the file
>> with low memory overhead.
>> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in  `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with arrow, we can at least delete the list of
>> lengths/offsets.
>>
>> *What I have tried:*
>> padding each entry in the list to a fixed length, and saving pa.Table to
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>>
>> *My questions:*
>> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save chunked array to disk, or a python Flight example after a
>> few googles.
>> 2) Is it unreasonable to think this will use less memory than np.memmap?
>>
>> Thanks in advance!
>> Sam
>>
>>

Re: [Python] Saving ChunkedArray to disk and reading with flight

Posted by Micah Kornfield <em...@gmail.com>.
Hi Sam,
Could you elaborate on what advantages you were hoping to gain from
Arrow?  It seems like the process you describe is probably close to optimal
(I have limited knowledge of np.memmap). And there could be alternative
suggestions based on the exact shape of your data and how you want to
process it.  I added some more comments inline below.

The current solution is to flatten the array, keep a list of the
> lengths/offsets, store the flattened array in  `np.memmap`, then have each
> process slice into the memmap at the right index.
> It seems that with arrow, we can at least delete the list of
> lengths/offsets.

In Arrow it seems like the natural fit here is to use a ListArray wrapped
around the numpy arrays. This would add back in the indices/offsets.
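
A minimal sketch of that idea (names and file name are illustrative):

    import numpy as np
    import pyarrow as pa

    arrays = [np.random.randn(n).astype(np.float32) for n in (3, 5, 2)]
    values = pa.array(np.concatenate(arrays))
    offsets = np.concatenate([[0], np.cumsum([len(a) for a in arrays])])
    list_arr = pa.ListArray.from_arrays(pa.array(offsets.astype(np.int32)), values)

    table = pa.table({"data": list_arr})
    with pa.OSFile("lists.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)
    # list_arr[i] recovers the i-th original array; the offsets live inside
    # the ListArray rather than in a separate structure.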

padding each entry in the list to a fixed length, and saving pa.Table to
> pa.NativeFile. Each process reads its own pa.Table. This is slower and
> less memory efficient than `memmap` by about 15%.

How are you reading back the file?  Are you using MemoryMappedFile [1]?

1) Are there any examples online that do this sort of operation? I can't
> find how to save chunked array to disk, or a python Flight example after a
> few googles.

ChunkedArrays aren't a first-class citizen in the Arrow File Format
specification.  Working through Tables, which get converted to RecordBatches
when saving, is all that is supported.
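
Roughly, the supported path is: put the ChunkedArray in a Table, write the
Table (it is stored as RecordBatches), and read back a Table whose columns
are again ChunkedArrays. A small sketch (file name is illustrative):

    import pyarrow as pa

    chunked = pa.chunked_array([[1, 2, 3], [4, 5]])
    table = pa.table({"x": chunked})

    with pa.OSFile("chunks.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)  # written as record batches

    restored = pa.ipc.open_file(pa.memory_map("chunks.arrow", "rb")).read_all()
    # restored.column("x") is a ChunkedArray again, one chunk per record batch.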


2) Is it unreasonable to think this will use less memory than np.memmap?

I'm not familiar with np.memmap, so I can't really say.


[1] https://arrow.apache.org/docs/python/generated/pyarrow



On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <ss...@gmail.com> wrote:

> *My goal*
> I have a list of numpy arrays of uneven length. From the docs, I guess the
> right format for this is ChunkedArray
> I want to save my list to disk in one process, and then start many new
> processes (a pytorch dataloader) that are able to read chunks from the file
> with low memory overhead.
> The current solution is to flatten the array, keep a list of the
> lengths/offsets, store the flattened array in  `np.memmap`, then have each
> process slice into the memmap at the right index.
> It seems that with arrow, we can at least delete the list of
> lengths/offsets.
>
> *What I have tried:*
> padding each entry in the list to a fixed length, and saving pa.Table to
> pa.NativeFile. Each process reads its own pa.Table. This is slower and
> less memory efficient than `memmap` by about 15%.
>
> *My questions:*
> 1) Are there any examples online that do this sort of operation? I can't
> find how to save chunked array to disk, or a python Flight example after a
> few googles.
> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> Thanks in advance!
> Sam
>
>