Posted to dev@arrow.apache.org by Yue Ni <ni...@gmail.com> on 2020/06/23 03:52:45 UTC

Feather v2 random access

Hi there,

I am evaluating Feather v2 as an on-disk format for some data that needs
random access. I ran some experiments to measure performance, but since
there are too many scenarios to verify individually, I am looking for
details about how it works internally, in particular its random access
behavior and compression/decompression, to understand whether it satisfies
my requirements. I have not been able to find any documentation describing
this. Could someone shed some light on it?

So far I have read some Arrow source code and PRs, such as the two below,
but I still have little idea how it works internally (likely because I am
not familiar with FlatBuffers):
* ARROW-300: [Format] Proposal for "trivial" IPC body buffer compression
using either LZ4 or ZSTD codecs, https://github.com/apache/arrow/pull/6707
* ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC
file format, https://github.com/apache/arrow/pull/6694

I would like to understand how, in general, Feather v2 performs
decompression when randomly accessed via mmap. Some specific questions:
1) If a feather file contains multiple columns, are they compressed
separately? I assume each column is compressed separately, so that instead
of decompressing the entire feather file, only the accessed column is
decompressed; is that correct?
2) If a particular column value is randomly accessed via the column array's
index using mmap, will the entire column be decompressed? I assume
only a portion of the column will be decompressed; is this correct?
3) If only part of the column is decompressed, what is the mechanism for
caching the decompressed data? For example, if we access 10 contiguous
array values, do we need to decompress the column (or part of the column)
multiple times? What kind of access pattern might be unfriendly to this
caching mechanism?
4) If there is an internal caching mechanism, is there any way users or
developers could tune the cache for different scenarios? For example, some
fields may store large text data that needs a bigger cache.

Besides the questions above, I would like to learn more details about
this, and it would be great if someone could point me to any documentation
or the parts of the source code that I should check out. Any help is
appreciated. Thanks.

Regards,
Yue

Re: Feather v2 random access

Posted by Yue Ni <ni...@gmail.com>.
Hi François,

Thanks so much for the very detailed explanation, and that makes sense to
me. I will check out the links for more information.

@Wes,
ARROW-8250 is very useful to me as well and I will keep an eye on it.
Thanks.

On Wed, Jun 24, 2020 at 11:08 PM Wes McKinney <we...@gmail.com> wrote:


Re: Feather v2 random access

Posted by Wes McKinney <we...@gmail.com>.
See also this JIRA regarding adding random access read APIs for IPC
files (and thus Feather)

https://issues.apache.org/jira/browse/ARROW-8250

I hope to see this implemented someday.

On Wed, Jun 24, 2020 at 10:03 AM Francois Saint-Jacques
<fs...@gmail.com> wrote:

Re: Feather v2 random access

Posted by Francois Saint-Jacques <fs...@gmail.com>.
I forgot to mention that you can see how this is glued together in
`feather::Reader::Read` [1]. This makes it clear that nothing is
cached and everything is loaded into memory.

François

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/feather.cc#L715-L723

On Wed, Jun 24, 2020 at 10:53 AM Francois Saint-Jacques
<fs...@gmail.com> wrote:

Re: Feather v2 random access

Posted by Francois Saint-Jacques <fs...@gmail.com>.
Hello Yue,

Feather V2 is just a facade for the Arrow IPC file format. You can find
the implementation here [1]. I will try to answer your questions with
inline comments. At a high level, the file format writes a schema and
then multiple "chunks" called RecordBatches. Your lowest level of
granularity for fetching data is a RecordBatch [2]. Thus, a Table is
divided into multiple RecordBatches at write time, and the file stores a
series of such batches. When you read a file, you can either read the
whole table or do a point query on a single RecordBatch, e.g.
`RecordBatchFileReader::ReadRecordBatch(int i)`. If you use the
convenience API for reading the table in a single shot, e.g.
`feather::Reader::Read`, it will decompress all buffers and
materialize everything in memory.
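
The layout described above can be sketched with a stdlib-only toy (this is
NOT the real Arrow IPC format, which uses Flatbuffers metadata; the
`write_chunked`/`read_batch` helpers and the `(offset, length)` footer are
invented for illustration). The point is the access pattern: a footer at
the end of the file indexes a series of independently compressed batches,
so batch i can be read and decompressed without touching the others.

```python
import io
import struct
import zlib

def write_chunked(batches):
    """Write independently compressed batches followed by a footer of
    (offset, length) pairs and a 16-byte trailer locating the footer."""
    buf = io.BytesIO()
    index = []
    for batch in batches:
        payload = zlib.compress(batch)
        index.append((buf.tell(), len(payload)))
        buf.write(payload)
    footer_start = buf.tell()
    for off, length in index:
        buf.write(struct.pack("<QQ", off, length))
    # Trailer: footer offset + batch count, loosely mirroring how the IPC
    # file format places its footer at the end of the file.
    buf.write(struct.pack("<QQ", footer_start, len(index)))
    return buf.getvalue()

def read_batch(data, i):
    """Point query: decompress only batch i, leaving the rest untouched."""
    footer_start, n = struct.unpack_from("<QQ", data, len(data) - 16)
    off, length = struct.unpack_from("<QQ", data, footer_start + 16 * i)
    return zlib.decompress(data[off:off + length])

data = write_chunked([b"batch0" * 100, b"batch1" * 100])
assert read_batch(data, 1) == b"batch1" * 100
```

Note that, as in the real format, the granularity of the point query is a
whole batch: there is no way to decompress less than one batch's payload.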

If you use compression, it means copying and decompressing the
data. In other words, you'll have an RSS of roughly the mmap size plus
the decompressed size. If you don't use compression, the buffers will be
zero-copy slices of the mmap-ed memory and *could* be lazily loaded,
i.e. not paged in until the pointers are dereferenced. But this assumes
that the reader code doesn't dereference them, which might not always
hold; e.g. sometimes we call `{Array,RecordBatch,Table}::Validate` to
ensure well-formed arrays, and for some types this method reads the
buffers to verify that no segfault will happen at runtime.

IMHO, mmap and compression for the IPC file format are mutually
exclusive. If you use compression, you lose all the benefits of mmap
and you might as well disable mmap. If you want lazy loading and late
memory materialization (from disk), turn off compression.
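
The trade-off can be demonstrated with the stdlib alone (a toy setup, not
Arrow code: the file here just holds a raw region followed by a
zlib-compressed region). An uncompressed buffer can be a zero-copy view
over mmap-ed file memory, while a compressed buffer always allocates new
memory for its decompressed bytes:

```python
import mmap
import os
import tempfile
import zlib

raw = b"x" * 4096
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(raw)                 # "uncompressed buffer" region
    f.write(zlib.compress(raw))  # "compressed buffer" region

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Zero-copy: the memoryview slice references the mapped pages directly;
    # no data is duplicated until the bytes are actually touched.
    view = memoryview(mm)[:4096]
    # Copy + decompress: the result is a brand-new bytes object, so resident
    # memory grows by the decompressed size on top of the mapping.
    decompressed = zlib.decompress(mm[4096:])
    assert bytes(view) == decompressed == raw
    view.release()
    mm.close()
os.remove(path)
```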

> 1) If a feather file contains multiple columns, are they compressed
> separately? I assume each column is compressed separately, so that instead
> of decompressing the entire feather file, only the accessed column is
> decompressed; is that correct?

They are compressed separately [3]. The Reader will decompress all
columns of the requested batch, but you can pass an option [4] to
restrict reading to the columns of interest.
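
A minimal sketch of the idea (not Arrow's actual writer; the
`compress_columns`/`read_columns` helpers are invented for illustration):
because each column's buffer is compressed independently, a reader only
has to decompress the columns it was asked for.

```python
import zlib

def compress_columns(columns):
    """Compress each column's buffer independently, as the IPC writer
    compresses each body buffer on its own."""
    return {name: zlib.compress(buf) for name, buf in columns.items()}

def read_columns(compressed, names):
    # Only the requested columns are decompressed; the rest stay untouched.
    return {name: zlib.decompress(compressed[name]) for name in names}

table = {"id": b"\x01\x02\x03", "text": b"hello" * 1000}
stored = compress_columns(table)
subset = read_columns(stored, ["text"])
assert subset == {"text": b"hello" * 1000}
```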

> 2) If a particular column value is randomly accessed via the column array's
> index using mmap, will the entire column be decompressed? I assume
> only a portion of the column will be decompressed; is this correct?

The entire column of that RecordBatch will be decompressed (and stored
in memory). If your table consists of a single RecordBatch, then yes,
the whole column will be decompressed.

> 3) If only part of the column is decompressed, what is the mechanism for
> caching the decompressed data? For example, if we access 10 contiguous
> array values, do we need to decompress the column (or part of the column)
> multiple times? What kind of access pattern might be unfriendly to this
> caching mechanism?
> 4) If there is an internal caching mechanism, is there any way users or
> developers could tune the cache for different scenarios? For example, some
> fields may store large text data that needs a bigger cache.

There is no caching; the RecordBatchReader yields a fully materialized
batch, and it is up to the caller to decide how to manage the lifetime
of that batch.
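
Since caching is left entirely to the caller, one option is to layer a
user-side cache over the point-read API. A toy sketch, where `BATCHES` and
`load_batch` are invented stand-ins for compressed RecordBatches and an
expensive point read + decompress:

```python
import zlib
from functools import lru_cache

# Pretend storage: independently compressed "record batches".
BATCHES = [zlib.compress(bytes([i]) * 1024) for i in range(4)]
CALLS = {"n": 0}

@lru_cache(maxsize=2)  # tune per workload, e.g. larger for big text columns
def load_batch(i):
    CALLS["n"] += 1  # counts actual decompressions, i.e. cache misses
    return zlib.decompress(BATCHES[i])

load_batch(0)
load_batch(0)  # served from the cache: no second decompression
load_batch(1)
assert CALLS["n"] == 2
```

This also answers question 4: cache sizing policy lives in your code, not
in the reader, so it can differ per use case.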

Long story short:
- it seems that you want lazy materialization via mmap to control
active memory usage; this is not going to work with compression.
- if you use the reader's ReadTable interface (instead of a stream
reader), you get a _fully_ materialized table, i.e. every RecordBatch
is decompressed.

The Feather public API loads the whole table; you will need to work
with the IPC interface if you want to do stream reading.

François

[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/ipc
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L65-L90
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L113-L255
[4] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L85