Posted to user@arrow.apache.org by Jacob Zelko <ja...@gmail.com> on 2020/10/22 18:38:55 UTC

Does Arrow Support Larger-than-Memory Handling?

Hi all,

Very basic question as I have seen conflicting sources. I come from the
Julia community and was wondering if Arrow can handle larger-than-memory
datasets? I saw this post by Wes McKinney here discussing that the tooling
is being laid down:

Table columns in Arrow C++ can be chunked, so that appending to a table is
a zero copy operation, requiring no non-trivial computation or memory
allocation. By designing up front for streaming, chunked tables, appending
to existing in-memory tables is computationally inexpensive relative to
pandas now. Designing for chunked or streaming data is also essential for
implementing out-of-core algorithms, so we are also laying the foundation
for processing larger-than-memory datasets.

~ *Apache Arrow and the “10 Things I Hate About pandas”*
<https://wesmckinney.com/blog/apache-arrow-pandas-internals/>

And then in the docs I saw this:

The pyarrow.dataset module provides functionality to efficiently work with
tabular, potentially larger than memory and multi-file datasets:

   - A unified interface for different sources: supporting different
   sources and file formats (Parquet, Feather files) and different file
   systems (local, cloud).
   - Discovery of sources (crawling directories, handle directory-based
   partitioned datasets, basic schema normalization, ..)
   - Optimized reading with predicate pushdown (filtering rows), projection
   (selecting columns), parallel reading or fine-grained managing of tasks.

Currently, only Parquet and Feather / Arrow IPC files are supported. The
goal is to expand this in the future to other file formats and data sources
(e.g. database connections).

~ *Tabular Datasets* <https://arrow.apache.org/docs/python/dataset.html>

The article from Wes was from 2017 and the snippet on Tabular Datasets is
from the current documentation for pyarrow.

Could anyone answer this question or at least clear up my confusion for me?
Thank you!
-- 
Jacob Zelko
Georgia Institute of Technology - Biomedical Engineering B.S. '20
Corning Community College - Engineering Science A.S. '17
Cell Number: (607) 846-8947

Re: Does Arrow Support Larger-than-Memory Handling?

Posted by Jacob Quinn <qu...@gmail.com>.
Hi Jacob,

Yes, the arrow format allows for larger-than-memory datasets. I can
describe a little what this looks like on the Julia side of things, which
should be pretty similar in other languages.

When you write a dataset to the arrow format, either on disk or in memory,
you're laying the data + metadata down in a specific, (mostly)
self-describing format, with the actual data written in a pre-determined
binary layout for each supported type. Provisions are also made for writing
a long table in chunks, and for writing columns with a dictionary encoding
(which can "compress" a column with low cardinality).

What this allows when _reading_ is that you can "memory map" this entire
region of arrow memory to make it available to a running program: the OS
"gives you" the memory without necessarily loading the entire region into
RAM at once, and instead "swaps" requested regions into RAM as necessary.
For the Arrow.jl Julia package, reading a table or stream
starts with getting access to an arrow memory region: if a file, it calls
`Mmap.mmap(file)`, or you can pass a `Vector{UInt8}` directly. For
`Arrow.Stream`, it only reads the initial schema metadata, and then you can
iterate the stream to get the group of columns for each batch (chunk). For
`Arrow.Table`, it will process all record batches, "chaining" each chunk
together using a ChainedVector type for each column. When the columns are
"read", they're really just custom types that wrap a specific region of the
arrow memory along with the metadata type information and length. This
means no new memory (well, practically none) is allocated when creating one
of these ArrowVectors or chaining them together, but they still satisfy the
AbstractArray interface which allows all sorts of operations on the data.
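
To make that concrete, here's a minimal sketch of the two reading modes
(the file name "data.arrow" and the numeric column x are just placeholders):

    using Arrow, Mmap

    # Whole-table view: columns are ArrowVectors wrapping regions of the
    # mapped bytes, chained across record batches; (almost) nothing is
    # copied into RAM up front.
    tbl = Arrow.Table(Mmap.mmap("data.arrow"))
    col = tbl.x                     # an AbstractArray view over arrow memory

    # Streaming view: only the schema metadata is read eagerly; iterating
    # yields one record batch (chunk) of columns at a time.
    total = sum(sum(batch.x) for batch in Arrow.Stream("data.arrow"))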

The Arrow.jl package also defines the Tables.jl interface for
`Arrow.Table`, which means, for example, you can operate on an
arrow-memory-backed DataFrame just by doing `df =
DataFrame(Arrow.Table(file))`. This builds the `Arrow.Table` like I
described above, then the `DataFrame` constructor uses the memory-mapped
columns directly in its construction. You can then use all of the useful
functionality of DataFrames directly on these arrow columns. Other
Tables.jl-compatible packages are just as accessible:
SQLite.load!(db, "arrow_data", arrow_table) to load arrow data into an
SQLite database, CSV.write("arrow.csv", arrow_table) to write arrow data
out as a CSV file, MySQL.load!(db, "arrow_data", arrow_table) to load data
into a MySQL database table, and so on.
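
A short sketch of those hand-offs (paths are placeholders):

    using Arrow, DataFrames, CSV

    arrow_table = Arrow.Table("data.arrow")  # columns are views over arrow memory
    df = DataFrame(arrow_table)              # DataFrame built on those same columns
    first(df, 5)                             # ordinary DataFrames operations apply
    CSV.write("arrow.csv", arrow_table)      # hand the table to any Tables.jl sink

The same arrow_table can go straight into SQLite.load! or MySQL.load! as
mentioned above.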

Sorry for the diatribe, but I've actually been meaning to write a bunch of
this down for some enhanced documentation for the Arrow.jl package, so
consider this a teaser!

Hope that helps!

-Jacob

On Thu, Oct 22, 2020 at 12:39 PM Jacob Zelko <ja...@gmail.com> wrote:

> Hi all,
>
> Very basic question as I have seen conflicting sources. I come from the
> Julia community and was wondering if Arrow can handle larger-than-memory
> datasets? I saw this post by Wes McKinney here discussing that the tooling
> is being laid down:
>
> Table columns in Arrow C++ can be chunked, so that appending to a table is
> a zero copy operation, requiring no non-trivial computation or memory
> allocation. By designing up front for streaming, chunked tables, appending
> to existing in-memory tabler is computationally inexpensive relative to
> pandas now. Designing for chunked or streaming data is also essential for
> implementing out-of-core algorithms, so we are also laying the foundation
> for processing larger-than-memory datasets.
>
> ~ *Apache Arrow and the “10 Things I Hate About pandas”*
> <https://wesmckinney.com/blog/apache-arrow-pandas-internals/>
>
> And then in the docs I saw this:
>
> The pyarrow.dataset module provides functionality to efficiently work with
> tabular, potentially larger than memory and multi-file datasets:
>
>    - A unified interface for different sources: supporting different
>    sources and file formats (Parquet, Feather files) and different file
>    systems (local, cloud).
>    - Discovery of sources (crawling directories, handle directory-based
>    partitioned datasets, basic schema normalization, ..)
>    - Optimized reading with predicate pushdown (filtering rows),
>    projection (selecting columns), parallel reading or fine-grained managing
>    of tasks.
>
> Currently, only Parquet and Feather / Arrow IPC files are supported. The
> goal is to expand this in the future to other file formats and data sources
> (e.g. database connections).
>
> ~ *Tabular Datasets* <https://arrow.apache.org/docs/python/dataset.html>
>
> The article from Wes was from 2017 and the snippet on Tabular Datasets is
> from the current documentation for pyarrow.
>
> Could anyone answer this question or at least clear up my confusion for
> me? Thank you!
> --
> Jacob Zelko
> Georgia Institute of Technology - Biomedical Engineering B.S. '20
> Corning Community College - Engineering Science A.S. '17
> Cell Number: (607) 846-8947
>

Re: Does Arrow Support Larger-than-Memory Handling?

Posted by Wes McKinney <we...@gmail.com>.
Sure, anything is possible if you want to write the code to do it. You
could create a CompressedRecordBatch class where you only decompress a
field/column when you need it.
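
In Julia terms the idea is roughly the following sketch (none of these
names exist in Arrow today; the decompress function stands in for whatever
codec the record batch actually uses):

    # Hypothetical wrapper: keep the compressed bytes around and only
    # materialize a column the first time it is accessed.
    struct LazyColumn{T}
        decompress::Function                    # returns Vector{T} when called
        cache::Base.RefValue{Union{Nothing, Vector{T}}}
    end
    LazyColumn{T}(f) where {T} =
        LazyColumn{T}(f, Ref{Union{Nothing, Vector{T}}}(nothing))

    function column(c::LazyColumn{T}) where {T}
        c.cache[] === nothing && (c.cache[] = c.decompress())
        return c.cache[]::Vector{T}
    end

    Base.getindex(c::LazyColumn, i::Integer) = column(c)[i]
    Base.length(c::LazyColumn) = length(column(c))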

On Thu, Oct 22, 2020 at 4:05 PM Daniel Nugent <nu...@gmail.com> wrote:
>
> The biggest problem with mapped arrow data is that it's only possible with uncompressed Feather files. Is there ever a possibility that compressed files could be mappable (I know that you'd have to decompress a given RecordBatch to actually work with it, but Feather files should be comprised of many RecordBatches, right?)
>
> -Dan Nugent
>
>
> On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> I'm not sure where the conflict in what's written online is, but by
>> virtue of being designed such that data structures do not require
>> memory buffers to be RAM resident (i.e. can reference memory maps), we
>> are set up well to process larger-than-memory datasets. In C++ at
>> least we are putting the pieces in place to be able to do efficient
>> query execution on on-disk datasets, and it may already be possible in
>> Rust with DataFusion.
>>
>> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com> wrote:
>> >
>> > There are ways to handle datasets larger than memory.  mmap'ing one or more arrow files and going from there is a pathway forward here:
>> >
>> > https://techascent.com/blog/memory-mapping-arrow.html
>> >
>> > How this maps to other software ecosystems I don't know but many have mmap support.
>> >
>> > On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <ja...@gmail.com> wrote:
>> >>
>> >> I believe it would be good if you define your use case.
>> >>
>> >> I do handle larger than memory datasets with pyarrow with the use of
>> >> dataset.scan but my use case is very specific as I am repartitioning
>> >> and cleaning a bit large datasets.
>> >>
>> >> BR,
>> >>
>> >> Jacek
>> >>
>> >> czw., 22 paź 2020 o 20:39 Jacob Zelko <ja...@gmail.com> napisał(a):
>> >> >
>> >> > Hi all,
>> >> >
>> >> > Very basic question as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets? I saw this post by Wes McKinney here discussing that the tooling is being laid down:
>> >> >
>> >> > Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tabler is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
>> >> >
>> >> > ~ Apache Arrow and the “10 Things I Hate About pandas”
>> >> >
>> >> > And then in the docs I saw this:
>> >> >
>> >> > The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
>> >> >
>> >> > A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
>> >> > Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization, ..)
>> >> > Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
>> >> >
>> >> > Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
>> >> >
>> >> > ~ Tabular Datasets
>> >> >
>> >> > The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current documentation for pyarrow.
>> >> >
>> >> > Could anyone answer this question or at least clear up my confusion for me? Thank you!
>> >> >
>> >> > --
>> >> > Jacob Zelko
>> >> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
>> >> > Corning Community College - Engineering Science A.S. '17
>> >> > Cell Number: (607) 846-8947

Re: Does Arrow Support Larger-than-Memory Handling?

Posted by Daniel Nugent <nu...@gmail.com>.
The biggest problem with mapped arrow data is that it's only possible with
uncompressed Feather files. Is there any possibility that compressed files
could be made mappable? (I know you'd have to decompress a given
RecordBatch to actually work with it, but Feather files should be composed
of many RecordBatches, right?)

-Dan Nugent


On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney <we...@gmail.com> wrote:

> I'm not sure where the conflict in what's written online is, but by
> virtue of being designed such that data structures do not require
> memory buffers to be RAM resident (i.e. can reference memory maps), we
> are set up well to process larger-than-memory datasets. In C++ at
> least we are putting the pieces in place to be able to do efficient
> query execution on on-disk datasets, and it may already be possible in
> Rust with DataFusion.
>
> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com>
> wrote:
> >
> > There are ways to handle datasets larger than memory.  mmap'ing one or
> more arrow files and going from there is a pathway forward here:
> >
> > https://techascent.com/blog/memory-mapping-arrow.html
> >
> > How this maps to other software ecosystems I don't know but many have
> mmap support.
> >
> > On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <ja...@gmail.com>
> wrote:
> >>
> >> I believe it would be good if you define your use case.
> >>
> >> I do handle larger than memory datasets with pyarrow with the use of
> >> dataset.scan but my use case is very specific as I am repartitioning
> >> and cleaning a bit large datasets.
> >>
> >> BR,
> >>
> >> Jacek
> >>
> >> czw., 22 paź 2020 o 20:39 Jacob Zelko <ja...@gmail.com>
> napisał(a):
> >> >
> >> > Hi all,
> >> >
> >> > Very basic question as I have seen conflicting sources. I come from
> the Julia community and was wondering if Arrow can handle
> larger-than-memory datasets? I saw this post by Wes McKinney here
> discussing that the tooling is being laid down:
> >> >
> >> > Table columns in Arrow C++ can be chunked, so that appending to a
> table is a zero copy operation, requiring no non-trivial computation or
> memory allocation. By designing up front for streaming, chunked tables,
> appending to existing in-memory tabler is computationally inexpensive
> relative to pandas now. Designing for chunked or streaming data is also
> essential for implementing out-of-core algorithms, so we are also laying
> the foundation for processing larger-than-memory datasets.
> >> >
> >> > ~ Apache Arrow and the “10 Things I Hate About pandas”
> >> >
> >> > And then in the docs I saw this:
> >> >
> >> > The pyarrow.dataset module provides functionality to efficiently work
> with tabular, potentially larger than memory and multi-file datasets:
> >> >
> >> > A unified interface for different sources: supporting different
> sources and file formats (Parquet, Feather files) and different file
> systems (local, cloud).
> >> > Discovery of sources (crawling directories, handle directory-based
> partitioned datasets, basic schema normalization, ..)
> >> > Optimized reading with predicate pushdown (filtering rows),
> projection (selecting columns), parallel reading or fine-grained managing
> of tasks.
> >> >
> >> > Currently, only Parquet and Feather / Arrow IPC files are supported.
> The goal is to expand this in the future to other file formats and data
> sources (e.g. database connections).
> >> >
> >> > ~ Tabular Datasets
> >> >
> >> > The article from Wes was from 2017 and the snippet on Tabular
> Datasets is from the current documentation for pyarrow.
> >> >
> >> > Could anyone answer this question or at least clear up my confusion
> for me? Thank you!
> >> >
> >> > --
> >> > Jacob Zelko
> >> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> >> > Corning Community College - Engineering Science A.S. '17
> >> > Cell Number: (607) 846-8947
>

Re: Does Arrow Support Larger-than-Memory Handling?

Posted by Wes McKinney <we...@gmail.com>.
I'm not sure where the conflict is in what's written online, but by
virtue of being designed such that data structures do not require
memory buffers to be RAM resident (i.e. can reference memory maps), we
are set up well to process larger-than-memory datasets. In C++ at
least we are putting the pieces in place to be able to do efficient
query execution on on-disk datasets, and it may already be possible in
Rust with DataFusion.
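
For instance, on the Julia side discussed earlier in this thread, an
out-of-core aggregation over an Arrow IPC file can be written as a
batch-at-a-time scan (a hedged sketch; "events.arrow" and the amount
column are made up):

    using Arrow

    # Each iteration yields one record batch; only that batch's values
    # ever need to be decoded, so peak memory stays roughly one batch.
    total = sum(sum(skipmissing(batch.amount))
                for batch in Arrow.Stream("events.arrow"))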

On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <ch...@techascent.com> wrote:
>
> There are ways to handle datasets larger than memory.  mmap'ing one or more arrow files and going from there is a pathway forward here:
>
> https://techascent.com/blog/memory-mapping-arrow.html
>
> How this maps to other software ecosystems I don't know but many have mmap support.
>
> On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <ja...@gmail.com> wrote:
>>
>> I believe it would be good if you define your use case.
>>
>> I do handle larger than memory datasets with pyarrow with the use of
>> dataset.scan but my use case is very specific as I am repartitioning
>> and cleaning a bit large datasets.
>>
>> BR,
>>
>> Jacek
>>
>> czw., 22 paź 2020 o 20:39 Jacob Zelko <ja...@gmail.com> napisał(a):
>> >
>> > Hi all,
>> >
>> > Very basic question as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets? I saw this post by Wes McKinney here discussing that the tooling is being laid down:
>> >
>> > Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tabler is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
>> >
>> > ~ Apache Arrow and the “10 Things I Hate About pandas”
>> >
>> > And then in the docs I saw this:
>> >
>> > The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
>> >
>> > A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
>> > Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization, ..)
>> > Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
>> >
>> > Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
>> >
>> > ~ Tabular Datasets
>> >
>> > The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current documentation for pyarrow.
>> >
>> > Could anyone answer this question or at least clear up my confusion for me? Thank you!
>> >
>> > --
>> > Jacob Zelko
>> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
>> > Corning Community College - Engineering Science A.S. '17
>> > Cell Number: (607) 846-8947

Re: Does Arrow Support Larger-than-Memory Handling?

Posted by Chris Nuernberger <ch...@techascent.com>.
There are ways to handle datasets larger than memory. mmap'ing one or more
arrow files and going from there is one pathway forward:

https://techascent.com/blog/memory-mapping-arrow.html

How this maps to other software ecosystems I don't know, but many have
mmap support.
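
As one concrete example from the Julia ecosystem, mmap'ing several arrow
files and scanning them one at a time looks roughly like this (file and
column names are placeholders):

    using Arrow

    files = ["part1.arrow", "part2.arrow"]   # one or more arrow files
    parts = [Arrow.Table(f) for f in files]  # each is memory-mapped, not loaded
    total = sum(sum(p.x) for p in parts)     # the OS pages data in as it is touched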

On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <ja...@gmail.com>
wrote:

> I believe it would be good if you define your use case.
>
> I do handle larger than memory datasets with pyarrow with the use of
> dataset.scan but my use case is very specific as I am repartitioning
> and cleaning a bit large datasets.
>
> BR,
>
> Jacek
>
> czw., 22 paź 2020 o 20:39 Jacob Zelko <ja...@gmail.com> napisał(a):
> >
> > Hi all,
> >
> > Very basic question as I have seen conflicting sources. I come from the
> Julia community and was wondering if Arrow can handle larger-than-memory
> datasets? I saw this post by Wes McKinney here discussing that the tooling
> is being laid down:
> >
> > Table columns in Arrow C++ can be chunked, so that appending to a table
> is a zero copy operation, requiring no non-trivial computation or memory
> allocation. By designing up front for streaming, chunked tables, appending
> to existing in-memory tabler is computationally inexpensive relative to
> pandas now. Designing for chunked or streaming data is also essential for
> implementing out-of-core algorithms, so we are also laying the foundation
> for processing larger-than-memory datasets.
> >
> > ~ Apache Arrow and the “10 Things I Hate About pandas”
> >
> > And then in the docs I saw this:
> >
> > The pyarrow.dataset module provides functionality to efficiently work
> with tabular, potentially larger than memory and multi-file datasets:
> >
> > A unified interface for different sources: supporting different sources
> and file formats (Parquet, Feather files) and different file systems
> (local, cloud).
> > Discovery of sources (crawling directories, handle directory-based
> partitioned datasets, basic schema normalization, ..)
> > Optimized reading with predicate pushdown (filtering rows), projection
> (selecting columns), parallel reading or fine-grained managing of tasks.
> >
> > Currently, only Parquet and Feather / Arrow IPC files are supported. The
> goal is to expand this in the future to other file formats and data sources
> (e.g. database connections).
> >
> > ~ Tabular Datasets
> >
> > The article from Wes was from 2017 and the snippet on Tabular Datasets
> is from the current documentation for pyarrow.
> >
> > Could anyone answer this question or at least clear up my confusion for
> me? Thank you!
> >
> > --
> > Jacob Zelko
> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> > Corning Community College - Engineering Science A.S. '17
> > Cell Number: (607) 846-8947
>

Re: Does Arrow Support Larger-than-Memory Handling?

Posted by Jacek Pliszka <ja...@gmail.com>.
I believe it would be good if you defined your use case.

I do handle larger-than-memory datasets with pyarrow using dataset.scan,
but my use case is very specific, as I am repartitioning and cleaning
somewhat large datasets.

BR,

Jacek

On Thu, 22 Oct 2020 at 20:39, Jacob Zelko <ja...@gmail.com> wrote:
>
> Hi all,
>
> Very basic question as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets? I saw this post by Wes McKinney here discussing that the tooling is being laid down:
>
> Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tabler is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
>
> ~ Apache Arrow and the “10 Things I Hate About pandas”
>
> And then in the docs I saw this:
>
> The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
>
> A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
> Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization, ..)
> Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
>
> Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
>
> ~ Tabular Datasets
>
> The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current documentation for pyarrow.
>
> Could anyone answer this question or at least clear up my confusion for me? Thank you!
>
> --
> Jacob Zelko
> Georgia Institute of Technology - Biomedical Engineering B.S. '20
> Corning Community College - Engineering Science A.S. '17
> Cell Number: (607) 846-8947