Posted to dev@arrow.apache.org by Maarten Breddels <ma...@gmail.com> on 2019/11/26 15:01:03 UTC

Non-chunked large files / hdf5 support

In vaex I always write the data to hdf5 as one large chunk (per column).
The reason is that this allows the memory-mapped columns to be exposed as
a single numpy array (talking about numerical data only for now), which
many people are quite comfortable with.

The strategy for vaex to write unchunked data is to first create an
'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
write to them in chunks.

This means that in vaex I need to support mutable data (only used
internally; vaex's default is immutable data, like Arrow), since I need
to write to the memory-mapped data. It also makes the exporting code
relatively simple.
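
To make this concrete, here is a minimal sketch of the preallocate-then-fill
approach (assuming h5py and numpy; the function name and the way chunks are
produced are illustrative, and this is not vaex's actual code):

import h5py
import numpy as np

def export_column(path, name, dtype, length, chunks):
    # Create one contiguous, uncompressed dataset and write zeros so that
    # HDF5 actually allocates the block on disk (allocation is lazy by default).
    with h5py.File(path, "w") as f:
        ds = f.create_dataset(name, shape=(length,), dtype=dtype)
        ds[...] = 0
        offset = ds.id.get_offset()  # byte offset of the contiguous data

    # Memory-map the preallocated region (mutable) and fill it chunk by chunk.
    out = np.memmap(path, dtype=dtype, mode="r+", offset=offset, shape=(length,))
    start = 0
    for chunk in chunks:  # chunks: an iterable of numpy arrays
        out[start:start + len(chunk)] = chunk
        start += len(chunk)
    out.flush()

Anyone with an hdf5 library can then read (or memory-map) the column back as
a single numpy array.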

I could not find a way in Arrow to get something similar done, at
least not without having a single pa.array instance for each column. I
think Arrow's mindset is that you should just use chunks, right? Or is
this also something that can be considered for Arrow?

An alternative would be to implement Arrow in hdf5, which I basically
do now in vaex (with limited support). Again, I'm wondering whether
there is interest from the Arrow community in storing Arrow data in
hdf5.

cheers,

Maarten

Re: Non-chunked large files / hdf5 support

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Dec 17, 2019 at 5:15 AM Maarten Breddels
<ma...@gmail.com> wrote:
>
> Hi,
>
> I had to catch up a bit with the Arrow documentation before I could respond
> properly. My fear was that Arrow demanded that the in-memory representation
> was always 'packed', or 'flat'. After going through the docs, it seems that
> data is only written in this form when doing IPC or stream writing. But it
> seems that e.g. ChunkedArray and StructArray can have their arrays/fields
> anywhere in memory. So given a set of contiguous (ignoring chunking)
> datasets in an hdf5 file, we should be able to memory-map them and pass
> them to an Apache Arrow Table without any memory copy. Is this assumption
> correct?

Correct, yes.
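
For illustration, a rough sketch of this zero-copy path (assuming h5py and
pyarrow, and contiguous, uncompressed, null-free numeric datasets; the helper
name is illustrative):

import h5py
import numpy as np
import pyarrow as pa

def hdf5_to_arrow_table(path, column_names):
    arrays = []
    with h5py.File(path, "r") as f:
        for name in column_names:
            ds = f[name]
            # Only valid for contiguous, uncompressed datasets.
            offset = ds.id.get_offset()
            col = np.memmap(path, dtype=ds.dtype, mode="r",
                            offset=offset, shape=ds.shape)
            # pa.array() wraps the numpy buffer for primitive types
            # without copying.
            arrays.append(pa.array(col))
    return pa.Table.from_arrays(arrays, names=column_names)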

>
> Indeed, assuming the story above is correct, Apache Arrow and hdf5 are
> completely orthogonal.

Yes, that's right.

>
> As mentioned before, I have a strong preference for having the ability to
> write out a vaex DataFrame of any size as one chunk. Having this in hdf5
> will make the data trivial to read for anyone using an hdf5 library, who
> can then map the data to a single numpy array.

This would seem relevant only to numeric data having no nulls. I'm
sure there are problem domains where many datasets are like this, but
in the world of business analytics and databases it seems more
exceptional. NumPy arrays were always a poor fit as a backing data
structure for analytics, but in 2008/2009 having NumPy interoperability
was important. It seems a great deal less important to me now,
particularly with regard to non-numeric, nullable (of any type),
categorical, or nested data.
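
(Purely as an illustration: a nullable Arrow array is backed by a validity
bitmap plus a values buffer, which a single plain numpy array cannot
represent without a copy or a sentinel value.)

import pyarrow as pa

arr = pa.array([1, 2, None, 4], type=pa.int64())
print(arr.null_count)  # 1
print(arr.buffers())   # [validity bitmap buffer, int64 values buffer]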

> I will probably explore this idea more within vaex, in Python. I'll try to
> keep in touch about this, and would happily see it go into Apache Arrow. I
> hope we can formalize an intuitive way to write Apache Arrow Tables into
> hdf5.

I agree it would be useful; I will keep an eye out for what you learn.

> cheers,
>
> Maarten

Re: Non-chunked large files / hdf5 support

Posted by Maarten Breddels <ma...@gmail.com>.
Hi,

I had to catch up a bit with the Arrow documentation before I could respond
properly. My fear was that Arrow demanded that the in-memory representation
was always 'packed', or 'flat'. After going through the docs, it seems that
data is only written in this form when doing IPC or stream writing. But it
seems that e.g. ChunkedArray and StructArray can have their arrays/fields
anywhere in memory. So given a set of contiguous (ignoring chunking)
datasets in an hdf5 file, we should be able to memory-map them and pass
them to an Apache Arrow Table without any memory copy. Is this assumption
correct?

Indeed, assuming the story above is correct, Apache Arrow and hdf5 are
completely orthogonal.

As mentioned before, I have a strong preference for having the ability to
write out a vaex DataFrame of any size as one chunk. Having this in hdf5
will make the data trivial to read for anyone using an hdf5 library, who
can then map the data to a single numpy array.

I will probably explore this idea more within vaex, in Python. I'll try to
keep in touch about this, and would happily see it go into Apache Arrow. I
hope we can formalize an intuitive way to write Apache Arrow Tables into
hdf5.

cheers,

Maarten

Re: Non-chunked large files / hdf5 support

Posted by Wes McKinney <we...@gmail.com>.
hi,

There have been a number of discussions over the years about on-disk
pre-allocation strategies. No volunteers have implemented anything,
though. Developing an HDF5 integration library with pre-allocation and
buffer management utilities seems like a reasonable growth area for
the project. The functionality provided by HDF5 and Apache Arrow (and
whether they're doing the same things -- which they aren't) has
actually been a common point of confusion for onlookers, so
clarifying that one can work together with the other might be helpful.

In both C++ and Python we have methods for assembling arrays and
record batches from mutable buffers, so if you allocate the buffers
and populate them, you can assemble a record batch or table from them
in a straightforward manner.
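
As a sketch of the Python side of that (the file name and sizes are just
placeholders, not an endorsed recipe): preallocate a mutable, file-backed
buffer, populate it, then wrap it without copying:

import numpy as np
import pyarrow as pa

n = 1_000_000
# Preallocate a mutable, file-backed buffer and populate it in place.
values = np.memmap("column_x.bin", dtype=np.int64, mode="w+", shape=(n,))
values[:] = np.arange(n)

# Wrap the same memory as an Arrow array; no nulls, so the validity
# buffer is None.
buf = pa.py_buffer(values)
arr = pa.Array.from_buffers(pa.int64(), n, [None, buf])
batch = pa.RecordBatch.from_arrays([arr], names=["x"])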

- Wes



Re: Non-chunked large files / hdf5 support

Posted by Francois Saint-Jacques <fs...@gmail.com>.
Hello Maarten,

In theory, you could provide a custom mmap allocator and use the
builder facility. Since the array is still in the "build phase" and not
sealed, it should be fine if mremap changes the pointer address. This
might fail in practice since the allocator is also used for auxiliary
data, e.g. dictionary hash table data in the case of Dictionary type.


Another solution is to create a `FixedBuilder` class where:
- the number of elements is known,
- the data type is of fixed width, and
- nullability is known (whether you need an extra buffer).

I think sooner or later we'll need such a class.
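
To sketch the idea above (the class name and API are purely illustrative,
built only from existing pyarrow/numpy calls, not a proposal for the actual
interface):

import numpy as np
import pyarrow as pa

class FixedBuilder:
    """Illustrative builder for a known length, fixed-width, non-nullable type."""

    def __init__(self, length, dtype=np.int64):
        # Could equally be an np.memmap or an mmap-backed buffer.
        self._values = np.empty(length, dtype=dtype)
        self._pos = 0

    def append_chunk(self, chunk):
        n = len(chunk)
        self._values[self._pos:self._pos + n] = chunk
        self._pos += n

    def finish(self):
        assert self._pos == len(self._values), "builder not completely filled"
        # Seal the preallocated memory as an Arrow array without copying;
        # a nullable variant would add a validity bitmap as the first buffer.
        return pa.Array.from_buffers(
            pa.from_numpy_dtype(self._values.dtype),
            len(self._values),
            [None, pa.py_buffer(self._values)],
        )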

François
