Posted to dev@arrow.apache.org by Maarten Breddels <ma...@gmail.com> on 2019/11/26 13:43:49 UTC

Strategy for mixing large_string and string with chunked arrays

Hi Arrow devs,

Small intro: I'm the main Vaex developer. Vaex is an out-of-core dataframe
library for Python - https://github.com/vaexio/vaex - and we're
looking into moving it to Apache Arrow for the data structure.
At the beginning of this year we added string support in Vaex, which
required 64-bit offsets. Those were not available in Arrow back then, so we
added our own data structure for string arrays. Our first step in moving
to Apache Arrow is to see if we can use Arrow for the data structure,
and later on move Vaex's string algorithms to Arrow.

(originally posted at https://github.com/apache/arrow/issues/5874)

In Vaex I can lazily concatenate dataframes without a memory copy. If I
implement this using a pa.ChunkedArray, users cannot concatenate a
dataframe whose string column has type pa.string with a dataframe whose
corresponding column has type pa.large_string.

In short, there is no Arrow data structure to handle this 'mixed
chunked array', but I was wondering if this could change. The only way
out seems to be casting them manually to a common type (although that is
currently blocked by https://issues.apache.org/jira/browse/ARROW-6071).
I could work around this internally in Vaex, but feedback from building a
DataFrame library on top of Arrow might be useful. It also means I cannot
expose the concatenated DataFrame as an Arrow table.
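
To make this concrete, here is a minimal sketch of the failure and of the
manual workaround (assuming a pyarrow where the string -> large_string cast
from ARROW-6071 is implemented; the variable names are just for illustration):

    import pyarrow as pa

    s32 = pa.array(["foo", "bar"], type=pa.string())   # 32-bit offsets
    s64 = pa.array(["baz"], type=pa.large_string())    # 64-bit offsets

    # pa.chunked_array([s32, s64]) raises, because all chunks of a
    # ChunkedArray must have exactly the same type.

    # Manual workaround: upcast the 32-bit chunks to the common wider type.
    combined = pa.chunked_array([s32.cast(pa.large_string()), s64])
    print(combined.type)  # large_string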

Because of this, I am wondering whether having two types (large_string
and string) is a good idea in the end, since it makes type checking
cumbersome (two types to check every time). Could an option be to have
only one string type and one list type, with the width of the
indices/offsets determined at runtime? That would also make it easy to
support 16- and 8-bit offsets, it would make Arrow more flexible and
efficient, and I suspect it would play better with pa.ChunkedArray.
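
As an illustration of the cumbersome part: even a trivial helper that only
wants the offsets of a string array has to branch on both types. This is just
a sketch (string_offsets is a hypothetical name, and it assumes unsliced
arrays):

    import numpy as np
    import pyarrow as pa

    def string_offsets(arr):
        # Branch on two logically equivalent types just to learn the offset width.
        if pa.types.is_string(arr.type):
            offset_dtype = np.int32
        elif pa.types.is_large_string(arr.type):
            offset_dtype = np.int64
        else:
            raise TypeError(f"not a string array: {arr.type}")
        # buffers() is [validity, offsets, data] for (large_)string arrays.
        return np.frombuffer(arr.buffers()[1], dtype=offset_dtype, count=len(arr) + 1)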

Regards,

Maarten Breddels

Re: Strategy for mixing large_string and string with chunked arrays

Posted by Maarten Breddels <ma...@gmail.com>.
On Wed, Nov 27, 2019 at 19:37, Wes McKinney <we...@gmail.com> wrote:

> On Tue, Nov 26, 2019 at 9:40 AM Maarten Breddels
> <ma...@gmail.com> wrote:
> >
> > On Tue, Nov 26, 2019 at 15:02, Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Maarten
> > >
> > > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part
> based
> > > on this.
> > >
> > > I think that normalizing to a common type (which would require casting
> > > the offsets buffer, but not the data -- which can be shared -- so not
> > > too wasteful) during concatenation would be the approach I would take.
> > > I would be surprised if normalizing string offsets during record batch
> > > / table concatenation showed up as a performance or memory use issue
> > > relative to other kinds of operations -- in theory the
> > > string->large_string promotion should be relatively exceptional (< 5%
> > > of the time). I've found in performance tests that creating many
> > > smaller array chunks is faster anyway due to interplay with the memory
> > > allocator.
> > >
> >
> > Yes, I think it is rare, but it does mean that if a user wants to
> convert a
> > Vaex dataframe to an Arrow table it might use GB's of RAM (thinking ~1
> > billion rows). Ideally, it would use zero RAM (imagine concatenating many
> > large memory-mapped datasets).
> > I'm ok living with this limitation, but I wanted to raise it before v1.0
> > goes out.
> >
>
> The 1.0 release is about hardening the format and protocol, which
> wouldn't be affected by this discussion. The Binary/String and
> LargeBinary/LargeString are distinct memory layouts and so they need
> to be separate at the protocol level.
>
> At the C++ library / application level there's plenty that could be
> done if this turned out to be an issue. For example, an ExtensionType
> could be defined that allows the storage to be either 32-bit or
> 64-bit.
>

Ok, sounds good.


>
> >
> >
> > >
> > > Of course I think we should have string kernels for both 32-bit and
> > > 64-bit variants. Note that Gandiva already has significant string
> > > kernel support (for 32-bit offsets at the moment) and there is
> > > discussion about pre-compiling the LLVM IR into a shared library to
> > > not introduce an LLVM runtime dependency, so we could maintain a
> > > single code path for string algorithms that can be used both in a
> > > JIT-ed setting as well as pre-compiled / interpreted setting. See
> > > https://issues.apache.org/jira/browse/ARROW-7083
> >
> >
> > That is a very interesting approach, thanks for sharing that resource,
> I'll
> > consider that.
> >
> >
> > > Note that many analytic database engines (notably: Dremio, which is
> > > natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> > > at all and it does not seem to be an impedance in practical use. We
> > > have the Chunked* builder classes [1] in C++ to facilitate the
> > > creation of chunked binary arrays where there is concern about
> > > overflowing the 2GB limit.
> > >
> > > Others may have different opinions so I'll let them comment.
> > >
> >
> > Yes, I think in many cases it's not a problem at all. Also in vaex, all
> the
> > processing happens in chunks, and no chunk will ever be that large (for
> the
> > near future...).
> > In vaex, when exporting to hdf5, I always write in 1 chunk, and that's
> > where most of my issues show up.
>
> I see. Ideally one would architect around the chunked model since this
> seems to have the best overall performance and scalability qualities.
>

Note that I prefer non-chunked on disk, but chunked in memory (or while
processing). Having one linear array gives you the ability to chunk it up
any way you prefer (to make cache hits optimal, etc.), while if the
on-disk chunking does not match the ideal chunk size, you may end up
processing small chunks or having to make memory copies (see the sketch
below).
Where do you see performance differences between chunked and non-chunked?
It would be interesting to know more about that.
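
Here is that sketch: re-chunking one linear array is essentially free,
because Arrow slices are zero-copy (rechunk is just an illustrative helper):

    import pyarrow as pa

    def rechunk(arr, chunk_size):
        # Slices share the parent's buffers, so no string data is copied.
        return pa.chunked_array(
            [arr.slice(i, chunk_size) for i in range(0, len(arr), chunk_size)])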

cheers,

Maarten


>
> >
> > cheers,
> >
> > Maarten
> >
> >
> > >
> > > - Wes
> > >
> > > [1]:
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510

Re: Strategy for mixing large_string and string with chunked arrays

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Nov 26, 2019 at 9:40 AM Maarten Breddels
<ma...@gmail.com> wrote:
>
> On Tue, Nov 26, 2019 at 15:02, Wes McKinney <we...@gmail.com> wrote:
>
> > hi Maarten
> >
> > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based
> > on this.
> >
> > I think that normalizing to a common type (which would require casting
> > the offsets buffer, but not the data -- which can be shared -- so not
> > too wasteful) during concatenation would be the approach I would take.
> > I would be surprised if normalizing string offsets during record batch
> > / table concatenation showed up as a performance or memory use issue
> > relative to other kinds of operations -- in theory the
> > string->large_string promotion should be relatively exceptional (< 5%
> > of the time). I've found in performance tests that creating many
> > smaller array chunks is faster anyway due to interplay with the memory
> > allocator.
> >
>
> Yes, I think it is rare, but it does mean that if a user wants to convert a
> Vaex dataframe to an Arrow table it might use GB's of RAM (thinking ~1
> billion rows). Ideally, it would use zero RAM (imagine concatenating many
> large memory-mapped datasets).
> I'm ok living with this limitation, but I wanted to raise it before v1.0
> goes out.
>

The 1.0 release is about hardening the format and protocol, which
wouldn't be affected by this discussion. The Binary/String and
LargeBinary/LargeString are distinct memory layouts and so they need
to be separate at the protocol level.

At the C++ library / application level there's plenty that could be
done if this turned out to be an issue. For example, an ExtensionType
could be defined that allows the storage to be either 32-bit or
64-bit.
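
For illustration, a bare-bones sketch of such a type using the Python
ExtensionType API (the name FlexibleString and the serialization scheme are
made up):

    import pyarrow as pa

    class FlexibleString(pa.ExtensionType):
        """Hypothetical extension type whose storage is string or large_string."""

        def __init__(self, large=False):
            storage = pa.large_string() if large else pa.string()
            super().__init__(storage, "example.flexible_string")

        def __arrow_ext_serialize__(self):
            return b"large" if self.storage_type == pa.large_string() else b"small"

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls(large=(serialized == b"large"))

    pa.register_extension_type(FlexibleString())

    # Both arrays now carry the same logical type, regardless of offset width.
    small = pa.ExtensionArray.from_storage(FlexibleString(), pa.array(["a", "bb"]))
    large = pa.ExtensionArray.from_storage(
        FlexibleString(large=True), pa.array(["a", "bb"], type=pa.large_string()))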

>
>
> >
> > Of course I think we should have string kernels for both 32-bit and
> > 64-bit variants. Note that Gandiva already has significant string
> > kernel support (for 32-bit offsets at the moment) and there is
> > discussion about pre-compiling the LLVM IR into a shared library to
> > not introduce an LLVM runtime dependency, so we could maintain a
> > single code path for string algorithms that can be used both in a
> > JIT-ed setting as well as pre-compiled / interpreted setting. See
> > https://issues.apache.org/jira/browse/ARROW-7083
>
>
> That is a very interesting approach, thanks for sharing that resource, I'll
> consider that.
>
>
> > Note that many analytic database engines (notably: Dremio, which is
> > natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> > at all and it does not seem to be an impedance in practical use. We
> > have the Chunked* builder classes [1] in C++ to facilitate the
> > creation of chunked binary arrays where there is concern about
> > overflowing the 2GB limit.
> >
> > Others may have different opinions so I'll let them comment.
> >
>
> Yes, I think in many cases it's not a problem at all. Also in vaex, all the
> processing happens in chunks, and no chunk will ever be that large (for the
> near future...).
> In vaex, when exporting to hdf5, I always write in 1 chunk, and that's
> where most of my issues show up.

I see. Ideally one would architect around the chunked model since this
seems to have the best overall performance and scalability qualities.

>
> cheers,
>
> Maarten
>
>
> >
> > - Wes
> >
> > [1]:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510

Re: Strategy for mixing large_string and string with chunked arrays

Posted by Maarten Breddels <ma...@gmail.com>.
On Tue, Nov 26, 2019 at 15:02, Wes McKinney <we...@gmail.com> wrote:

> hi Maarten
>
> I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based
> on this.
>
> I think that normalizing to a common type (which would require casting
> the offsets buffer, but not the data -- which can be shared -- so not
> too wasteful) during concatenation would be the approach I would take.
> I would be surprised if normalizing string offsets during record batch
> / table concatenation showed up as a performance or memory use issue
> relative to other kinds of operations -- in theory the
> string->large_string promotion should be relatively exceptional (< 5%
> of the time). I've found in performance tests that creating many
> smaller array chunks is faster anyway due to interplay with the memory
> allocator.
>

Yes, I think it is rare, but it does mean that if a user wants to convert a
Vaex dataframe to an Arrow table, it might use gigabytes of RAM (thinking ~1
billion rows). Ideally it would use zero extra RAM (imagine concatenating
many large memory-mapped datasets).
I'm OK living with this limitation, but I wanted to raise it before v1.0
goes out.



>
> Of course I think we should have string kernels for both 32-bit and
> 64-bit variants. Note that Gandiva already has significant string
> kernel support (for 32-bit offsets at the moment) and there is
> discussion about pre-compiling the LLVM IR into a shared library to
> not introduce an LLVM runtime dependency, so we could maintain a
> single code path for string algorithms that can be used both in a
> JIT-ed setting as well as pre-compiled / interpreted setting. See
> https://issues.apache.org/jira/browse/ARROW-7083


That is a very interesting approach, thanks for sharing that resource, I'll
consider that.


> Note that many analytic database engines (notably: Dremio, which is
> natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> at all and it does not seem to be an impedance in practical use. We
> have the Chunked* builder classes [1] in C++ to facilitate the
> creation of chunked binary arrays where there is concern about
> overflowing the 2GB limit.
>
> Others may have different opinions so I'll let them comment.
>

Yes, I think in many cases it's not a problem at all. In Vaex, too, all the
processing happens in chunks, and no chunk will ever be that large (for the
near future...).
In Vaex, when exporting to HDF5, I always write a single chunk, and that's
where most of my issues show up.

cheers,

Maarten


>
> - Wes
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510

Re: Strategy for mixing large_string and string with chunked arrays

Posted by Wes McKinney <we...@gmail.com>.
hi Maarten

I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based on this.

I think that normalizing to a common type (which would require casting
the offsets buffer, but not the data -- which can be shared -- so not
too wasteful) during concatenation would be the approach I would take.
I would be surprised if normalizing string offsets during record batch
/ table concatenation showed up as a performance or memory use issue
relative to other kinds of operations -- in theory the
string->large_string promotion should be relatively exceptional (< 5%
of the time). I've found in performance tests that creating many
smaller array chunks is faster anyway due to interplay with the memory
allocator.
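
To illustrate the sharing: a rough sketch (to_large_string is a hypothetical
helper, and it assumes unsliced arrays) that widens only the offsets while
reusing the validity and data buffers as-is:

    import numpy as np
    import pyarrow as pa

    def to_large_string(arr):
        # Widen the int32 offsets to int64; the validity and data buffers
        # are reused without copying (assumes arr.offset == 0).
        validity, offsets32, data = arr.buffers()
        offsets64 = np.frombuffer(offsets32, dtype=np.int32,
                                  count=len(arr) + 1).astype(np.int64)
        return pa.Array.from_buffers(
            pa.large_string(), len(arr),
            [validity, pa.py_buffer(offsets64), data],
            null_count=arr.null_count)

    small = pa.array(["foo", None, "bar"])
    large = to_large_string(small)
    assert large.to_pylist() == small.to_pylist()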

Of course I think we should have string kernels for both 32-bit and
64-bit variants. Note that Gandiva already has significant string
kernel support (for 32-bit offsets at the moment) and there is
discussion about pre-compiling the LLVM IR into a shared library to
not introduce an LLVM runtime dependency, so we could maintain a
single code path for string algorithms that can be used both in a
JIT-ed setting as well as pre-compiled / interpreted setting. See
https://issues.apache.org/jira/browse/ARROW-7083

Note that many analytic database engines (notably: Dremio, which is
natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
at all and it does not seem to be an impediment in practical use. We
have the Chunked* builder classes [1] in C++ to facilitate the
creation of chunked binary arrays where there is concern about
overflowing the 2GB limit.
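
The Python API does not expose those builders directly; as a rough
pure-Python analogue of the idea (build_chunked_strings is a hypothetical
helper, not the C++ classes), one can start a new chunk whenever the
accumulated character data would overflow 32-bit offsets:

    import pyarrow as pa

    def build_chunked_strings(values, chunk_limit=2**31 - 1):
        # Start a new chunk before the character data of the current chunk
        # would exceed what 32-bit offsets can address.
        chunks, current, nbytes = [], [], 0
        for v in values:
            size = 0 if v is None else len(v.encode("utf-8"))
            if current and nbytes + size > chunk_limit:
                chunks.append(pa.array(current, type=pa.string()))
                current, nbytes = [], 0
            current.append(v)
            nbytes += size
        chunks.append(pa.array(current, type=pa.string()))
        return pa.chunked_array(chunks, type=pa.string())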

Others may have different opinions so I'll let them comment.

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510
