Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2022/01/08 13:29:34 UTC

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Fair enough (wrt deprecation). I think the sequence view is a
replacement for our existing layout (which requires O(N) selections),
but I agree with the sentiment that preserving compatibility is more
important than a single way of doing it. Thanks for that angle!

Imo the Arrow format is already composed of 3 specifications:

* C data interface (intra-process communication)
* IPC format (inter-process communication)
* Flight (RPC protocol)

E.g.:
* IPC requires a `dict_id` in the field declarations, but the C data
interface has no such requirement (because, pointers)
* IPC supports endianness metadata and compression; the C data interface
does not
* DataFusion does not support IPC (yet ^_^), but its Python bindings
leverage the C data interface to pass data to pyarrow

This is to say that, imo, as long as we document the different
specifications that compose Arrow and their intended purposes, it is ok.
Because the C data interface is the one with the tightest constraints
(zero-copy, higher chance of out-of-bounds reads, etc.), it makes sense
for proposals (and implementations) to first be written against it.


I agree with Neal's point wrt the IPC format. For extra context, many
`async` implementations use cooperative scheduling, which is vulnerable
to DoS if tasks perform heavy CPU-bound work (the p-thread is blocked
and can't switch). QP Hou and I have summarized a broader version of
this statement here [1].

In async contexts, if deserializing from IPC requires a significant
amount of compute, that task should be sent to a separate thread pool to
avoid blocking the p-threads assigned to the runtime. If the format
requires only O(1) CPU-bound work, it can be read in an async context
without a separate thread pool. Arrow's IPC format is quite unique there
in that it almost always requires O(1) CPU work to be loaded to memory
(at the expense of more disk usage).
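As a sketch of this scheduling concern (Python asyncio for illustration,
not the Rust/tokio code this thread has in mind; `cpu_bound_decode` is a
hypothetical stand-in for an O(N) task such as decompression or byte
swapping):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_decode(payload: bytes) -> bytes:
    """Hypothetical stand-in for an O(N) IPC read task."""
    return bytes(b ^ 0xFF for b in payload)


async def read_batch(payload: bytes) -> bytes:
    # Offload the O(N) work so the cooperative scheduler can keep
    # switching tasks; an O(1) load (wrapping existing buffers without
    # touching the bytes) would not need this extra hop.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        return await loop.run_in_executor(pool, cpu_bound_decode, payload)


decoded = asyncio.run(read_batch(b"\x00\x0f"))
```

The point being: when the format guarantees O(1) deserialization, the
executor hop (and its overhead) can be skipped entirely.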

I believe that atm we have two O(N) blocking tasks in reading the IPC
format (decompression and byte swapping (big <-> little endian)), and
three O(N) blocking tasks in writing it (compression, de-offsetting
bitmaps, byte swapping). The more prevalent O(N) CPU-bound tasks are at
the IPC interface, the less compelling it becomes vs e.g. Parquet
(files) or Avro (streams), which come with an expectation of CPU-bound
work. In this context, keeping the IPC format compatible with the ABI
spec is imo an important characteristic of Apache Arrow that we should
strive to preserve. Alternatively, we could also just abandon this idea
and say that the format expects CPU-bound work to deserialize (even if
considerably less than Avro or Parquet), so that people can design their
APIs accordingly.
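To make the byte-swapping cost concrete, a minimal sketch (Python;
`swap_int32_buffer` is an illustrative name, not a real Arrow API):
every value in the buffer must be repacked, so the work is linear in the
data size.

```python
import struct


def swap_int32_buffer(data: bytes) -> bytes:
    """Re-encode a little-endian int32 buffer as big-endian: one O(N) pass."""
    n = len(data) // 4
    values = struct.unpack("<%di" % n, data)  # interpret as little-endian
    return struct.pack(">%di" % n, *values)   # repack as big-endian


big = swap_int32_buffer(struct.pack("<2i", 1, 256))
```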

Best,
Jorge

[1]
https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c


On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <an...@python.org> wrote:

>
>
> On 23/12/2021 17:59, Neal Richardson wrote:
> >> I think in this particular case, we should consider the C ABI /
> >> in-memory representation and IPC format as separate beasts. If an
> >> implementation of Arrow does not want to use this string-view array
> >> type at all (for example, if it created memory safety issues in Rust),
> >> then it can choose to convert to the existing string array
> >> representation when receiving a C ABI payload. Whether or not there is
> >> an alternate IPC format for this data type seems like a separate
> >> question -- my preference actually would be to support this for
> >> in-memory / C ABI use but not to alter the IPC format.
> >>
> >
> > I think this idea deserves some clarification or at least more
> > exposition. On first reading, it was not clear to me that we might
> > add things to the in-memory Arrow format but not IPC, or that that
> > was even an option. I'm guessing I'm not the only one who missed
> > that.
> >
> > If these new types are only part of the Arrow in-memory format, then it's
> > not the case that reading/writing IPC files involves no serialization
> > overhead. I recognize that that's technically already the case since IPC
> > supports compression now, but it's not generally how we talk about the
> > relationship between the IPC and in-memory formats (see our own FAQ [1],
> > for example). If we go forward with these changes, it would be a good
> > opportunity for us to clarify in our docs/website that the "Arrow format"
> > is not a single thing.
>
> I'm worried that making the "Arrow format" polysemic/context-dependent
> would spread a lot of confusion among potential users of Arrow.
>
> Regards
>
> Antoine.
>

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
I have prototyped the sequence views in Rust [1], and it seems a pretty
straightforward addition with a trivial representation in both IPC and FFI.

I did observe a performance difference between using signed (int64) and
unsigned (uint64) offsets/lengths:

take/sequence/20            time:   [20.491 ms 20.800 ms 21.125 ms]

take/sequence_signed/20     time:   [22.719 ms 23.142 ms 23.593 ms]

take/array/20               time:   [44.454 ms 45.056 ms 45.712 ms]

where 20 means 2^20 entries,
* array is our current array
* sequence is a sequence view of utf8 with uint64 indices, and
* sequence_signed is the same sequence view layout but with int64 indices

I.e., I observe a ~10% performance loss to support signed
offsets/lengths. Details in [2].
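For intuition on why take/sequence beats take/array in the numbers
above, a minimal Python sketch of the two selection paths (layout
simplified; the function names are illustrative, not arrow2's API):

```python
def take_sequence_view(views, indices):
    # A sequence-view element is a fixed-size (offset, length) pair into
    # a shared data buffer, so selection only gathers those pairs; no
    # string bytes are copied.
    return [views[i] for i in indices]


def take_offsets(offsets, data, indices):
    # The current layout keeps values contiguous, so selection must copy
    # every selected byte and rebuild the offsets buffer.
    out = bytearray()
    new_offsets = [0]
    for i in indices:
        out += data[offsets[i]:offsets[i + 1]]
        new_offsets.append(len(out))
    return new_offsets, bytes(out)


data = b"foobarbaz"
offsets = [0, 3, 6, 9]
views = [(0, 3), (3, 3), (6, 3)]
selected_views = take_sequence_view(views, [2, 0])
selected_offsets = take_offsets(offsets, data, [2, 0])
```

The view path stays O(k) in the number of indices taken regardless of
string lengths, while the offsets path is O(bytes selected).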

Best,
Jorge

[1] https://github.com/jorgecarleitao/arrow2/pull/784
[2] https://github.com/DataEngineeringLabs/arrow-string-view


Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Posted by Andrew Lamb <al...@influxdata.com>.
I also agree that splitting the StringView proposal into its own thing
would be beneficial for discussion clarity.


Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Posted by Antoine Pitrou <an...@python.org>.
On 12/01/2022 01:49, Wes McKinney wrote:

Indeed, this sounds like it will help in making a decision.

Personally, I am still very concerned by the idea of adding pointers to
the in-memory representation. Besides the loss of equivalence with the
IPC format, a representation using embedded pointers cannot be fully
validated for safety or correctness (how do you decide whether a
pointer is correct and doesn't reveal unrelated data?).

I think we should discuss this with the DuckDB folks (and possibly the
Velox folks, but I guess that it might be socio-politically more
difficult) so as to measure how much of an inconvenience it would be
for them to switch to a purely offsets-based approach.
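The validation asymmetry can be sketched in a few lines (Python;
`validate_offsets` is an illustrative name): offsets can be checked
against the length of the buffer they ship with, whereas a raw pointer
carries no information about which allocation, if any, it should belong
to.

```python
def validate_offsets(offsets, data_len):
    # An offsets-based layout is fully checkable: offsets must be
    # monotonically non-decreasing and stay within the attached buffer.
    return all(0 <= a <= b <= data_len
               for a, b in zip(offsets, offsets[1:]))


ok = validate_offsets([0, 3, 6, 9], 9)    # in-bounds and monotone
bad = validate_offsets([0, 3, 12], 9)     # points past the buffer
```

No analogous check exists for an embedded pointer: a consumer cannot
decide from the payload alone whether it points at the producer's data
or at unrelated memory.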

Regards

Antoine.




Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Posted by Wes McKinney <we...@gmail.com>.
hi all,

Thank you for all the comments on this mailing list thread and in the
Google document. There is definitely a lot of work to take some next
steps from here, so I think it would make sense to fork off each of
the proposed additions into dedicated discussions. The most
contentious issue, it seems, is whether to maintain a 1-to-1
relationship between the IPC format and the C ABI, which would make it
rather difficult to implement the "string view" data type in a way
that is flexible and useful to applications (for example, giving them
control over their own memory management as opposed to forcing data to
be "pre-serialized" into buffers that are referenced by offsets).

I tend to be of the "practicality beats purity" mindset, where
sufficiently beneficial changes to the in-memory format (and C ABI) may
be worth breaking the implicit contract that the IPC format and the
in-memory data structures have a strict 1-to-1 relationship. To help
reach some consensus around this, I suggest that I create a new document
focused only on the "string/binary view" type and the different
implementation considerations (like what happens when you write it to
the IPC format), as well as the different variants of the data structure
itself that have been discussed, with the associated trade-offs. Does
this sound like a good approach?

Thanks,
Wes

