
Posted to dev@arrow.apache.org by Will Jones <wi...@gmail.com> on 2022/04/06 18:17:23 UTC

Re: C++ Helpers for Row and Arrow conversions

Hello,

I've fleshed out the ideas in the doc in this draft PR:
https://github.com/apache/arrow/pull/12775

Feedback on the API design is still welcome.

Best,

Will Jones

On Thu, Mar 24, 2022 at 10:25 AM Will Jones <wi...@gmail.com> wrote:

> Antoine,
>
> That's a good question. I think there's a critical part that I haven't
> articulated well in the doc yet.
>
> When converting from Arrow's columnar format to Rows, you have three
> options:
>
> (1) Go through the record batch row by row
> (2) Iterate through each column of the record batch, adding each column
> value to its row
> (3) Iterate through smaller sub-batches of the record batch, and do (2) on
> each sub-batch
>
> The converter would do (3). In the cases I've heard of, this seems to be
> the most performant option, though I would welcome others' opinions on
> that. I imagine there are some "memory locality" benefits, though I am no
> expert on that.
>
> This is most apparent when you look at the following two methods:
>
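> template <typename T>
> class ToRowConverter {
>  public:
>   virtual ~ToRowConverter() = default;
>   // Implemented by the subclass: converts one (small) batch to rows.
>   virtual arrow::Result<std::vector<T>> Convert(
>       std::shared_ptr<arrow::RecordBatch> batch) = 0;
>   // Derived from Convert(): converts `batch` in slices of at most
>   // batch_size rows.
>   arrow::Result<std::vector<T>> RecordBatchToRows(
>       std::shared_ptr<arrow::RecordBatch> batch, size_t batch_size);
> };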
>
> The idea here is that RecordBatchToRows() will convert in smaller slices
> whose size is dictated by batch_size. A record batch with 2 million rows
> might be converted 10,000 rows at a time.
>
> I'm going to update the doc to make that clearer, but does what I
> described above seem sensible?
>
> Best,
> Will Jones
>
>
>
> On Thu, Mar 24, 2022 at 9:47 AM Antoine Pitrou <an...@python.org> wrote:
>
>>
>> Hello Will,
>>
>> So the added value would simply be the automatic definition of
>> iterator-returning methods? Or am I missing something?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 23/03/2022 at 19:36, Will Jones wrote:
>> > Hello Arrow devs,
>> >
>> > I recently created ARROW-16006 [1] ("Helpers for converting between rows
>> > and Arrow objects"), and would appreciate feedback. It's meant for
>> > conversion from arbitrary schemas, whereas the existing C++ examples
>> > demonstrate fixed schemas (that is, known at compile-time).
>> >
>> > If you have implemented conversion between Arrow and row-based data
>> > structures in C++ (or tried to): Would these helpers work for your use
>> > case? There is an associated draft design doc linked in the issue [2],
>> > which is open to comments.
>> >
>> > Thanks,
>> >
>> > Will Jones
>> >
>> > [1] https://issues.apache.org/jira/browse/ARROW-16006
>> > [2] https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
>> >
>>
>
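
The sub-batch strategy described in the thread (option 3: slice the batch, then fill rows column by column within each slice) might look roughly like the sketch below. It deliberately avoids the Arrow library so that it compiles on its own. ColumnarBatch, Row, ConvertSlice and BatchToRows are hypothetical stand-ins invented for illustration, not names from Arrow or from the PR, and returning no rows for a batch_size of 0 is an assumption rather than the proposed behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a two-column record batch (not an Arrow type).
struct ColumnarBatch {
  std::vector<int64_t> ids;
  std::vector<std::string> names;
  std::size_t num_rows() const { return ids.size(); }
};

// Hypothetical row type produced by the conversion.
struct Row {
  int64_t id = 0;
  std::string name;
};

// Option (2) applied to rows [offset, offset + length): append `length` rows
// to `out`, then fill them one column at a time.
void ConvertSlice(const ColumnarBatch& batch, std::size_t offset,
                  std::size_t length, std::vector<Row>* out) {
  const std::size_t base = out->size();
  out->resize(base + length);
  for (std::size_t i = 0; i < length; ++i) {
    (*out)[base + i].id = batch.ids[offset + i];
  }
  for (std::size_t i = 0; i < length; ++i) {
    (*out)[base + i].name = batch.names[offset + i];
  }
}

// Option (3): walk the batch in slices of at most batch_size rows and apply
// the column-wise conversion to each slice.
std::vector<Row> BatchToRows(const ColumnarBatch& batch,
                             std::size_t batch_size) {
  std::vector<Row> rows;
  if (batch_size == 0) return rows;  // assumption: 0 means "convert nothing"
  rows.reserve(batch.num_rows());
  for (std::size_t offset = 0; offset < batch.num_rows();
       offset += batch_size) {
    const std::size_t length = std::min(batch_size, batch.num_rows() - offset);
    ConvertSlice(batch, offset, length, &rows);
  }
  return rows;
}
```

With a 7-row batch and a batch_size of 3, the loop processes slices of 3, 3 and 1 rows. Whether this ordering is actually faster than a plain row-by-row walk depends on the row type and the data, so it is worth benchmarking rather than assuming.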

Re: C++ Helpers for Row and Arrow conversions

Posted by Will Jones <wi...@gmail.com>.

For those interested, the PR for this new API is ready for review here:
https://github.com/apache/arrow/pull/12775
