Posted to dev@arrow.apache.org by Will Jones <wi...@gmail.com> on 2022/05/31 14:12:41 UTC

Re: C++ Helpers for Row and Arrow conversions

For those interested, the PR for this new API is ready for review here:
https://github.com/apache/arrow/pull/12775

On Wed, Apr 6, 2022 at 11:17 AM Will Jones <wi...@gmail.com> wrote:

> Hello,
>
> I've fleshed out the ideas in the doc in this draft PR:
> https://github.com/apache/arrow/pull/12775
>
> Feedback on the API design is still welcome.
>
> Best,
>
> Will Jones
>
> On Thu, Mar 24, 2022 at 10:25 AM Will Jones <wi...@gmail.com>
> wrote:
>
>> Antoine,
>>
>> That's a good question. I think there's a critical part that I haven't
>> articulated well in the doc yet.
>>
>> When converting from Arrow's columnar format to Rows, you have three
>> options:
>>
>> (1) Go through the record batch row by row
>> (2) Iterate through each column of the record batch, adding that column's
>> value to each row
>> (3) Iterate through smaller sub-batches of the record batch, and do (2)
>> on each sub-batch
>>
>> The converter would do (3). In the cases I've heard of, this seems to be
>> the most performant approach, though I would welcome others' opinions on
>> that. I imagine there are some "memory locality" benefits, though I am no
>> expert on that.
>>
>> This is most apparent when you look at the following two methods:
>>
>> template <typename T>
>> class ToRowConverter {
>>  public:
>>   // Implemented by the subclass: converts one batch to rows.
>>   virtual arrow::Result<std::vector<T>> Convert(
>>       std::shared_ptr<arrow::RecordBatch> batch) = 0;
>>   // Derived from Convert(): converts in slices of at most batch_size rows.
>>   arrow::Result<std::vector<T>> RecordBatchToRows(
>>       std::shared_ptr<arrow::RecordBatch> batch, size_t batch_size);
>> };
>>
>> The idea here is that RecordBatchToRows() will convert in smaller slices
>> dictated by batch_size. A Record Batch with 2 million rows might be
>> converted 10,000 rows at a time.
>>
>> I'm going to update the doc to make that clearer, but does what I
>> described above seem sensible?
>>
>> Best,
>> Will Jones
>>
>>
>>
>> On Thu, Mar 24, 2022 at 9:47 AM Antoine Pitrou <an...@python.org>
>> wrote:
>>
>>>
>>> Hello Will,
>>>
>>> So the added value would simply be the automatic definition of
>>> iterator-returning methods? Or am I missing something?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> Le 23/03/2022 à 19:36, Will Jones a écrit :
>>> > Hello Arrow devs,
>>> >
>>> > I recently created ARROW-16006 [1] ("Helpers for converting between rows
>>> > and Arrow objects"), and would appreciate feedback. It's meant for
>>> > conversion from arbitrary schemas, whereas the existing C++ examples
>>> > demonstrate fixed schemas (that is, known at compile-time).
>>> >
>>> > If you have implemented conversion between Arrow and row-based data
>>> > structures in C++ (or tried to): Would these helpers work for your use
>>> > case? There is an associated draft design doc linked in the issue [2],
>>> > which is open to comments.
>>> >
>>> > Thanks,
>>> >
>>> > Will Jones
>>> >
>>> > [1] https://issues.apache.org/jira/browse/ARROW-16006
>>> > [2] https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
>>> >
>>>
>>
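
For readers following the thread, option (3) from the quoted message above (convert column by column, but only within slices of at most batch_size rows) can be sketched roughly as follows. This is a minimal illustration only, not the code in the PR: ColumnarBatch, Row, ConvertSlice and BatchToRows are hypothetical stand-ins built on the standard library so the example compiles without Arrow, and treating batch_size == 0 as "one slice" is an assumption made here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a two-column record batch (NOT the Arrow API).
struct ColumnarBatch {
  std::vector<int64_t> ids;
  std::vector<std::string> names;
  std::size_t num_rows() const { return ids.size(); }
};

// Hypothetical row type that the columns are converted into.
struct Row {
  int64_t id;
  std::string name;
};

// Option (2) applied to one slice: walk each column in turn and write its
// values into the rows covering [offset, offset + length).
std::vector<Row> ConvertSlice(const ColumnarBatch& batch, std::size_t offset,
                              std::size_t length) {
  std::vector<Row> rows(length);
  for (std::size_t i = 0; i < length; ++i) rows[i].id = batch.ids[offset + i];
  for (std::size_t i = 0; i < length; ++i) rows[i].name = batch.names[offset + i];
  return rows;
}

// Option (3): run ConvertSlice over consecutive slices of at most batch_size
// rows and concatenate the results in order.
std::vector<Row> BatchToRows(const ColumnarBatch& batch, std::size_t batch_size) {
  // Assumption for this sketch: batch_size == 0 means "convert in one slice".
  if (batch_size == 0) batch_size = batch.num_rows();
  std::vector<Row> out;
  out.reserve(batch.num_rows());
  for (std::size_t offset = 0; offset < batch.num_rows(); offset += batch_size) {
    const std::size_t length = std::min(batch_size, batch.num_rows() - offset);
    std::vector<Row> rows = ConvertSlice(batch, offset, length);
    out.insert(out.end(), std::make_move_iterator(rows.begin()),
               std::make_move_iterator(rows.end()));
  }
  return out;
}
```

Calling BatchToRows(batch, 10000) on a 2-million-row batch would invoke ConvertSlice 200 times, each time touching only 10,000 rows per column pass. Whether that is actually faster than options (1) or (2) on a given workload would need to be measured.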