Posted to user@arrow.apache.org by Surya Kiran Gullapalli <su...@gmail.com> on 2023/05/04 17:27:43 UTC

[C++] std::vector to Datum

Hello,
I'm trying to use a std::vector (of strings) in CallFunction ("is_in").
arrow::compute::SetLookupOptions takes a Datum (in my case, an array of
strings to search for).

I tried this

std::vector<std::string> vec;
auto buffer = arrow::Buffer::Wrap(vec);
auto arrayData = arrow::ArrayData::Make (arrow::utf8(), vec.size(),
{nullptr, buffer});
auto options = arrow::compute::SetLookupOptions(arrayData);
auto res = arrow::compute::CallFunction ("is_in", {arrowArray}, &options);

This is resulting in a crash.

I tried calling arrow::MakeArray(arrayData), and that is also failing.

But if I convert the std::vector to arrow::Array (using StringBuilder) then
there's no crash and I'm getting expected results.

Am I using arrow::Buffer/arrow::ArrayData/arrow::Datum correctly, or
am I missing something?

Thanks,
Surya

Re: [C++] std::vector to Datum

Posted by Felipe Oliveira Carvalho <fe...@gmail.com>.
If you control the function that produces the vector<string>, you can avoid
all these fragmented allocations by re-using the same std::string in a loop
and reserving buffers upfront in the builder:

RETURN_NOT_OK(string_builder.Reserve(number_of_strings));
RETURN_NOT_OK(string_builder.ReserveData(sum_of_lengths_of_all_strings_or_an_estimate_of_that));

std::string s;
for (...) {
  s.clear();  // this doesn't deallocate s's internal buffer
  // ... populate the string s. Avoids a new memory allocation if it is
  // smaller than the biggest string so far.
  RETURN_NOT_OK(string_builder.Append(s));
}
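
If the strings already sit in a std::vector<std::string>, the two sizes
passed to Reserve() and ReserveData() above can be computed exactly rather
than estimated. The following is a minimal sketch using only the standard
library; ReserveHint, ComputeReserveHint and ReserveHintExample are
hypothetical names made up for illustration, not part of Arrow.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type (not an Arrow type): the two sizes that would be
// handed to Reserve() and ReserveData() respectively.
struct ReserveHint {
  std::int64_t count;        // number of strings that will be appended
  std::int64_t total_bytes;  // sum of the lengths of all strings
};

// Walks the vector once and adds up the string lengths.
ReserveHint ComputeReserveHint(const std::vector<std::string>& strings) {
  ReserveHint hint{static_cast<std::int64_t>(strings.size()), 0};
  for (const std::string& s : strings) {
    hint.total_bytes += static_cast<std::int64_t>(s.size());
  }
  return hint;
}

// Self-check with a three-element input: lengths 2 + 0 + 3 = 5.
inline void ReserveHintExample() {
  const std::vector<std::string> strings = {"ab", "", "cde"};
  const ReserveHint hint = ComputeReserveHint(strings);
  assert(hint.count == 3);
  assert(hint.total_bytes == 5);
}
```

The extra pass over the vector only reads sizes, so it is cheap compared
with the reallocations it is meant to avoid.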

--
Felipe


Re: [C++] std::vector to Datum

Posted by Aldrin <oc...@pm.me>.
To give you a bit of an overview that you may be missing, in order of abstraction (high to low):

-   Datum is like a wrapper that provides union semantics, in the C sense. For example, it contains an Array or a ChunkedArray or a Table, etc., but one and only one of them.
-   Array is like an interface, and it stores its data in ArrayData.
-   ArrayData is like a container that owns data (it is responsible for releasing the data) and provides functions to interact with that data.
-   Buffer is how the data is stored. Buffers are used for the values, for offsets ("pointers") into the values, and for a bitmap that indicates which values are null (I did not describe these in any particular order).

I didn't find a good spot in the documentation that mentions this, but [1] shows the types that you can/should put into Datum. So, compute functions typically expect Arrays (or something that can be wrapped in Datum); ArrayData is a lower level of abstraction than they're expecting.


[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/datum.h#L54
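
The "union semantics" of Datum described above can be sketched with
std::variant (C++17). This is an illustration only, not Arrow's
implementation: FakeArray, FakeChunkedArray and MiniDatum are stand-in names
invented here, and the real Datum accepts more kinds of values than the two
shown.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Stand-in types for illustration only; these are NOT Arrow's classes.
struct FakeArray { std::vector<std::string> values; };
struct FakeChunkedArray { std::vector<FakeArray> chunks; };

// Holds nothing, an array, or a chunked array, and never more than one.
class MiniDatum {
 public:
  enum class Kind { kNone, kArray, kChunkedArray };

  MiniDatum() = default;
  // Converting constructors are deliberately implicit so that a value can be
  // passed wherever a MiniDatum is expected.
  MiniDatum(std::shared_ptr<FakeArray> a) : value_(std::move(a)) {}
  MiniDatum(std::shared_ptr<FakeChunkedArray> c) : value_(std::move(c)) {}

  // Reports which alternative is currently stored.
  Kind kind() const {
    if (std::holds_alternative<std::shared_ptr<FakeArray>>(value_)) {
      return Kind::kArray;
    }
    if (std::holds_alternative<std::shared_ptr<FakeChunkedArray>>(value_)) {
      return Kind::kChunkedArray;
    }
    return Kind::kNone;
  }

 private:
  std::variant<std::monostate, std::shared_ptr<FakeArray>,
               std::shared_ptr<FakeChunkedArray>>
      value_;
};

// Self-check: assigning a new value replaces the previously held one.
inline void MiniDatumExample() {
  MiniDatum d;
  assert(d.kind() == MiniDatum::Kind::kNone);
  d = std::make_shared<FakeArray>();
  assert(d.kind() == MiniDatum::Kind::kArray);
  d = std::make_shared<FakeChunkedArray>();
  assert(d.kind() == MiniDatum::Kind::kChunkedArray);
}
```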




--
Aldrin
https://github.com/drin/
https://gitlab.com/octalene


Re: [C++] std::vector to Datum

Posted by Felipe Oliveira Carvalho <fe...@gmail.com>.
std::vector<std::string>::data() returns a pointer to an array of
std::string objects, each of which manages its own separate character
buffer, whereas Arrow needs a single buffer with contiguous variable-length
character data.

That character data goes in buffers[2]. buffers[1] contains the offsets for
the beginning and end of each string in buffers[2].

So yes, use the StringBuilder.
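
The offsets-plus-contiguous-data layout described above can be illustrated
without Arrow at all. The following is a minimal sketch using only the
standard library; FlatStrings, Flatten, GetString and FlattenExample are
hypothetical names invented here, not Arrow APIs, and the 32-bit offsets are
an assumption matching the utf8 type. In real code the StringBuilder is what
produces buffers of this shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for the offsets buffer and the data buffer.
struct FlatStrings {
  std::vector<std::int32_t> offsets;  // n + 1 entries for n strings
  std::string data;                   // all characters, back to back
};

// Copies every string into one contiguous buffer and records where each
// string ends; string i occupies [offsets[i], offsets[i + 1]) in data.
FlatStrings Flatten(const std::vector<std::string>& strings) {
  FlatStrings flat;
  flat.offsets.reserve(strings.size() + 1);
  flat.offsets.push_back(0);
  for (const std::string& s : strings) {
    flat.data += s;
    // Offsets are 32-bit here, so the total size must fit in int32_t.
    assert(flat.data.size() <= static_cast<std::size_t>(
                                   std::numeric_limits<std::int32_t>::max()));
    flat.offsets.push_back(static_cast<std::int32_t>(flat.data.size()));
  }
  return flat;
}

// Recovers string i by slicing the data buffer between two offsets.
std::string GetString(const FlatStrings& flat, std::size_t i) {
  const std::int32_t begin = flat.offsets[i];
  const std::int32_t end = flat.offsets[i + 1];
  return flat.data.substr(static_cast<std::size_t>(begin),
                          static_cast<std::size_t>(end - begin));
}

// Self-check: {"ab", "", "cde"} flattens to data "abcde", offsets 0 2 2 5.
inline void FlattenExample() {
  const std::vector<std::string> strings = {"ab", "", "cde"};
  const FlatStrings flat = Flatten(strings);
  assert(flat.data == "abcde");
  assert((flat.offsets == std::vector<std::int32_t>{0, 2, 2, 5}));
  assert(GetString(flat, 2) == "cde");
}
```

A std::vector<std::string> has neither of these two pieces, which is why
wrapping its memory directly cannot yield a valid string array.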

--
Felipe
