Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2022/11/03 14:49:55 UTC

Creating dictionary encoded string in C++

Hello,

I am working on converting some internal data sources to Arrow data. One
particular set of data we have contains many string columns that can be
dictionary-encoded (basically string enums).

The current internal C++ API I am using gives me an iterator of "row"
objects. For each string column, the row object exposes a method
"getStringField(index)" that returns a "string_view", and I want to
construct a dictionary-encoded Arrow string column from it.

My questions are:
(1) Is there a way to do this using the Arrow C++ API?
(2) Does the internal C++ API need to return something other than a
"string_view" to support this? Internally the string column is already
dictionary-encoded (although not in Arrow format) and it might already know
the dictionary and the encoded (int) value for each string field, but it
doesn't expose it now.

Thanks,
Li

Re: Creating dictionary encoded string in C++

Posted by Rok Mihevc <ro...@gmail.com>.
Hi Li,

If it's practical for you to create an index array and a dictionary array
from your source, you could use those to create a DictionaryArray, as seen
here [1].
Another option that might fit your situation is to use a dictionary builder
[2].

Best,
Rok

[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict_test.cc#L1128-L1160
[2]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict_test.cc#L218-L249
