Posted to user@arrow.apache.org by Haocheng Liu <lb...@gmail.com> on 2023/05/22 15:52:26 UTC

[C++][Parquet] Best practice to write duplicated strings / enums into parquet

Hi,

I have a use case that can be simplified as follows: there is a mapping
{0 -> "RED", 1 -> "GREEN", 2 -> "BLUE", etc.} and I need to write these
values hundreds of millions of times. Each row may contain tens of such
int -> string maps. When users read the data, they want to see "RED",
"GREEN" and "BLUE" rather than an opaque int.

According to the doc
<https://arrow.apache.org/docs/cpp/parquet.html#writer-properties>,
dictionary encoding is enabled by default so there are two possible
solutions:

1. Write strings via a StringBuilder and let Arrow do the encoding under
the hood.
2. Write enums (int) and provide the int -> string mapping in metadata(?).

Option 2 seems preferable to me, as it avoids expensive string comparisons
and possible string copies. Could someone please confirm whether my
understanding is correct? If so, how do I provide the int -> string mapping
in metadata? If not, what's the best practice here?

Thanks in advance.

Regards,
Haocheng Liu

Re: [C++][Parquet] Best practice to write duplicated strings / enums into parquet

Posted by Haocheng Liu <lb...@gmail.com>.
StringDictionaryBuilder sounds like a perfect candidate for my use case.
Thanks Weston!


-- 
Best regards

Re: [C++][Parquet] Best practice to write duplicated strings / enums into parquet

Posted by Weston Pace <we...@gmail.com>.
Arrow can also represent dictionary encoding.  If you like StringBuilder
then there is also a StringDictionaryBuilder which should be more or less
compatible:

TEST(TestStringDictionaryBuilder, Basic) {
  // Build the dictionary Array
  StringDictionaryBuilder builder;
  ASSERT_OK(builder.Append("RED"));
  ASSERT_OK(builder.Append("GREEN"));
  ASSERT_OK(builder.Append("RED"));

  std::shared_ptr<Array> result;
  ASSERT_OK(builder.Finish(&result));

  // Build expected data
  auto ex_dict = ArrayFromJSON(utf8(), "[\"RED\", \"GREEN\"]");
  auto dtype = dictionary(int8(), utf8());
  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0]");
  DictionaryArray expected(dtype, int_array, ex_dict);

  ASSERT_TRUE(expected.Equals(result));
}

If your encoding is fixed (e.g. you must always represent "GREEN" with 0
and "RED" with 1), you can use InsertMemoValues to establish your encoding
first:

TEST(TestStringDictionaryBuilder, WithMemoValues) {
  auto values = ArrayFromJSON(utf8(), R"(["GREEN", "RED"])");

  // Build the dictionary Array
  StringDictionaryBuilder builder;
  ASSERT_OK(builder.InsertMemoValues(*values));
  ASSERT_OK(builder.Append("RED"));
  ASSERT_OK(builder.Append("GREEN"));
  ASSERT_OK(builder.Append("RED"));

  std::shared_ptr<Array> result;
  ASSERT_OK(builder.Finish(&result));

  // Build expected data
  auto ex_dict = ArrayFromJSON(utf8(), "[\"GREEN\", \"RED\"]");
  auto dtype = dictionary(int8(), utf8());
  auto int_array = ArrayFromJSON(int8(), "[1, 0, 1]");
  DictionaryArray expected(dtype, int_array, ex_dict);

  ASSERT_TRUE(expected.Equals(result));
}
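
To make the role of the memo table concrete, here is a minimal
standard-library sketch of what dictionary encoding does conceptually. It is
not Arrow's implementation: the class TinyDictionaryEncoder and its Seed,
Append and Decode methods are hypothetical names invented for illustration,
with Seed playing the role that InsertMemoValues plays in the test above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dictionary builder; NOT Arrow's implementation.
class TinyDictionaryEncoder {
 public:
  // Pre-register values so they receive indices 0, 1, 2, ... in the given
  // order (assumption: this mirrors the intent of InsertMemoValues).
  void Seed(const std::vector<std::string>& values) {
    for (const auto& value : values) Intern(value);
  }

  // Append one row: only the small integer index is stored per row.
  void Append(const std::string& value) { indices_.push_back(Intern(value)); }

  // What a reader sees for a given row: the original string.
  const std::string& Decode(std::size_t row) const {
    assert(row < indices_.size());
    return dictionary_[static_cast<std::size_t>(indices_[row])];
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<std::int32_t>& indices() const { return indices_; }

 private:
  // Return the index of `value`, adding it to the dictionary if it is new.
  std::int32_t Intern(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    const auto index = static_cast<std::int32_t>(dictionary_.size());
    dictionary_.push_back(value);
    memo_.emplace(value, index);
    return index;
  }

  std::unordered_map<std::string, std::int32_t> memo_;  // value -> index
  std::vector<std::string> dictionary_;                 // index -> value
  std::vector<std::int32_t> indices_;                   // one entry per row
};
```

In this sketch, seeding with {"GREEN", "RED"} and then appending "RED",
"GREEN", "RED" stores the indices [1, 0, 1], mirroring the expected array in
the test above. If the dictionary is seeded in the same order as the enum,
the enum value and the dictionary index coincide.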
