Posted to dev@parquet.apache.org by gamaken k <ga...@gmail.com> on 2021/12/03 14:25:52 UTC

RLE-Dictionary encoding spec diverged from implementation?

Hello everyone,

As mentioned in a previous post, I'm writing an implementation of the
Parquet specification in .NET. My goal is to be cross-compatible with
parquet-cpp and parquet-mr.

My question today is with respect to RLE-Dictionary encoding. The spec
says the "length of the encoded-data" is placed before the "encoded-data".

```
rle-bit-packed-hybrid: <length> <encoded-data>
length := length of the <encoded-data> in bytes stored as 4 bytes little
endian (unsigned int32)
encoded-data := <run>*
```
*RLE-Dictionary-Encoding spec:*
https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
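
To make that layout concrete, here is a minimal sketch of the
`<length> <encoded-data>` framing as the quoted grammar reads. It only
illustrates the spec text; `with_length_prefix` is a hypothetical helper
name and is not taken from parquet-mr or parquet-cpp.

```rust
/// Prepends the 4-byte little-endian length described by the grammar.
/// `encoded_data` is assumed to already hold the concatenated runs.
fn with_length_prefix(encoded_data: &[u8]) -> Vec<u8> {
    // The grammar stores the length as an unsigned 32-bit integer.
    assert!(encoded_data.len() <= u32::MAX as usize, "run data too long");
    let len = encoded_data.len() as u32;

    let mut out = Vec::with_capacity(4 + encoded_data.len());
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(encoded_data);
    out
}
```

For two bytes of run data, the output is four length bytes (`02 00 00 00`)
followed by those two bytes.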

If I do this, parquet-cpp is unable to read my data correctly (I get
garbage values). I'm unable to set up parquet-cli due to another issue, so
I can't easily test against parquet-mr. However, I read the source code,
and it appears to me that `bitWidth` (and not the length) is placed before
the encoded data.

BytesInput bytes = concat(BytesInput.from(bytesHeader), rleEncodedBytes);
*Source :*
https://github.com/apache/parquet-mr/blob/01a5d074829ad4cf4de1f662d54fe7bceb4bef63/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L173

The reader does the same here:
https://github.com/apache/parquet-mr/blob/01a5d074829ad4cf4de1f662d54fe7bceb4bef63/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesReader.java#L58

If I follow this implementation, parquet-cpp is able to read the values I
wrote using my implementation of dictionary encoding. Notably, if I skip
writing the bit width in front, parquet-cpp still works, and parquet-mr
tries to read using a bitWidth of 1, which happens to work for my current
test case (I'm encoding 0-9 as values) but is probably not correct?
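
As a sanity check on the bit width itself, here is a minimal sketch of how
many bits a set of dictionary indices needs. It assumes the indices run
from 0 up to the largest index in use; `bit_width_for` is a hypothetical
helper name, not an API of any Parquet library.

```rust
/// Number of bits needed to represent every index in `0..=max_index`.
/// Returns 0 when the only index is 0.
fn bit_width_for(max_index: u32) -> u8 {
    // 32 minus the count of leading zero bits is the position of the
    // highest set bit, which is exactly the width required.
    (32 - max_index.leading_zeros()) as u8
}
```

Under that assumption, indices 0 through 9 need 4 bits, since 9 is `1001`
in binary.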

Kindly help me understand the correct algorithm for RLE-Dictionary encoding
and what should be set before the encoded data.

Is it the case here that the spec has diverged from the implementation?

Many thanks,
Gamaken (Balaji K).

Re: RLE-Dictionary encoding spec diverged from implementation?

Posted by gamaken k <ga...@gmail.com>.
Hi everyone,

I just wanted to post an update: after a marathon debugging session, I was
able to make my implementation of RLE-Dictionary encoding round-trip with
parquet-mr.
My mistakes, in order:
1. I initially encoded the length (length of the <encoded-data> in bytes,
stored as 4 bytes little endian) in front of the payload.
2. Then, looking at the implementation, I switched to encoding the bitWidth
in 4 bytes.
3. That was a costly mistake. I eventually realized I had to encode the
bitWidth as 1 byte.
4. With that change, my implementation worked.

This should not be this hard or obscure :( The spec should clearly state
"bitWidth encoded as 1 byte little endian". I'm going to file a Jira to
discuss this and, if everyone agrees, update the specification to save
future implementers some time.
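
For anyone following the same path, the layout that ended up working can
be sketched roughly as below: one byte of bit width, then the runs, with no
4-byte length in front. This is a simplified illustration, not my actual
.NET code: `write_varint` and `encode_indices` are hypothetical names, and
the sketch emits only RLE runs (never bit-packed runs) to stay short.

```rust
/// Appends `value` as an unsigned LEB128 varint.
fn write_varint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Encodes dictionary indices as `[bit width: 1 byte][RLE runs]`.
/// Assumes every index fits in `bit_width` bits and every run is shorter
/// than 2^31 values.
fn encode_indices(indices: &[u32], bit_width: u8) -> Vec<u8> {
    assert!(bit_width <= 32, "bit width must be at most 32");
    // Each repeated value is stored in bit_width rounded up to whole bytes.
    let value_bytes = (bit_width as usize + 7) / 8;

    let mut out = vec![bit_width];
    let mut i = 0;
    while i < indices.len() {
        let value = indices[i];
        let mut run_len = 1;
        while i + run_len < indices.len() && indices[i + run_len] == value {
            run_len += 1;
        }
        // RLE run header: run length shifted left by one, low bit 0.
        write_varint((run_len as u32) << 1, &mut out);
        out.extend_from_slice(&value.to_le_bytes()[..value_bytes]);
        i += run_len;
    }
    out
}
```

With indices `[3, 3, 3, 7]` and a bit width of 3, this produces the bytes
`03 06 03 02 07`: the bit width, a run of three 3s, then a run of one 7.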

Sincerely,
Balaji

On Fri, Dec 3, 2021 at 12:28 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Sorry, I misunderstood your question.
>
> I just checked with an implementation of mine [1] that roundtrips with C++
> and pyspark, I had to use what you wrote: in dictionary pages, the data is
> presented as: [def levels][bitwidth(1 byte)][rle-encoded indices]. So,
> something like:
>
> ```rust
> let encoded_indices: &[u8] = // splitted from the data page
>
> let bit_width = encoded_indices[0];
> let encoded_indices = &encoded_indices[1..];
> let mut new_indices = HybridRleDecoder::new(encoded_indices, bit_width as
> i32, length);
> ```
>
> length here is presented to us by the number of values announced on the
> page.
>
> I agree that this is inconsistent with how it is described in the spec,
> which has an extra <length>. Maybe that length is only expected to be
> declared when RLE-encoding data pages of integer types?
>
> Best,
> Jorge
>
>
> [1]
>
> https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/primitive/dictionary.rs#L40
> [2]
>
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
>

Re: RLE-Dictionary encoding spec diverged from implementation?

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Sorry, I misunderstood your question.

I just checked against an implementation of mine [1] that round-trips with
C++ and pyspark, and I had to use what you wrote: in dictionary-encoded data
pages, the data is laid out as
[def levels][bitwidth (1 byte)][rle-encoded indices]. So, something like:

```rust
// `encoded_indices: &[u8]` is the slice split from the data page (after the def levels)
let bit_width = encoded_indices[0];
let encoded_indices = &encoded_indices[1..];
let mut new_indices = HybridRleDecoder::new(encoded_indices, bit_width as i32, length);
```

`length` here is the number of values declared in the page header.

I agree that this is inconsistent with how it is described in the spec,
which has an extra <length>. Maybe that length is only expected to be
declared when RLE-encoding data pages of integer types?
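
For reference, the read side can be sketched without any library as below.
This is a simplified illustration and not the arrow2 code: `read_varint`
and `decode_indices` are hypothetical names, and `page` is assumed to start
at the bit-width byte (the def levels are already split off).

```rust
/// Reads an unsigned LEB128 varint at `*pos`, advancing `*pos`.
fn read_varint(data: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result = 0u32;
    let mut shift = 0;
    loop {
        let byte = *data.get(*pos)?;
        *pos += 1;
        result |= u32::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
        if shift >= 32 {
            return None; // more continuation bytes than a u32 can hold
        }
    }
}

/// Decodes `[bit width: 1 byte][runs]` into `num_values` indices.
/// Returns `None` if the data ends early or the bit width exceeds 32.
fn decode_indices(page: &[u8], num_values: usize) -> Option<Vec<u32>> {
    let (&bit_width, runs) = page.split_first()?;
    if bit_width > 32 {
        return None;
    }
    let bit_width = bit_width as usize;
    let value_bytes = (bit_width + 7) / 8;
    let mut out = Vec::with_capacity(num_values);
    let mut pos = 0;
    while out.len() < num_values {
        let header = read_varint(runs, &mut pos)?;
        if header & 1 == 0 {
            // RLE run: (header >> 1) copies of one fixed-width value.
            let run_len = (header >> 1) as usize;
            let bytes = runs.get(pos..pos + value_bytes)?;
            pos += value_bytes;
            let mut buf = [0u8; 4];
            buf[..value_bytes].copy_from_slice(bytes);
            let value = u32::from_le_bytes(buf);
            let take = run_len.min(num_values - out.len());
            out.extend(std::iter::repeat(value).take(take));
        } else {
            // Bit-packed run: (header >> 1) groups of 8 values, packed
            // from the least significant bit of each byte.
            let groups = (header >> 1) as usize;
            let byte_len = groups * bit_width;
            let bytes = runs.get(pos..pos + byte_len)?;
            pos += byte_len;
            for i in 0..groups * 8 {
                if out.len() == num_values {
                    break;
                }
                let mut value = 0u32;
                for b in 0..bit_width {
                    let bit = i * bit_width + b;
                    if (bytes[bit / 8] >> (bit % 8)) & 1 == 1 {
                        value |= 1 << b;
                    }
                }
                out.push(value);
            }
        }
    }
    Some(out)
}
```

The number of values comes from the page header rather than from the byte
stream, which is why the sketch takes `num_values` as an argument and stops
once that many indices have been produced.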

Best,
Jorge


[1]
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/primitive/dictionary.rs#L40
[2]
https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8




On Fri, Dec 3, 2021 at 9:00 PM gamaken k <ga...@gmail.com> wrote:

> Thanks Jorge! I read through your PR, but it does not clarify, for me, what
> precedes the runs of encoded data. Is it length (as the spec says) or is it
> bit-width (as the implementation says) ?
> Your PR calls that bit-width should be a parameter of the decoder. If you
> meant this should be an argument to the decoder function, I wonder why? To
> me, it seems that is what is written along with the encoded data (at least
> going by code).

Re: RLE-Dictionary encoding spec diverged from implementation?

Posted by gamaken k <ga...@gmail.com>.
Thanks, Jorge! I read through your PR, but for me it does not clarify what
precedes the runs of encoded data. Is it the length (as the spec says) or
the bit width (as the implementation says)?
Your PR says that the bit width should be a parameter of the decoder. If you
meant that it should be an argument to the decoder function, I wonder why?
To me, it seems that it is written along with the encoded data (at least
going by the code).


On Fri, Dec 3, 2021 at 7:04 AM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> I agree that the spec is a bit confusing.
>
> I recently had to go through this exercise and left a PR with a more
> verbose description of RLE [1] aimed at mitigating this.
>
> [1] https://github.com/apache/parquet-format/pull/170

Re: RLE-Dictionary encoding spec diverged from implementation?

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
I agree that the spec is a bit confusing.

I recently had to go through this exercise and left a PR with a more
verbose description of RLE [1] aimed at mitigating this.

[1] https://github.com/apache/parquet-format/pull/170





On Fri, Dec 3, 2021, 15:26 gamaken k <ga...@gmail.com> wrote:

> Hello everyone,
>
> As mentioned in a previous post, I'm writing an implementation for the
> parquet specification in .net. My goal is to be cross-compatible with
> parquet-cpp and parquet-mr.
>
> My question today is with respect to RLE-Dictionary encoding. The spec
> says the "length of the encoded-data" is placed before the "encoded-data".
>
> ```
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> encoded-data := <run>*
> ```
> *RLE-Dictionary-Encoding spec:*
>
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
>
> If I did this, parquet-cpp is unable to read my data correctly (I get
> garbage values). I'm unable to set up parquet-cli due to another issue, so,
> I'm unable to test parquet-mr easily. However, I read the source code and
> it appears to me that `bitWidth` (and not length) is placed before the
> encoded data.
>
> BytesInput bytes = concat(BytesInput.from(bytesHeader), rleEncodedBytes);
> *Source :*
>
> https://github.com/apache/parquet-mr/blob/01a5d074829ad4cf4de1f662d54fe7bceb4bef63/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L173
>
> the reader also does the same here:
>
> https://github.com/apache/parquet-mr/blob/01a5d074829ad4cf4de1f662d54fe7bceb4bef63/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesReader.java#L58
>
> If I follow this implementation, Parquet-CPP is able to read the values I
> wrote using my implementation of dictionary encoding. Notably, If I ignored
> setting the bitwidth in front, parquet-cpp works and parquet-mr tries to
> read using a bitWidth of 1 which would work for my current test case (I'm
> encoding 0-9 as values) but probably not correct?
>
> Kindly help me understand the correct algorithm for RLE-Dictionary encoding
> and what should be set before the encoded data.
>
> Is it the case here that the spec has diverged from the implementation.
>
> Many thanks,
> Gamaken (Balaji K).
>