You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Andrew Sherman (Code Review)" <ge...@cloudera.org> on 2018/10/04 20:09:42 UTC

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Andrew Sherman has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11582


Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So a run is encoded
as 2 bytes, but the actual cost of a run of small bit width values in
the middle of a literal is really 3 bytes, as a new indicator is needed
after the run.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 24 for single bit values, and of length 16 for 2 bit wide values.
All other bit widths will use the existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 497 insertions(+), 254 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1134/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 23 Oct 2018 20:13:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 484 insertions(+), 256 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/6
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 3: Code-Review+1

(3 comments)

If you do not want the implement the "use shorter min_repeated_run_length if the last run was a repeated run" logic then lgtm.

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: length 16 for single bit values. All other bit widths will use the
            : existing length 8 runs.
            : 
> I agree with your calculations. I think sometimes min_run_length=24 is bett
>This could be avoided by using smaller
>min_repeated_run_length if the last run was a repeated run:

Did you think about the logic above? I think that it would be very close to optimal encoding, and would not need extreme code changes (e.g. there could be a min_repeated_run_length_after_literal_run_, and a min_repeated_run_length_after_repeated_run_, and FlushLiteralRun / FlushRepeatedRun could switch between them).

About the "typical use case": I agree that interrupting repeated runs should be more common (but we shouldn't cause a regression in other cases).

I think that the most typical use case where we can win a few bytes is when one value is much less frequent than another (e.g.  most values in a column are NULL, while some are not), which leads to alternating short literal runs and longer repeated runs. If the ratio is around 1/8 - 1/24,  most of the repeated runs will be short enough to win with encoding as literal.


http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-encoding.h@266
PS3, Line 266: // Using min_repeated_run_length=16 is always better than 8.
I prefer to phrase it as "never worse than 8". For alternating long repeated runs and runs of 8 it is the same.


http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-encoding.h@267
PS3, Line 267: better than 8
Maybe "better than 16" would be better.



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 24 Oct 2018 11:18:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-test.cc@216
PS3, Line 216: c
Oops, I missed this one:
nit: capital C



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 24 Oct 2018 11:51:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: By default RleEncoder will now use run length encoding for runs of
            : length 24 for single bit values, and of length 16 for 2 bit wide values.
            : All other bit widths will use the existing length 8 runs.
> So you want me to write something like this in the jira?
Yes, it looks good, but I think that the worst case (alternating L8 R16) would be an even better example.


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@264
PS4, Line 264: insertin
> I think this is correct as is but I am changing it to the clearer
You are right, I misunderstood something.


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@265
PS4, Line 265: /  1 byte for the encoded run length
> I think before this change a long literal was not always the worst case, bu
I think that is still not always the worst case - the example you wrote for my comment in the commit message is a case when the literal_max_size is less then bytes used in the end.



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 31 Oct 2018 13:46:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Thomas Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@203
PS2, Line 203: the previous run is flushed out
> Done
I think that this didn't make it into the latest version of your patch



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Fri, 26 Oct 2018 17:17:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-test.cc@148
PS2, Line 148: Rle
> nit: RleTest
Done, missed this before



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 23 Oct 2018 19:39:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 495 insertions(+), 255 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Thomas Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/11582/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/1//COMMIT_MSG@31
PS1, Line 31: All unit tests pass
I assume that you ran the e2e tests, and also that you checked that there is an existing e2e tests that involve using RleEncoder to write values of bit width 1 and 2 (eg. in tests/query_tests/test_insert_parquet.py or similar) and if not could you add some?


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@208
PS1, Line 208: min_repeated_run_length
mention that it must be a multiple of 8, less than MAX_RUN_LENGTH_BUFFER


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@209
PS1, Line 209:  which is the best choice
seems like from your DCHECK(TestInfo::is_test()) that this is the required choice for non-test code


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@242
PS1, Line 242: MinRepeatedRunLength(bit_width)
should this be min_repeated_run_length_, in case its set to a non-default value?


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@380
PS1, Line 380: run_length
min_repeated_run_length_?



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 08 Oct 2018 22:28:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3462/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 14 Nov 2018 18:51:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11582/5/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/5/be/src/util/rle-encoding.h@261
PS5, Line 261:     int num_runs = static_cast<int>(BitUtil::Ceil(num_values, MAX_VALUES_PER_LITERAL_RUN));
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 30 Oct 2018 22:23:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 1:

(5 comments)

Another patch should be coming soon

http://gerrit.cloudera.org:8080/#/c/11582/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/1//COMMIT_MSG@31
PS1, Line 31: All unit tests pass
> I assume that you ran the e2e tests, and also that you checked that there i
I did run e2e tests, and now I have also checked that they encode values of bit width 1 and 2. Bit width 1 is used when writing definition levels (i.e. when a value is null or not). Bit width 2 is used when encoding a dictionary with 3 or 4 entries.


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@208
PS1, Line 208: min_repeated_run_length
> mention that it must be a multiple of 8, less than MAX_RUN_LENGTH_BUFFER
Done


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@209
PS1, Line 209:  which is the best choice
> seems like from your DCHECK(TestInfo::is_test()) that this is the required 
Done


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@242
PS1, Line 242: MinRepeatedRunLength(bit_width)
> should this be min_repeated_run_length_, in case its set to a non-default v
I've changed this function again after thinking it through more carefully, and it no longer uses MinRepeatedRunLength() in this way.


http://gerrit.cloudera.org:8080/#/c/11582/1/be/src/util/rle-encoding.h@380
PS1, Line 380: run_length
> min_repeated_run_length_?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 15 Oct 2018 23:52:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 499 insertions(+), 255 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/7
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h@259
PS8, Line 259: MaxBufferSize
> I am not 100% sure here, but I think that this could be expressed in an eas
OK, interesting. I think this is what you mean (see below).
I'm saying this is not clearer to humans (but it was worth trying)

  static int MaxBufferSize(int bit_width, int num_values) {
    int data_bytes;
    int overhead;
    if (bit_width > 2) {
      // The largest encoded size is for all long literals with no repeated runs.
      int bytes_per_run = (bit_width * MAX_VALUES_PER_LITERAL_RUN) / 8;
      int num_runs =
          static_cast<int>(BitUtil::Ceil(num_values, MAX_VALUES_PER_LITERAL_RUN));
      data_bytes = bytes_per_run * num_runs;
      overhead = num_runs;
    } else {
      data_bytes = static_cast<int>(BitUtil::Ceil(num_values * bit_width, 8));
      if (bit_width == 1) {
        // Worst case is we use 4 bytes for every 3 of the input e.g. L8 R16 L8 R16 L8.
        overhead = 1 + static_cast<int>(BitUtil::Ceil(data_bytes, 3));
      } else { // bit_width == 2
        // Worst case is we use 3 bytes for every 2 of the input e.g. L8 R8 L8 R8 L8.
        overhead = 1 + static_cast<int>(BitUtil::Ceil(data_bytes, 2));
      }
    }
    return std::max(MinBufferSize(bit_width), data_bytes + overhead);
  }


http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h@261
PS8, Line 261: encoded
> Maybe repeated would be better here?
Done


http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-test.cc@636
PS8, Line 636: buffer_len
> I think that this var could have a more descriptive name like expected_max_
Done


http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-test.cc@642
PS8, Line 642: bug
> typo: big :)
:-(



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 05 Nov 2018 18:45:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Thomas Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 14 Nov 2018 18:47:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 8: Code-Review+1

(4 comments)

Some typos and optional changes, lgtm otherwise.

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h@259
PS8, Line 259: MaxBufferSize
I am not 100% sure here, but I think that this could be expressed in an easier to understand way:
- calculate num_bytes (or num_bytes_plain)
- calculate the overhead depending on bit_width: on byte/MAX_VALUES_PER_LITERAL_RUN or 1 byte / 2 or 3 bytes for bit width 2 and 1
- return std::max(MinBufferSize(bit_width), num_bytes + overhead);


http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h@261
PS8, Line 261: encoded
Maybe repeated would be better here?


http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-test.cc@636
PS8, Line 636: buffer_len
I think that this var could have a more descriptive name like expected_max_buffer_len.


http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-test.cc@642
PS8, Line 642: bug
typo: big :)



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 05 Nov 2018 15:59:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(5 comments)

Thanks Thomas and Csaba for reviews so far

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: By default RleEncoder will now use run length encoding for runs of
            : length 24 for single bit values, and of length 16 for 2 bit wide values.
            : All other bit widths will use the existing length 8 runs.
> >This could be avoided by using smaller
I agree that this might be a better way, but I am happy with the simple improvement we get from this change.


http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-encoding.h@266
PS3, Line 266:  1 byte for the repeated value
> I prefer to phrase it as "never worse than 8". For alternating long repeate
Done


http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-encoding.h@267
PS3, Line 267: 
> Maybe "better than 16" would be better.
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@203
PS2, Line 203: the previous run is flushed out
> Done
Done for real this time


http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/11582/3/be/src/util/rle-test.cc@216
PS3, Line 216: c
> Oops, I missed this one:
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 29 Oct 2018 21:14:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1277/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 05 Nov 2018 19:44:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(4 comments)

Thanks Csaba

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: By default RleEncoder will now use run length encoding for runs of
            : length 24 for single bit values, and of length 16 for 2 bit wide values.
            : All other bit widths will use the existing length 8 runs.
> Ok, the biggest win (50% size for alternating runs of 8 in the 1 bit case) 
So you want me to write something like this in the jira?

If you know the structure of the data then better encodings are
possible. For example with bit_width=1, using min_run_length=24 is
better in the case where we avoid interrupting a literal run.

Using the notation of 'RXX' for a repeated run of length XX (so R16 is a
run of lngth 16), and 'LYY' for a literal run of length YY.

                                 L24 R16 L24 R16 L24
min_run_length 8                 4   2   4   2   4
min_run_length 16 (new default)  4   2   4   2   4
min_run_length 24                4   2   3   2   3 (one long literal run)

So it is possible to optimize by detecting this situation and avoiding
breaking a long literal run for a run of length 16.


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@250
PS4, Line 250: iter
> Can you simplify this expression? MAX_VALUES_PER_LITERAL_RUN must be divisi
Good idea


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@264
PS4, Line 264: insertin
> Aren't we double counting the indicator byte here? My assumption is that th
I think this is correct as is but I am changing it to the clearer
literal_max_size = num_runs * (1 + bytes_per_run)
-- 1 is the indicator
-- bytes_per_run is the encoded bytes


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@265
PS4, Line 265: /  1 byte for the encoded run length
> This seems to assume that a single big literal run is the worsts case - can
I think before this change a long literal was not always the worst case, but now it is.



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 30 Oct 2018 22:22:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 510 insertions(+), 255 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/5
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1060/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 16 Oct 2018 16:13:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1202/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 29 Oct 2018 21:48:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 14 Nov 2018 18:51:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1222/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 30 Oct 2018 23:16:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 582 insertions(+), 256 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/8
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1242/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Thu, 01 Nov 2018 00:07:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: By default RleEncoder will now use run length encoding for runs of
            : length 24 for single bit values, and of length 16 for 2 bit wide values.
            : All other bit widths will use the existing length 8 runs.
About the solution in general:
If I do not misunderstand something then it will lead to worse encoding if <24/16 repeated runs alternate with larger repeated runs.

bit width 1:
repeated lengths:               24 16 24 16 24
bytes needed with old solution: 2  2  2  2  2
bytes needed with new solution: 2  3  2  3  2

bit width 2:
repeated lengths:               16 8  16 8  16
bytes needed with old solution: 2  2  2  2  2
bytes needed with new solution: 2  3  2  3  2

This could be avoided by using smaller min_repeated_run_length if the last run was a repeated run: 16 for 1 bit width, 8 for 2 bit width (and all other bit widths).


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@211
PS2, Line 211: mu;tiple
typo: multiple


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@475
PS2, Line 475: // updates num_buffered_values
This comment does not seam true here.


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@549
PS2, Line 549: we
typo



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 17 Oct 2018 22:24:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 500 insertions(+), 255 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Thomas Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@203
PS2, Line 203: the previous run is flushed out
I think that it would be helpful to update this comment with a brief discussion of the use of the circular buffer and the way that you flush 8 values at a time when you have a not long enough repeated run.


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@348
PS2, Line 348: most values
most MAX_RUN_LENGTH_BUFFER values


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@350
PS2, Line 350: _
nit: period, here and elsewhere


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-test.cc@148
PS2, Line 148: Rle
nit: RleTest



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Fri, 19 Oct 2018 21:45:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 14 Nov 2018 22:58:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(7 comments)

Thanks for the reviews

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: By default RleEncoder will now use run length encoding for runs of
            : length 24 for single bit values, and of length 16 for 2 bit wide values.
            : All other bit widths will use the existing length 8 runs.
> About the solution in general:
I agree with your calculations. I think sometimes min_run_length=24 is better at bit width=1
I’m going to use the notation of R16 for a repeated run of length 16, and L16 for a literal run of length 16.
So your example (bit width 1) is:
repeated lengths:               R24 R16 R24 R16 R24
min_run_length 8 (old)          2   2   2   2   2
min_run_length 24 (new)         2   3   2   3   2
min_run_length 16               2   2   2   2   2
 
The case where min_run_length=24 is better is where we avoid interrupting a literal run 
                                L24 R16 L24 R16 L24
Old solution                    4   2   4   2   4
min_run_length 24               4   2   3   2   3 (one long literal run)
min_run_length 16               4   2   4   2   4
 
The case with R8 interrupting a literal:
                                L24 R8  L24 R8  L24
Old solution                    4   2   4   2   4
min_run_length 24               4   1   3   1   3 (one long literal run)
min_run_length 16               4   1   3   1   3 (one long literal run)
 
The cases for bit_width = 2 are similar.
 
My thinking originally was that the case of a run interrupting a literal would be more common than alternating runs.
But I think you are right that I should only change things where there is always a win.
I will use min_run_length=16 for bit width 1, and that will be the only case where min_run_length != 8.


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@203
PS2, Line 203: the previous run is flushed out
> I think that it would be helpful to update this comment with a brief discus
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@211
PS2, Line 211: mu;tiple
> typo: multiple
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@348
PS2, Line 348: most values
> most MAX_RUN_LENGTH_BUFFER values
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@350
PS2, Line 350: _
> nit: period, here and elsewhere
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@475
PS2, Line 475: // updates num_buffered_values
> This comment does not seam true here.
I think it is true, even when update_indicator_byte = false.
But the comment is bad, should be 
// updates num_buffered_values_
(I am not used to that trailing underscore)


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@549
PS2, Line 549: we
> typo
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 22 Oct 2018 23:24:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Reviewed-on: http://gerrit.cloudera.org:8080/11582
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 583 insertions(+), 256 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 583 insertions(+), 256 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/9
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/944/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Thu, 04 Oct 2018 20:41:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 9: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/8/be/src/util/rle-encoding.h@259
PS8, Line 259: MaxBufferSize
> OK, interesting. I think this is what you mean (see below).
Hmm, shouldn't data_bytes be the same on both branches? I think that we are actually a bit pessimistic in the bit_width > 2 case, because the last "partial" run is calculated as if it was a full MAX_VALUES_PER_LITERAL_RUN run. I do not think that giving a closer estimation is very important though, so I do not want to nitpick about this too much.



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Mon, 05 Nov 2018 21:37:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 4:

(4 comments)

I went through code again and my impression is that MaxBufferSize does not do its job properly.

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: length 16 for single bit values. All other bit widths will use the
            : existing length 8 runs.
            : 
> I agree that this might be a better way, but I am happy with the simple imp
Ok, the biggest win (50% size for alternating runs of 8 in the 1 bit case) is already gained with this change.

Please reflect in the Jira that there is still some potential in improving the RLE encoding.


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@250
PS4, Line 250: 1 + 
Can you simplify this expression? MAX_VALUES_PER_LITERAL_RUN must be divisible by 8 - this could be checked by an assert, and the Ceil + static cast could be removed.


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@264
PS4, Line 264: num_runs
Aren't we double counting the indicator byte here? My assumption is that this num_runs is meant to add the +1 byte/ literal run.


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@265
PS4, Line 265: return std::max(MinBufferSize(bit_width), literal_max_size);
This seems to assume that a single big literal run is the worsts case - can't this underestimate the number of bytes needed due to 
IMPALA-6658?



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 30 Oct 2018 15:55:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1221/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 30 Oct 2018 23:11:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: length 16 for single bit values. All other bit widths will use the
            : existing length 8 runs.
            : 
> Yes, it looks good, but I think that the worst case (alternating L8 R16) wo
OK


http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/4/be/src/util/rle-encoding.h@265
PS4, Line 265: return std::max(MinBufferSize(bit_width), literal_max_size);
> I think that is still not always the worst case - the example you wrote for
You are right, and I have updated MaxBufferSize (and added a test) in next patch



-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Wed, 31 Oct 2018 23:36:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1220/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>
Gerrit-Comment-Date: Tue, 30 Oct 2018 23:05:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6658: improve Parquet RLE for low bit widths

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So a run is encoded
as 2 bytes, but the actual cost of a run of small bit width values in
the middle of a literal is really 3 bytes, as a new indicator is needed
after the run.

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 24 for single bit values, and of length 16 for 2 bit wide values.
All other bit widths will use the existing length 8 runs.

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In Addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
---
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
2 files changed, 507 insertions(+), 255 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/11582/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <th...@cmu.edu>