You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2019/01/07 18:37:57 UTC

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12172


Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................

IMPALA-7979: Enhance decoders to support value-skipping

This commit adds value-skipping functionality to the decoders.
Value-skipping is useful when we know that we won't need the
following N values, so instead of decoding and dropping them, we
can just "jump" through them.

This feature is a prerequisite for Parquet page skipping (IMPALA-5843).

I added backend tests for all the decoders.

Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
---
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-bool-decoder-test.cc
M be/src/exec/parquet/parquet-bool-decoder.cc
M be/src/exec/parquet/parquet-bool-decoder.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
10 files changed, 515 insertions(+), 16 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/72/12172/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc
File be/src/exec/parquet/parquet-bool-decoder-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc@75
PS1, Line 75: 100 falses, 100 trues, 100 falses
> Changed to write 300 values.
Sorry, I am still not 100% ok with the testdata, as it will only use repeated runs in the RLE case. A run could be added that adds alternating data (i % 2 == 0).


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/exec/parquet/parquet-bool-decoder-test.cc
File be/src/exec/parquet/parquet-bool-decoder-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/exec/parquet/parquet-bool-decoder-test.cc@73
PS3, Line 73: TestPlainSkip
The name seems a bit misleading, as the test checks both plain and RLE encoded booleans.


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/rle-test.cc@301
PS3, Line 301:       ValidateRleSkip(seq, bit_width, min_run_length, 0, 7);
optional: these random looking numbers suggests to me that it would make sense to actually run random tests. These may bump to edge cases that the current test avoid at the moment. A few fix tests could be kept to guarantee code coverage.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Jan 2019 16:40:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 2:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc
File be/src/exec/parquet/parquet-bool-decoder-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc@75
PS1, Line 75: 100 falses, 100 trues, 100 falses
> I think that the testdata is a bit too short, because  ParquetBoolDecoder u
Changed to write 300 values.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc@84
PS1, Line 84: 0, 8
> Is there a reason for always using %8 lengths?
No particular reason, adjusted test cases to the new count of values, added test cases with numbers.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.h
File be/src/exec/parquet/parquet-bool-decoder.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.h@44
PS1, Line 44:   ///TODO: add e2e tests when page filtering is implemented (IMPALA-5843).
> Maybe include a JIRA in the TODO?
Done


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc
File be/src/exec/parquet/parquet-bool-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@78
PS1, Line 78: =
> doesn't really matter, but "-="?
Done


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@83
PS1, Line 83:       if (UNLIKELY(num_unpacked_values_ < num_remaining)) return false;
> I think that RLE and PLAIN are different in this regard, because in case of
Added an if with UNLIKELY macro.
I don't expect SkipValues() to be called too frequently, so the runtime overhead should be negligible.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@89
PS1, Line 89:     // This can result in sub-optimal decoding later, because 'literal_buffer_' might
> I'd find it a bit more readable if this returned from within the PLAIN bran
Done


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h@137
PS1, Line 137:   /// 'num_values_to_skip * bit_width' is either divisible by 8, or
             :   /// 'num_values_to_skip' equals to the count of the remaining bit-packed values.
> Can you add DCHECKs for these assumptions in the implementation?
It's not necessarily divisible by 8, so I cannot add a DCHECK for that alone.
It might equals to the count of the remaining bit-packed values. But BatchedBitReader doesn't have the information about how many values are stored in its byte buffer. It only knows the size of its byte buffer.

E.g. when storing bools (bit-width 1), and the buffer size is 1 byte, it can store any number of bools between 1 and 8. Therefore 'num_values_to_skip' can be any number between 1 and 8.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h@139
PS1, Line 139:   bool SkipBatch(int bit_width, int num_values_to_skip);
> Maybe this should return a value to determine whether the skip was successf
Changed it to return a bool. Made one of the DCHECKs to return false when we want to skip too many bytes.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h
File be/src/util/bit-stream-utils.inline.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h@107
PS1, Line 107:   DCHECK_GT(bit_width, 0);
> I see that the same DCHECK is used in UnpackAndDecodeBatch, but I think tha
Done


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h@109
PS1, Line 109:   DCHECK_GT(num_values_to_skip, 0);
> I would not allow num_values_to_skip == 0.
Done


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h@112
PS1, Line 112:   if (skip_bytes > buffer_end_ - buffer_pos_) return false;
> I think this DCHECK could be hit on a corrupt file, e.g. where there was le
Converted it to an 'if'.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h@557
PS1, Line 557: um_remaining,
> Is there a reason for using int64_t for 'num_values'? DictDecoder seems to 
When we want to decode values into a buffer, we are limited by memory for how many values we can decode.

But there is no limitation for skipping. Probably we won't need to skip so many values, but in theory I think it's possible with RLE-encoded columns.

But honestly I didn't put too many thoughts into it.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h@557
PS1, Line 557: um_remaining,
> This is an int64_t. Maybe we should just use int64_t for all the counts in 
Done. Changed it other places in this function.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h@567
PS1, Line 567: 
> Could remove level of nesting here.
Restructured the code.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-encoding.h@689
PS1, Line 689:   int32_t num_remaining = num_literals_to_skip;
> I think we could avoid having num_skipped if we just maintain num_remaining
Done


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-test.cc@99
PS1, Line 99: nt cnt = 0;
> I would prefer BatchedBitReader tests to be moved  to a separate test file,
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 11 Jan 2019 17:32:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1814/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Jan 2019 17:30:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12172/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/2/be/src/util/rle-encoding.h@702
PS2, Line 702:     num_remaining -= num_literals_to_skip - num_to_skip;
Shouldn't this just be num_remaining -= num_to_skip? Do we have a test?



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 11 Jan 2019 21:27:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 16:31:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12172

to look at the new patch set (#4).

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................

IMPALA-7979: Enhance decoders to support value-skipping

This commit adds value-skipping functionality to the decoders.
Value-skipping is useful when we know that we won't need the
following N values, so instead of decoding and dropping them, we
can just "jump" through them.

This feature is a prerequisite for Parquet page skipping (IMPALA-5843).

I added backend tests for all the decoders. Backed tests related to
bitpacking are moved to the newly created bit-stream-utils-test.

Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
---
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-bool-decoder-test.cc
M be/src/exec/parquet/parquet-bool-decoder.cc
M be/src/exec/parquet/parquet-bool-decoder.h
M be/src/util/CMakeLists.txt
A be/src/util/bit-stream-utils-test.cc
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
12 files changed, 735 insertions(+), 130 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/72/12172/4
-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1726/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Jan 2019 19:13:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12172/4/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/4/be/src/util/rle-test.cc@191
PS4, Line 191: esult 
> Is it really useful to print the values? At this point the same length + bi
Yeah, you are right, now it prints values.size() and seed as well.


http://gerrit.cloudera.org:8080/#/c/12172/4/be/src/util/rle-test.cc@275
PS4, Line 275: 
> It would make sense to do this randomly too, but I am leaning towards creat
I like this idea, and it is very important that the decoders work correctly, so I implemented this approach.

Benchmarking would be interesting, and we also talked about it offline, the conclusion was that it would be a good warm-up task for a newcomer.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Jan 2019 16:59:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc
File be/src/exec/parquet/parquet-bool-decoder-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc@75
PS1, Line 75: 00000000 11111111 11111111 00000000
I think that the testdata is a bit too short, because  ParquetBoolDecoder uses a 128 bit buffer, which will never have to be "refilled" during the test.

Btw thanks for creating this test, it is great if a decoder gets unit tests beside the EE tests.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc@84
PS1, Line 84: 0, 8
Is there a reason for always using %8 lengths?


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc
File be/src/exec/parquet/parquet-bool-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@83
PS1, Line 83:       unpacked_value_idx_ = num_remaining;
> This doesn't detect a decode error if UnpackBatch() returns < num_remaining
I think that RLE and PLAIN are different in this regard, because in case of PLAIN/BIT_PACKED, the size of the buffer can be computed from the number of expected values, so we can exclude the possibility of unsuccessful decoding earlier. I would generally replace such checks with DCHECKs to avoid runtime overhead.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h@137
PS1, Line 137:   /// 'num_values_to_skip * bit_width' is either divisible by 8, or
             :   /// 'num_values_to_skip' equals to the count of the remaining bit-packed values.
Can you add DCHECKs for these assumptions in the implementation?


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h
File be/src/util/bit-stream-utils.inline.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h@107
PS1, Line 107:   DCHECK_GE(bit_width, 0);
I see that the same DCHECK is used in UnpackAndDecodeBatch, but I think that bit_width==0 does not make sense so this should be DCHECK_GT.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h@109
PS1, Line 109:   DCHECK_GE(num_values_to_skip, 0);
I would not allow num_values_to_skip == 0.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h@557
PS1, Line 557: num_remaining
> This is an int64_t. Maybe we should just use int64_t for all the counts in 
Is there a reason for using int64_t for 'num_values'? DictDecoder seems to always use int for the number of elements.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-test.cc@99
PS1, Line 99: TestSkipping
I would prefer BatchedBitReader tests to be moved  to a separate test file, as these tests are not RLE related.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 10 Jan 2019 15:38:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1835/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Jan 2019 17:41:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc
File be/src/util/bit-stream-utils-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc@17
PS3, Line 17: 
optional: I am unsure whether we should do it in this change, but it would make sense to move bit-stream-utils / rle-encoding / dictionary-encoding files to exec/parquet. This code is only used during Parquet decoding (+test and benchmarks), so I would logically group them with Parquet stuff.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Jan 2019 13:55:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 6: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 12:21:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc
File be/src/util/bit-stream-utils-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc@91
PS3, Line 91: void TestSkipping(const uint8_t* buffer, const int len, const int bit_width,
Might be worth adding a brief comment explaining the testing strategy, e.g. "Tests SkipBatch() by comparing the results of a reader with skipping against a reader that doesn't skip"


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/rle-test.cc@301
PS3, Line 301:       ValidateRleSkip(seq, bit_width, min_run_length, 0, 7);
> optional: these random looking numbers suggests to me that it would make se
Yeah, I like randomised tests. I don't think this code is so complex that it's an absolute must-have but it seems worth the effort.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Jan 2019 21:05:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12172/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/2/be/src/util/rle-encoding.h@702
PS2, Line 702:     num_remaining -= num_literals_to_skip - num_to_skip;
> Shouldn't this just be num_remaining -= num_to_skip? Do we have a test?
Oops, turned out my test arrays were too small, so we never exercised this path.

After this incident I measured code coverage with Gcov and added new tests. Now all lines should be covered by the tests.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 14 Jan 2019 14:47:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12172

to look at the new patch set (#2).

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................

IMPALA-7979: Enhance decoders to support value-skipping

This commit adds value-skipping functionality to the decoders.
Value-skipping is useful when we know that we won't need the
following N values, so instead of decoding and dropping them, we
can just "jump" through them.

This feature is a prerequisite for Parquet page skipping (IMPALA-5843).

I added backend tests for all the decoders. Backed tests related to
bitpacking are moved to the newly created bit-stream-utils-test.

Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
---
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-bool-decoder-test.cc
M be/src/exec/parquet/parquet-bool-decoder.cc
M be/src/exec/parquet/parquet-bool-decoder.h
M be/src/util/CMakeLists.txt
A be/src/util/bit-stream-utils-test.cc
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
12 files changed, 662 insertions(+), 130 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/72/12172/2
-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12172

to look at the new patch set (#5).

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................

IMPALA-7979: Enhance decoders to support value-skipping

This commit adds value-skipping functionality to the decoders.
Value-skipping is useful when we know that we won't need the
following N values, so instead of decoding and dropping them, we
can just "jump" through them.

This feature is a prerequisite for Parquet page skipping (IMPALA-5843).

I added backend tests for all the decoders. Backed tests related to
bitpacking are moved to the newly created bit-stream-utils-test.

Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
---
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-bool-decoder-test.cc
M be/src/exec/parquet/parquet-bool-decoder.cc
M be/src/exec/parquet/parquet-bool-decoder.h
M be/src/util/CMakeLists.txt
A be/src/util/bit-stream-utils-test.cc
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
12 files changed, 765 insertions(+), 130 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/72/12172/5
-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 4: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12172/4/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/4/be/src/util/rle-test.cc@191
PS4, Line 191: values
Is it really useful to print the values? At this point the same length + bit_width will lead to the same values, so printing values.size() would be enough to reproduce the issue. If the values will also become random, then a seed value could be printed.


http://gerrit.cloudera.org:8080/#/c/12172/4/be/src/util/rle-test.cc@275
PS4, Line 275: MakeSequenceBitWidth
It would make sense to do this randomly too, but I am leaning towards creating a follow up Jira, which would also deal with adding benchmarks for value skipping to benchmarks/rle-benchmark.cc

I do not want to hold up the page index efforts with this, but I am curios about the performance of skipping vs materializing for different selectivity ratios.

About a possible implementation:
A seed value could be added as parameter, and runs could be added with random [1, max_run_length] length until total length is reached. A similar implementation already exists in benchmarks/rle-benchmark.cc FillWithRle().



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Jan 2019 10:56:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12172

to look at the new patch set (#3).

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................

IMPALA-7979: Enhance decoders to support value-skipping

This commit adds value-skipping functionality to the decoders.
Value-skipping is useful when we know that we won't need the
following N values, so instead of decoding and dropping them, we
can just "jump" through them.

This feature is a prerequisite for Parquet page skipping (IMPALA-5843).

I added backend tests for all the decoders. Backed tests related to
bitpacking are moved to the newly created bit-stream-utils-test.

Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
---
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-bool-decoder-test.cc
M be/src/exec/parquet/parquet-bool-decoder.cc
M be/src/exec/parquet/parquet-bool-decoder.h
M be/src/util/CMakeLists.txt
A be/src/util/bit-stream-utils-test.cc
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
12 files changed, 681 insertions(+), 130 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/72/12172/3
-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc
File be/src/exec/parquet/parquet-bool-decoder-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder-test.cc@75
PS1, Line 75: 100 falses, 100 trues, 100 alternat
> Sorry, I am still not 100% ok with the testdata, as it will only use repeat
Added 100 alternating values


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/exec/parquet/parquet-bool-decoder-test.cc
File be/src/exec/parquet/parquet-bool-decoder-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/exec/parquet/parquet-bool-decoder-test.cc@73
PS3, Line 73: TestDecodeAnd
> The name seems a bit misleading, as the test checks both plain and RLE enco
Renamed this test


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc
File be/src/util/bit-stream-utils-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc@17
PS3, Line 17: 
> optional: I am unsure whether we should do it in this change, but it would 
For now I'll it here. I don't know if we could re-use some of these stuff for ORC or some other formats, probably not.


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/bit-stream-utils-test.cc@91
PS3, Line 91: // against a reader that doesn't skip.
> Might be worth adding a brief comment explaining the testing strategy, e.g.
Done


http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

http://gerrit.cloudera.org:8080/#/c/12172/3/be/src/util/rle-test.cc@301
PS3, Line 301:   vector<int> values{1, 0};
> Yeah, I like randomised tests. I don't think this code is so complex that i
Yeah, I was also thinking about some fuzz testing, but wasn't sure how to output the failed test cases in order to let users reproduce the issue for debugging.

I extended the string 'description' in ValidateRleSkip() to output all of its parameters, including the values as well.

Other approach would be to not use a random seed, so failures would be deterministic and reproducible.

Anyway, I kept this test case also.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Jan 2019 16:49:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3662/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 12:21:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1786/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 14 Jan 2019 15:16:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 21:08:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 1:

(9 comments)

Thanks for pulling this out, it was helpful to be able to look at some of the subtleties here.

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.h
File be/src/exec/parquet/parquet-bool-decoder.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.h@44
PS1, Line 44:   ///TODO: add e2e tests when page filtering is implemented.
Maybe include a JIRA in the TODO?


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc
File be/src/exec/parquet/parquet-bool-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@78
PS1, Line 78:  
doesn't really matter, but "-="?


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@83
PS1, Line 83:       unpacked_value_idx_ = num_remaining;
This doesn't detect a decode error if UnpackBatch() returns < num_remaining, right? It looks like on the RLE branch below we do detect such errors, so we should be consistent (I don't generally think that these decoders need to be strictly validating, but we should be consistent where possible).


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/exec/parquet/parquet-bool-decoder.cc@89
PS1, Line 89:   return true;
I'd find it a bit more readable if this returned from within the PLAIN branch.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.h@139
PS1, Line 139:   void SkipBatch(int bit_width, int num_values_to_skip);
Maybe this should return a value to determine whether the skip was successful.

I guess it's not strictly necessary if we're willing to be loose about validation since we don't actually read the values, but it's worth documenting if we decide to go down that path.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h
File be/src/util/bit-stream-utils.inline.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/bit-stream-utils.inline.h@112
PS1, Line 112:   DCHECK_LE(skip_bytes, buffer_end_ - buffer_pos_);
I think this DCHECK could be hit on a corrupt file, e.g. where there was less packed data than the reported number of values.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h@557
PS1, Line 557: num_remaining
This is an int64_t. Maybe we should just use int64_t for all the counts in this function

(There's some preexisting issues in other functions like this but best to avoid them here).


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/dict-encoding.h@567
PS1, Line 567: else {
Could remove level of nesting here.


http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12172/1/be/src/util/rle-encoding.h@689
PS1, Line 689:   int32_t num_skipped = 0;
I think we could avoid having num_skipped if we just maintain num_remaining instaead.

In the above function num_consumed was a bit more useful since it helped calculate the output position.



-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Jan 2019 17:48:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1774/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 11 Jan 2019 18:08:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................


Patch Set 5: Code-Review+1

Thanks for the further randomization!

I am waiting for Tim to give a +2.


-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Jan 2019 17:40:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7979: Enhance decoders to support value-skipping

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12172 )

Change subject: IMPALA-7979: Enhance decoders to support value-skipping
......................................................................

IMPALA-7979: Enhance decoders to support value-skipping

This commit adds value-skipping functionality to the decoders.
Value-skipping is useful when we know that we won't need the
following N values, so instead of decoding and dropping them, we
can just "jump" through them.

This feature is a prerequisite for Parquet page skipping (IMPALA-5843).

I added backend tests for all the decoders. Backed tests related to
bitpacking are moved to the newly created bit-stream-utils-test.

Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Reviewed-on: http://gerrit.cloudera.org:8080/12172
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-bool-decoder-test.cc
M be/src/exec/parquet/parquet-bool-decoder.cc
M be/src/exec/parquet/parquet-bool-decoder.h
M be/src/util/CMakeLists.txt
A be/src/util/bit-stream-utils-test.cc
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
12 files changed, 765 insertions(+), 130 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12172
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib848f1bd71735fe84e8064daf700417b32589f57
Gerrit-Change-Number: 12172
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>