You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org> on 2017/08/29 21:59:43 UTC

[Impala-ASF-CR] Rough draft for IMPALA-5628

Bikramjeet Vig has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7822

Change subject: Rough draft for IMPALA-5628
......................................................................

Rough draft for IMPALA-5628

Request for general comments on approch.
Many changes pending, will keep updating.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/util/dict-encoding.h
12 files changed, 255 insertions(+), 73 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 5:

(6 comments)

Mostly just cleanup and simplification comments now.

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h@45
PS5, Line 45: INTERNAL_TYPE
I should have mentioned this earlier, but I think we mostly use CamelCase for type names inside templates. We're not totally consistent throughout the codebase but that's the general pattern.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@328
PS5, Line 328: inline int ParquetPlainEncoder::Encode(
It looks like we could also deduplicate the Encode/Decode methods for FIXED_LEN_BYTE_ARRAY with the same technique.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@350
PS5, Line 350: Decode<Decimal4Value,parquet::Type::FIXED_LEN_BYTE_ARRAY>(const uint8_t* buffer,
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: izeof(D
> for var len decimals, its encoded size is bytes required to store size + le
Thanks for the explanation.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@112
PS5, Line 112: //    Decimal4Value: General test case,
I find this a bit confusing still because the values being encoded aren't in the table. Could we take it even further and have something like:

  struct DecimalTestCase {
    int decimal_byte_size;
    int128_t decimal_val;
    int expected_size;
  }

That way all of the inputs to each test would be in the same place.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@200
PS5, Line 200: sizeof(int32_t)
Is there any reason to keep the encoding overhead separate from the encoded size? I would have thought we would just check the total size.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 5
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Oct 2017 00:36:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#4).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 622 insertions(+), 259 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#6).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 619 insertions(+), 273 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#11).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M testdata/data/README
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
19 files changed, 634 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types
......................................................................


Patch Set 3:

> What's the next steps here? Is this blocked on something?

will be submitting a full fledged patch tomorrow that adds support only for Binary (IMPALA-2494) since adding support for all will make this single patch quite huge. 
will add support for others in separate patches after this gets in.


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 3
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Oct 2017 22:48:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 4:

(12 comments)

Did a first pass over it - thought I should flush out comments since you've been waiting quite a while for feedback here.

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277
PS4, Line 1277:         switch (node.element->type) {
This case is only going to get bigger with the follow on patches - it would be best to make it a separate function.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@213
PS4, Line 213:   template <typename LOGICAL_TYPE, parquet::Type::type PHYSICAL_TYPE>
Logical type and physical type I think aren't quite right, since e.g. "StringValue" isn't a logical type. It's really decoding/encoding between Parquet physical types and Impala's internal representation.

Maybe INTERNAL_TYPE and PARQUET_PHYSICAL_TYPE? Or PARQUET_TYPE seems ok since that's what parquet calls it.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376
PS4, Line 376: inline int ParquetPlainEncoder::Decode<Decimal4Value, parquet::Type::BYTE_ARRAY>(
I think you can avoid stamping this out if you leave it still parameterized by the internal Impala type, since the logic is the same in all cases. There may be an opportunity to reduce the redundancy above too, since there are functions above with identical bodies.

I tried this out on a dummy program and it looks like it's possible:

  #include <stdio.h>

  template <typename T, typename S>
  int foo(T* a, S* b) {
    return 0;
  }

  template <>
  int foo(int* a, int* b) { return 1; }

  template <typename T>
  int foo(T* a, double* b) { return 2; }

  int main(int argc, char**argv) {
    int a, b, c;
    double x, y, z;
    printf("%d\n", foo(&a, &b));
    printf("%d\n", foo(&a, &x));
  }


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66
PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type,
let's make it static - we don't want to export the symbol to be linked with code outside this file.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70
PS4, Line 70: (
unnecessary parens


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173
PS4, Line 173:     // TODO: Is this check too strict?
I don't see why this shouldn't match the file metadata, this seems valid to me.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182
PS4, Line 182:  parquet::SchemaElement*
Why not pass by const ref like above?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209
PS4, Line 209:         stringstream ss;
Can you convert to using Substitute() while we're here? We've been very gradually converting to using that for cases like this.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176
PS4, Line 176:   TestType<Decimal4Value, parquet::Type::BYTE_ARRAY>(Decimal4Value(test_val),
It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. Mostly the tests are the same except for the expected size, right? Could we make them table-driven so that we test all the same cases but just have different expected output?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: int32_t
Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above? Or if this is the better way to express it, should we change it above?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212
PS4, Line 212:   for (int i = 1; i <=16; ++i) {
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS4: 
Can you update the README to describe how these files were generated.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Oct 2017 00:52:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#8).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
18 files changed, 629 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h@89
PS2, Line 89:   switch(parquet_type){
> Do you mean to have a Decode wrapper around the templatized Decode methods?
The former, so that the interface of Decode() is simpler. How this is implemented seems more a concern of the decoder than the column stats.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237
PS2, Line 237: ByteSize
> Do you mean to have something like 
I think this particular call here will always return sizeof(int32_t) (line 220). I'd just put that here, since your change explicitly documents that as a template parameter.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338
PS2, Line 338: template <>
> thats kinda difficult because DecimalUtil::DecodeFromFixedLenByteArray is a
I'm not sure I'm following: it looks like the next three methods are exactly the same. Couldn't you move them into a new method DecodeDecimalValue<T>(const uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size, T* v) and then call it and return in one line here? I may be missing something though :)



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Oct 2017 00:33:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Reviewed-on: http://gerrit.cloudera.org:8080/7822
Reviewed-by: Bikramjeet Vig <bi...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M testdata/data/README
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
19 files changed, 634 insertions(+), 283 deletions(-)

Approvals:
  Bikramjeet Vig: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 12
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1437/


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Nov 2017 01:11:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1435/


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Nov 2017 19:03:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 4:

(3 comments)

Few more comments related to previous ones.

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104
PS4, Line 104: PHYSICAL_TYPE
Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in the decoder below?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199
PS4, Line 199: PHYSICAL_TYPE
It doesn't seem like the whole decoding class needs to be templated by this parameter, since it only affects the behaviour of the Reset() function that actually reads the dictionary.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@323
PS4, Line 323: DictDecoder<Decimal16Value, parquet::Type::BYTE_ARRAY>::GetNextValue(
The logic here is also the same as above - we should find a way to avoid the duplication.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 16 Oct 2017 22:13:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 7:

> Looks good to me and it looks like you addressed Dan's comments.
 > Dan, did you want to take another look?

No, I don't plan to look through the details.

How confident are we that we have sufficient test coverage with the added tests? If we're confident then, Tim, please do the +2 (or ask for a second review from another reviewer if you think it's warranted).


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:40:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#7).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
18 files changed, 633 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 9: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1435/


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Nov 2017 23:02:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types
......................................................................


Patch Set 3:

What's the next steps here? Is this blocked on something?


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 3
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Oct 2017 21:58:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 8:

(2 comments)

@Tim, can you take a last look at the changes and carry over the +2 if they are ok?

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int Encode(const InternalType& t, int encoded_byte_size, uint8_t* buffer,
> Oh ok, thanks for the explanation. I think we should also add a comment men
Done


http://gerrit.cloudera.org:8080/#/c/7822/8/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS8: 
> We should also mention how these files were generated in the README
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 04 Nov 2017 00:23:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 11: Code-Review+2

Carrying over +2, rebased, fixed clang errors


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Nov 2017 01:11:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#5).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 568 insertions(+), 256 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 5
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] Rough draft for IMPALA-5628

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change.

Change subject: Rough draft for IMPALA-5628
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 1254:         // where to put that check? refer dcheck added below
> This could possibly go with the other metadata checks, after initializing t
makes sense, will do


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 89:   switch(parquet_type){
> This switch on the parquet type looks like it may fit better in the Parquet
Do you mean to have a Decode wrapper around the templatized Decode methods? or just keep the current single templated Decode and switch on parquet_type inside each specialized Decode?


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

PS2, Line 237: ByteSize
> These calls could now be removed and replaced with the bytesize derived fro
Do you mean to have something like 
- template <parquet::Type::type PARQUET_TYPE> int ParquetPlainEncoder::ByteSize() , and then use overloading to select which type is passed, not sure how much perf impact will there be for overloading here

OR

- template <typename T, parquet::Type::type PARQUET_TYPE> int ParquetPlainEncoder::ByteSize()


Line 338: template <>
> Can this be deduplicated?
thats kinda difficult because DecimalUtil::DecodeFromFixedLenByteArray is also templated and extracting the code before that into a common inlined function is not worth it because we need the 3 variables defined there in our call to DecodeFromFixedLenByteArray and to return, so would have pass 3 pointers to that method (and probably call it init or something like that) which might make the code ugly


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h@45
PS5, Line 45: INTERNAL_TYPE
> I should have mentioned this earlier, but I think we mostly use CamelCase f
Done


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@235
PS5, Line 235: template <>
             : inline int ParquetPlainEncoder::ByteSize(const Decimal4Value&) {
             :   DCHECK(false);
             :   return -1;
             : }
             : template <>
             : inline int ParquetPlainEncoder::ByteSize(const Decimal8Value&) {
             :   DCHECK(false);
             :   return -1;
             : }
             : template <>
             : inline int ParquetPlainEncoder::ByteSize(const Decimal16Value&) {
             :   DCHECK(false);
             :   return -1;
             : }
fixed code duplication here too.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@284
PS5, Line 284: template <>
             : inline int ParquetPlainEncoder::Encode(
             :     const int8_t& v, int fixed_len_size, uint8_t* buffer) {
             :   int32_t val = v;
             :   memcpy(buffer, &val, sizeof(int32_t));
             :   return ByteSize(v);
             : }
             : 
             : template <>
             : inline int ParquetPlainEncoder::Encode(
             :     const int16_t& v, int fixed_len_size, uint8_t* buffer) {
             :   int32_t val = v;
             :   memcpy(buffer, &val, sizeof(int32_t));
             :   return ByteSize(v);
             : }
Also, duplicated methods for both int8 and int16, since both are written out as int32


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@328
PS5, Line 328: inline int ParquetPlainEncoder::Encode(
> It looks like we could also deduplicate the Encode/Decode methods for FIXED
Done


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@350
PS5, Line 350: Decode<Decimal4Value,parquet::Type::FIXED_LEN_BYTE_ARRAY>(const uint8_t* buffer,
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@112
PS5, Line 112: //    Decimal4Value: General test case,
> I find this a bit confusing still because the values being encoded aren't i
As discussed, moving these test cases back to its original form


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@200
PS5, Line 200: sizeof(int32_t)
> Is there any reason to keep the encoding overhead separate from the encoded
see comment above.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 5
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Oct 2017 00:29:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#9).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M testdata/data/README
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
19 files changed, 639 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 7: Code-Review+1

Looks good to me and it looks like you addressed Dan's comments. Dan, did you want to take another look?


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:12:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 6: Code-Review+1

(3 comments)

LGTM modulo some minor things. Will let Lars take another look.

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@333
PS6, Line 333: inline int EncodeDecimal(const T& v, int fixed_len_size, uint8_t* buffer){
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@34
PS6, Line 34: int
int32_t for consistency?


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@79
PS6, Line 79: LOGICAL_TYPE
LogicalType here and below



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Oct 2017 22:53:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has uploaded a new patch set (#3).

Change subject: IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types
......................................................................

IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types

Added support for reading Parquet decimal Types encoded in :
1) Int32
2) Int64
3) Binary (Variable size byte array)

A few points to look at:

1) the test table I generated to read INT64 encoded decimals shows null
values for the whole row if a decimal column exists with precision less
than or eqaul to 9 and encoded in Int64. If I alter table and change it
to 10 or above, it displays the data correctly. One can infer that there
is some issue with impala doing some internal conversion after reading
it from the parquet reader, since a precision less than 10 should be
stored using a Decimal4 (stores value in int32) but it gets a Decimal8
from the parquet reader. Any idea on what might be going wrong?

2) Currently the test for ParquetPlainEncoder contains templatized test
methods that encode and decode a passed value. Since we had encode and
decode methods for all types that we supported in both writer and reader,
this was fine. now to support the types added in this commit, where
only the reader supports it (hence only decode method exists),
I would either have to write a completely new test method or specialize
the existing test method for every permutation of <Decimal, Parquet::type>.
In any case a lot of new code will be added, which option do you think is better?

3) Waiting on confirmation that column stats should ne stored with the
exact same format (same <Converted Type, Primitive Type> as the column).
If we do get a confirmation that this is the correct. I would have to pass
a SchemaElement object of that column to ColumnStatsBase::ReadFromThrift
for handeling the case where the type is <Decimal, FIXED_LEN_BYTE_ARRAY>.
We would be able to use the type_length attribute to use the correctly
sized DecimalXValue in DecodePlainValue for reading the stat.

4) Will perf regression test be good enough or do you think we need
to do perf tests for the newly added types too?

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 550 insertions(+), 244 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 4:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277
PS4, Line 1277:         switch (node.element->type) {
> This case is only going to get bigger with the follow on patches - it would
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376
PS4, Line 376: inline int ParquetPlainEncoder::Decode<Decimal4Value, parquet::Type::BYTE_ARRAY>(
> I think you can avoid stamping this out if you leave it still parameterized
Unfortunately, the call to Decode is explicit in our code, eg Decode<T,S> therefore it will always look for a specialization of those template parameters.
Similarly, in the dummy program you wrote, if I explicitly call foo<int, double>(&a, &x), this will look for a specialization of the double templated  method and return 0;

Since partial specialization is not possible, the only way to reduce redundant code will be to do what Lars suggested earlier, i.e., create a helper method that is only templatized on <typename T> and call that for each full specialization.
I will make changes accordingly in the next iteration of this patch.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237
PS2, Line 237: 
> I think this particular call here will always return sizeof(int32_t) (line 
should we do that everywhere else in this file wherever we specialize a method(both Decode and encode)?


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338
PS2, Line 338:   return fixed_len_size;
> I'm not sure I'm following: it looks like the next three methods are exactl
Oh I see what you mean now. Done!


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66
PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type,
> let's make it static - we don't want to export the symbol to be linked with
Done. enclosed in an anonymous namespace.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70
PS4, Line 70: (
> unnecessary parens
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173
PS4, Line 173:     // TODO: Is this check too strict?
> I don't see why this shouldn't match the file metadata, this seems valid to
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182
PS4, Line 182:  parquet::SchemaElement*
> Why not pass by const ref like above?
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209
PS4, Line 209:         stringstream ss;
> Can you convert to using Substitute() while we're here? We've been very gra
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176
PS4, Line 176:   TestType<Decimal4Value, parquet::Type::BYTE_ARRAY>(Decimal4Value(test_val),
> It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. 
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: int32_t
> Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above
for var len decimals, its encoded size is bytes required to store size + least num of bytes required to store num.

in this case bytes required to store numeric_limits<int32_t>::max() is sizeof(int32_t) which is less than the sizeof(Decimal8Value) since Decimal8Value is stored using int64

I will add a comment at the top of these byte_array test to clarify what expected size is. would you recommend I add more comments before limit tests?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212
PS4, Line 212:   for (int i = 1; i <=16; ++i) {
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104
PS4, Line 104: PHYSICAL_TYPE
> Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in
reverted to 'T' to minimize change from previous code


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199
PS4, Line 199: PHYSICAL_TYPE
> It doesn't seem like the whole decoding class needs to be templated by this
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@323
PS4, Line 323: DictDecoder<Decimal16Value, parquet::Type::BYTE_ARRAY>::GetNextValue(
> The logic here is also the same as above - we should find a way to avoid th
removed, as it is no longer needed since the template param is removed and moved to Reset() only


http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS4: 
> Can you update the README to describe how these files were generated.
this file was generated a long while back and I picked it up from an abandoned patch here https://gerrit.cloudera.org/#/c/5115/ 

Currently looking for a better way to generate this file using the java parquet client.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Oct 2017 20:34:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 9: Code-Review+2

LGTM. Probably best to wait until after the weekend to merge it though :).


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 04 Nov 2017 00:29:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc@207
PS6, Line 207: InternalType, parquet::Type::type PARQUET_TYPE
> you should explain these template parameters
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@40
PS6, Line 40: IMPALA_TO_PARQUET_TYPES
I think it makes sense to rename this to INTERNAL_TO_PARQUET_TYPES to be consistent with our naming of template parameters.


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@88
PS6, Line 88: InternalType
> let's explain what that is
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@177
PS6, Line 177:   static int DecodeByParquetType(const uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@333
PS6, Line 333: inline int EncodeDecimal(const T& v, int fixed_len_size, uint8_t* buffer){
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@34
PS6, Line 34: int
> int32_t for consistency?
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@79
PS6, Line 79: LOGICAL_TYPE
> LogicalType here and below
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 00:32:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 11: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Nov 2017 04:34:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc@207
PS6, Line 207: InternalType, parquet::Type::type PARQUET_TYPE
you should explain these template parameters


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@88
PS6, Line 88: InternalType
let's explain what that is


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@177
PS6, Line 177:   static int DecodeByParquetType(const uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size,
nit: long line



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Oct 2017 21:42:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391
PS7, Line 391: fixed_len_size
> Looked again. The variable name (and recycling the argument storage) is con
Done


http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int fixed_len_size, uint8_t* buffer){
> I took another look at the standard and it says that the minimum number of 
This test suite basically takes a test value and its expected size and first encodes it then decodes it, in order to make minimal changes to the test suite I reused the  function signature and passed the "minimum number of bytes required to store the unscaled value" as the expected size which is passed to the encode methods names as  "fixed_len_size". As per you suggestion in a previous comment, I believe this will be more clear if i change the name to encoded_byte_size.

Also, I gave an explanation of  what "fixed_len_size" for decimals stored as BYTE_ARRAY as a comment at line 46. I believe it'll be better to move that comment right above this method instead.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:21:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 7:

(2 comments)

I think we'll need to do some more work on testing for the int32/64 patches - we don't have pre-existing test files from parquet-mr. I think we'll have to generate some more test files with parquet-mr for the other cases, and we could consider turning that code into a data generator to generate more test files. From what I could tell Hive doesn't have a knob to generate some of these alternative output encodings.

 I feel ok with the coverage since we have end-to-end tests then more exhaustive unit tests for the various ways of encoding it.

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391
PS7, Line 391: fixed_len_size
Looked again. The variable name (and recycling the argument storage) is confusing. Maybe 'encoded_byte_size'?


http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int fixed_len_size, uint8_t* buffer){
I took another look at the standard and it says that the minimum number of bytes required to store the unscaled value should be used: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

I think this means that we should not be including any preceding "0" bytes. I.e. we should not have a fixed_len_size argument and instead determine the size based on the number of leading zero bytes in the value.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:53:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 8: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int Encode(const InternalType& t, int encoded_byte_size, uint8_t* buffer,
> This test suite basically takes a test value and its expected size and firs
Oh ok, thanks for the explanation. I think we should also add a comment mentioning that this doesn't exactly implement the Parquet spec, since the parquet spec would require that it remove leading zero bytes.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:29:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7822/8/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS8: 
We should also mention how these files were generated in the README



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@apache.org>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:30:35 +0000
Gerrit-HasComments: Yes