You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2018/07/20 11:54:43 UTC

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11000


Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.

The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.

Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64

I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.

Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
  and updated test_scanners.py

Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M testdata/data/README
A testdata/data/decimal_stored_as_int32.parquet
A testdata/data/decimal_stored_as_int64.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
9 files changed, 82 insertions(+), 12 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/11000/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 2: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Jul 2018 16:01:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return sizeof(InternalType); }
> I'm ok with leaving, but maybe just add an additional comment like "Used in
Added this sentence to the comment of this function.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@196
PS2, Line 196:     // We require that the scale and byte length be set.
> Maybe leave a comment explaining that we don't support automatically conver
I think it is more like IMPALA-2515, since it is more about byte size than precision mismatch between the Parquet file metadata and the table metadata in the metastore.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@199
PS2, Line 199: ile '$0' column
> this might be a bit confusing since the term "precision" when used in the c
Changed the comment.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@203
PS2, Line 203: rquetPlainEncoder::DecimalSize(
> can you add check for schema_element.type_length, since this section of the
Based on https://github.com/apache/parquet-format/blob/518e206c3e6586b76e8315d5f62a8666ed62fa90/src/main/thrift/parquet.thrift#L347 the type_length has different meaning for FIXED_LEN_BYTE_ARRAY and for the other types.

For other types it doesn't even need to be specified.

The check at L216 is a bit different. It takes the precision from the metastore (slot_desc()->type().precision), and calculates the minimum byte size (expected_len) to store that precision.

Impala is quite strict in the sense that it only allows properly sized decimals. IMPALA-2515 is the JIRA to loose this restriction.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@234
PS2, Line 234: 
             :     if (!is_converted_type_decimal) {
             :       // TODO: is this validation useful? It is not required at all to read the data and
             :       // might only serve to reject otherwise perfectly readable files.
             :       ErrorMsg msg(TErrorCode::PARQUET_BAD_CONVERTED_TYPE, filename,
             :           schema_element.name);
             :       RETURN_IF_ERROR(state->LogOrReturnError(msg));
             :     }
             :   } else if (schema_element.__isset.scale || schema_element.__isset.precision
             :       || is_converted_type_decimal) {
             :     ErrorMsg msg(TErrorCode::PARQUET_INCOMPATIBLE_DECIMAL, filename, schema_element.name,
             :      
> can you also add similar checks for precision as well.
This of the code part checks if there is a precision mismatch between Parquet file metadata and table metadata regardless of the underlying physical type.

My checks up there (e.g. sizeof(int32_t) != slot_desc->type().GetByteSize()) also checks if the precision fits into Impala's internal type (int32_t or int64_t). In other words, if Parquet physical type is INT32, then the decimal with the given precision must fit into an int32_t.

If the precision doesn't fit it returns an error just like in L217.

Am I missing something? Exactly what checks do you think of?

RETURN_IF_ERROR(state->LogOrReturnError(msg)) only returns on cancellation, non-recoverable errors, and when abort_on_error() is true. Do we want that in all cases? I think having less bytes than what the precision needs should return an error status from ValidateColumn().


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc@38
PS2, Line 38: Handle special case of encoding decimal types stored as BYTE_ARRAY
> update comment
IN32 and INT64 are now also mentioned.



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 14:49:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 6: Code-Review+2

(1 comment)

Carrying +2

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py@303
PS4, Line 303:  
> I've ran it locally, but it showed no comments. Maybe 'IMPALA-7317: loosen 
Yes, IMPALA-7317 loosened it. Anyway I adjusted the indentation to make flake8 happy even with E128.



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Aug 2018 17:05:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@293
PS1, Line 293:   def _create_table_from_file(self, table_name, unique_database):
> It would be nice to use this for other tests (no requirement, just an obser
I have created a related ticket in the past:
https://issues.apache.org/jira/browse/IMPALA-6709

I agree that it would be nice to change other tests to use this function. I would prefer to do this in a different commit.


http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@298
PS1, Line 298:     hdfs_file = '/test-warehouse/%s' % filename
             :     check_call(['hdfs', 'dfs', '-copyFromLocal', '-f', local_file, hdfs_file])
I am a bit worried about using /test-warehouse/. Your file names are unlikely to be used elsewhere, but someone may create a test with a more common file name, which could lead to side-effects. My suggestion is to put these files to the directory of the unique database: "/test-warehouse/{0}.db/{1}".format(unique_database, filename)

Another option would be to load these files to hdfs during data load (see create-load-data.sh: load-custom-data). This could make the tests faster, but would make developing new tests less convenient.


http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@302
PS1, Line 302:     self.client.execute('create table %s like parquet "%s" stored as parquet' %
+1 for doing this with CREATE TABLE LIKE PARQUET, because it is easy to forget adding tests for it



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Jul 2018 13:31:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/115/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Jul 2018 16:07:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 2: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return sizeof(InternalType); }
> Yeah, I was also a bit confused about it.
I'm ok with leaving, but maybe just add an additional comment like "Used in some template function implementations to determine the encoded byte size for fixed-length types."


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@196
PS2, Line 196:     if (schema_element.type == parquet::Type::INT32 &&
Maybe leave a comment explaining that we don't support automatically converting to the wider type (I think IMPALA-7087 is the JIRA tracking this).


http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@293
PS1, Line 293:   def _create_table_from_file(self, table_name, unique_database):
> I'll do this refactor in an other commit.
wfm



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Jul 2018 16:24:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py@303
PS4, Line 303: (
flake8: E128 continuation line under-indented for visual indent


http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py@305
PS4, Line 305: (
flake8: E128 continuation line under-indented for visual indent



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 17:24:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 4: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@203
PS2, Line 203:  == parquet::Type::INT64 &&
> Based on https://github.com/apache/parquet-format/blob/518e206c3e6586b76e83
I see. Thanks for the explanation.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@234
PS2, Line 234: 
             :     // The other decimal metadata should be there but we don't need it.
             :     if (!schema_element.__isset.precision) {
             :       ErrorMsg msg(TErrorCode::PARQUET_MISSING_PRECISION, filename, schema_element.name);
             :       RETURN_IF_ERROR(state->LogOrReturnError(msg));
             :     } else {
             :       if (schema_element.precision != slot_desc->type().precision) {
             :         // TODO: we could allow a mismatch and do a conversion at this step.
             :         ErrorMsg msg(TErrorCode::PARQUET_WRONG_PRECISION, filename, schema_element.name,
             :             schema_element.precision, slot_desc->type().precision);
             :         RETURN_IF_ERROR(state->LogOrReturnError(msg));
             :      
> This of the code part checks if there is a precision mismatch between Parqu
Sounds good, Thanks for the explanation.



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 16:51:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py@303
PS4, Line 303: (
> We're still having a discussion about these on the mailing list, feel free 
I've ran it locally, but it showed no comments. Maybe 'IMPALA-7317: loosen flake8 rules' eliminated these warnings.



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Aug 2018 13:20:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Bikramjeet Vig, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11000

to look at the new patch set (#3).

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.

The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.

Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64

I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.

Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
  and updated test_scanners.py

Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M testdata/data/README
A testdata/data/decimal_stored_as_int32.parquet
A testdata/data/decimal_stored_as_int64.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
9 files changed, 107 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/11000/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2903/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Aug 2018 17:06:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.

The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.

Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64

I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.

Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
  and updated test_scanners.py

Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Reviewed-on: http://gerrit.cloudera.org:8080/11000
Reviewed-by: Zoltan Borok-Nagy <bo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M testdata/data/README
A testdata/data/decimal_stored_as_int32.parquet
A testdata/data/decimal_stored_as_int64.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
9 files changed, 109 insertions(+), 15 deletions(-)

Approvals:
  Zoltan Borok-Nagy: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc@1536
PS1, Line 1536: DCHECK_EQ(sizeof(Decimal4Value::StorageType), slot_desc->type().GetByteSize());
i dont remember where the file level column metadata validation occurs in code, but can you add this check there in case the metadata is corrupted and/or does not match the hms column metadata


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return sizeof(InternalType); }
> I'm still confused about what this function does. Can we rename to somethin
This methods represents the encoded byte size if the internalType is written to parquet. I guess the confusion lies where in there is special casing for when this is used for encoding/decoding of fixed/variable len types.
From what i understand this used to get the encoded size of an internalType only for fixed len types, except for String type.
-------------------------------
returning -1 makes no sense and it only does that for cases which are impossible, hence the DCHECK(false) associated with them.
--------------------------------
The part in HdfsParquetTableWriter is correct since a plain_encoded_value_size_ < 0 means that its a variable len encoded type and since impala supports only string type to be written as a variable len encoded type, it calls on ParquetPlainEncoder::ByteSize to get it encode byte size which.


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@182
PS1, Line 182:   /// Decodes t from 'buffer', reading up to the byte before 'buffer_end'. 'buffer'
             :   /// need not be aligned. If PARQUET_TYPE is FIXED_LEN_BYTE_ARRAY then 'fixed_len_size'
             :   /// is the size of the object. Otherwise, it is unused.
             :   /// Returns the number of bytes read or -1 if the value was not decoded successfully.
I think it will be super helpful to list which types are decoded by this template method and which ones have specialization. What do you all think?


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@208
PS1, Line 208: // Only used for testing to test decimals stored as INT32.
             :   // Otherwise Impala stores decimals as FIXED_LEN_BYTE_ARRAY.
a bit misleading, this implies that impala only supports this type for testing (that too not sure if for reading or writing). and the second line implies this method is only used for writing.


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc@195
PS1, Line 195:  if (slot_desc->type().type == TYPE_DECIMAL) {
here it is! you should add checks here to make sure int32 and int64 encoded decimal types have the right precision.



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Jul 2018 18:49:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 1:

(2 comments)

My main question arose out of me trying to understand the implications of the changes to ByteSize() - I feel like there's something weird there.

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return sizeof(InternalType); }
I'm still confused about what this function does. Can we rename to something like "InternalByteSize()". What does it returning -1 mean also?

It's weird that it also sometimes seems to be used to return the encoded byte size - that seems wrong.

E.g. I'm looking at this logic in HdfsParquetTableWriter and wondering if it's wrong:

      *bytes_needed = plain_encoded_value_size_ < 0 ?
          ParquetPlainEncoder::ByteSize<T>(*v) :
          plain_encoded_value_size_;


http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@293
PS1, Line 293:   def _create_table_from_file(self, table_name, unique_database):
It would be nice to use this for other tests (no requirement, just an observation :)).



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Jul 2018 16:07:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc@1536
PS1, Line 1536: DCHECK_EQ(sizeof(Decimal4Value::StorageType), slot_desc->type().GetByteSize());
> i dont remember where the file level column metadata validation occurs in c
I added the checks as there but leave these DCHECKs here as well if you don't mind.


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return sizeof(InternalType); }
> This methods represents the encoded byte size if the internalType is writte
Yeah, I was also a bit confused about it.
So, should I leave it as it is?


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@182
PS1, Line 182:   /// Decodes t from 'buffer', reading up to the byte before 'buffer_end'. 'buffer'
             :   /// need not be aligned. If PARQUET_TYPE is FIXED_LEN_BYTE_ARRAY then 'fixed_len_size'
             :   /// is the size of the object. Otherwise, it is unused.
             :   /// Returns the number of bytes read or -1 if the value was not decoded successfully.
> I think it will be super helpful to list which types are decoded by this te
I agree, I added a table based on SUPPORTED_PHYSICAL_TYPES in parquet-metadata-utils.cc, the Decode() functions in parquet-common.h, and the switch statement in HdfsParquetTableWriter::Init().


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@208
PS1, Line 208: / disallow it.
             : template <> int ParquetPlainEncoder::ByteSize(const ColumnType
> a bit misleading, this implies that impala only supports this type for test
True, I think I was a bit confused when I wrote these comments.


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc@195
PS1, Line 195:  if (slot_desc->type().type == TYPE_DECIMAL) {
> here it is! you should add checks here to make sure int32 and int64 encoded
Done.


http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@293
PS1, Line 293:   def _create_table_from_file(self, table_name, unique_database):
> I have created a related ticket in the past:
I'll do this refactor in an other commit.


http://gerrit.cloudera.org:8080/#/c/11000/1/tests/query_test/test_scanners.py@298
PS1, Line 298:     hdfs_file = '/test-warehouse/{0}.db/{1}'.format(unique_database, filename)
             :     check_call(['hdfs', 'dfs', '-copyFromLocal', '-f', local_file, hdfs_file])
> I am a bit worried about using /test-warehouse/. Your file names are unlike
I chose the first option.



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Jul 2018 15:35:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11000

to look at the new patch set (#2).

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.

The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.

Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64

I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.

Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
  and updated test_scanners.py

Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M testdata/data/README
A testdata/data/decimal_stored_as_int32.parquet
A testdata/data/decimal_stored_as_int64.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
9 files changed, 103 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/11000/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@199
PS2, Line 199: right precision
this might be a bit confusing since the term "precision" when used in the context of parquet refers to the explicit "precision" field defined in the metadata


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@203
PS2, Line 203: slot_desc->type().GetByteSize()
can you add check for schema_element.type_length, since this section of the code deals more with verifying the column metadata. I am not sure if the check at L216 is the same.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@234
PS2, Line 234:    // The other decimal metadata should be there but we don't need it.
             :     if (!schema_element.__isset.precision) {
             :       ErrorMsg msg(TErrorCode::PARQUET_MISSING_PRECISION, filename, schema_element.name);
             :       RETURN_IF_ERROR(state->LogOrReturnError(msg));
             :     } else {
             :       if (schema_element.precision != slot_desc->type().precision) {
             :         // TODO: we could allow a mismatch and do a conversion at this step.
             :         ErrorMsg msg(TErrorCode::PARQUET_WRONG_PRECISION, filename, schema_element.name,
             :             schema_element.precision, slot_desc->type().precision);
             :         RETURN_IF_ERROR(state->LogOrReturnError(msg));
             :       }
             :     }
can you also add similar checks for precision as well.

Use RETURN_IF_ERROR(state->LogOrReturnError(msg)); to make sure it is consistent with other precision checks


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc@38
PS2, Line 38: Handle special case of encoding decimal types stored as BYTE_ARRAY
update comment



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Jul 2018 18:09:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Aug 2018 20:21:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/133/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 15:20:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Bikramjeet Vig, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11000

to look at the new patch set (#6).

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.

The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.

Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64

I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.

Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
  and updated test_scanners.py

Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M testdata/data/README
A testdata/data/decimal_stored_as_int32.parquet
A testdata/data/decimal_stored_as_int64.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
9 files changed, 109 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/11000/6
-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/134/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 15:29:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Csaba Ringhofer, Bikramjeet Vig, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11000

to look at the new patch set (#4).

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.

The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.

Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64

I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.

Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
  and updated test_scanners.py

Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-plain-test.cc
M testdata/data/README
A testdata/data/decimal_stored_as_int32.parquet
A testdata/data/decimal_stored_as_int64.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
9 files changed, 109 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/11000/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11000/4/tests/query_test/test_scanners.py@303
PS4, Line 303: (
> flake8: E128 continuation line under-indented for visual indent
We're still having a discussion about these on the mailing list, feel free to ignore or chime in. If you want to fix the warning, you can run the following command locally to critique your current HEAD commit:

  ./bin/jenkins/critique-gerrit-review.py --dryrun



-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 17:27:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5542: Impala cannot scan Parquet decimal stored as int64 t/int32 t

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/161/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Aug 2018 17:26:09 +0000
Gerrit-HasComments: No