You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Attila Jeges (Code Review)" <ge...@cloudera.org> on 2019/04/30 14:13:00 UTC

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Attila Jeges has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13189


Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................

IMPALA-7370: DATE: Read/Write to parquet.

This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/bit-packing.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
D testdata/workloads/functional-query/queries/QueryTest/date-text-only-support.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/common/impala_connection.py
M tests/custom_cluster/test_parquet_page_index.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_insert_parquet.py
21 files changed, 329 insertions(+), 130 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/13189/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................

IMPALA-7370: DATE: Read/Write to parquet.

This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer. CREATE TABLE LIKE PARQUET
statements associated with data files that contain dates are also
supported.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/bit-packing.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.parquet
A testdata/data/out_of_range_date.parquet
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
D testdata/workloads/functional-query/queries/QueryTest/date-text-only-support.test
A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/common/impala_connection.py
M tests/custom_cluster/test_parquet_page_index.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_scanners.py
28 files changed, 435 insertions(+), 148 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/13189/4
-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3086/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 20:02:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 3: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13189/3/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/13189/3/be/src/exec/parquet/parquet-column-stats.inline.h@144
PS3, Line 144: template <typename T>
             : template <parquet::Type::type ParquetType>
             : inline bool ColumnStats<T>:
nit: If T would be not be a class template, then this function also wouldn't need to be a class function.


http://gerrit.cloudera.org:8080/#/c/13189/3/be/src/exec/parquet/parquet-column-stats.inline.h@178
PS3, Line 178:   const DateValue* result = reinterpret_cast<DateValue*>(slot);
nit: This could be moved to the start of the function, and DecodePlainValueNoValidation() could expect T* instead of void* in the slot. This would make the solution with the template 1-2 line shorter than the original. :)


http://gerrit.cloudera.org:8080/#/c/13189/3/fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java
File fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java:

http://gerrit.cloudera.org:8080/#/c/13189/3/fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java@282
PS3, Line 282:     if (prim.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT32 &&
I guess this was needed to make CREATE TABLE LIKE PARQUET work with dates - can you mention this in the commit message? I consider this to be a somewhat separate feature from the reading/writing of data.



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 May 2019 15:37:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 6: Code-Review+2

Carry +2


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 19:17:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13189/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/13189/3/tests/query_test/test_scanners.py@359
PS3, Line 359:     """
> nit: close quoute at the end of the previous line.
'test_timestamp_out_of_range' in L330-333 also uses this style for multiline comments.



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 12:44:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3059/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 May 2019 16:03:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4157/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 14:31:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 2:

(5 comments)

I did a quick run through on this review. Seems fine just have some minor observations.

http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1267
PS2, Line 1267:     // Right now, scanning DATE values is only supported for TEXT and PARQUET fileformats.
I feel that listing unsupported formats could be done right above findUnsupportedDateFsPartition() instead of here. Leaving it here would make it harder for someone who introduces new fileformat support for date to find all the occurences of comments to re-write.


http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1341
PS2, Line 1341:  if (ff != HdfsFileFormat.TEXT && ff != HdfsFileFormat.PARQUET) {
              :         return part;
              :       }
nit: can this be merged into one line?


http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
File testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test:

http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test@7
PS2, Line 7: ---- QUERY
nit: Would it make sense to merge this step with the previous one as it doesn't test anything other than not throwing an error.


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/tests/common/impala_connection.py@285
PS2, Line 285:   r = None
             :     try:
             :       r = self.__fetch_results(handle, profile_format=profile_format)
             :     finally:
             :       if r is None:
             :         # Try to close the query handle but ignore any exceptions not to overwrite the
             :         # original exception raised by '__fetch_results'.
             :         try:
             :           self.close_query(handle)
             :         except Exception:
             :           pass
             :       else:
             :         self.close_query(handle)
             :         return r
Is this an intentional change for "date on parquet"? Isn't this an unrelated fix?


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py@669
PS2, Line 669:     # Expected values for functional.date_tbl
nit: for me this comment doesn't add any value



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 15:38:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 2:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG@12
PS2, Line 12: Parquet uses DATE logical type for dates. DATE logical type annotates
            : an INT32 that stores the number of days from the Unix epoch, 1 January
            : 1970.
> Gregorian stuff + Hive could be mentioned here, and I would also prefer to 
Done. Added some e2e tests too.


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc@957
PS2, Line 957:   // The range was already checked during the int32_t->DateValue conversion, which
             :   // sets the date to invalid if it was out of range.
> optional:
I'll leave it like this for now. We can figure out a better way later.


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h@171
PS2, Line 171:   DateValue* result = reinterpret_cast<DateValue*>(slot);
             :   int size = buffer.size();
             :   const uint8_t* data = reinterpret_cast<const uint8_t*>(buffer.data());
             :   if (ParquetPlainEncoder::Decode<DateValue, parquet::Type::INT32>(data, data + size,
             :       size, result) == -1) {
             :     return false;
             :   }
> This looks very similar to line 148-155. It would be nice to extract it int
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h@370
PS2, Line 370: i
> nit: "i" suggests a loop variable to me, so I would prefer a name like tmp 
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py@408
PS2, Line 408:   ("PARQUET_DATE_OUT_OF_RANGE", 134,
             :    "Parquet file '$0' column '$1' contains an out of range date. "
             :    "The valid date range is 0000-01-01..9999-12-31."),
> Can you create file with a corrupt value to cover related paths?
I've added the e2e test. This test revealed a missing bit in my implementation, thanks!


http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1267
PS2, Line 1267:     // Right now, scanning DATE values is only supported for TEXT and PARQUET fileformats.
> I feel that listing unsupported formats could be done right above findUnsup
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1341
PS2, Line 1341: ff != HdfsFileFormat.TEXT && ff != HdfsFileFormat.PARQUET
> A function like "IsTypeSupported()" or "GetSupportedType()" could be added 
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1341
PS2, Line 1341:  if (ff != HdfsFileFormat.TEXT && ff != HdfsFileFormat.PARQUET) {
              :         return part;
              :       }
> nit: can this be merged into one line?
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
File testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test:

http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test@7
PS2, Line 7: ---- QUERY
> nit: Would it make sense to merge this step with the previous one as it doe
Added expected results to the previous section. I think it makes more sense than merging the two test cases together.


http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@236
PS2, Line 236: # Note: dictionary filtering currently does not work on dates
> IMPALA-4994 could be mentioned here, it would make it easier to cleanup rel
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/tests/common/impala_connection.py@285
PS2, Line 285:   r = None
             :     try:
             :       r = self.__fetch_results(handle, profile_format=profile_format)
             :     finally:
             :       if r is None:
             :         # Try to close the query handle but ignore any exceptions not to overwrite the
             :         # original exception raised by '__fetch_results'.
             :         try:
             :           self.close_query(handle)
             :         except Exception:
             :           pass
             :       else:
             :         self.close_query(handle)
             :         return r
> Is this an intentional change for "date on parquet"? Isn't this an unrelate
I decided to include this fix here as it broke some of the hs2 DATE tests in test_date_queries.py.

I could create a separate patch set only for this, although probably that would be an overkill.


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py@667
PS2, Line 667:     values.
> nit: Date could be mentioned in this comment + the reason for handling it s
Done


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py@669
PS2, Line 669:     # Expected values for functional.date_tbl
> nit: for me this comment doesn't add any value
Removed it.



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 May 2019 15:15:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 7: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 May 2019 00:36:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................

IMPALA-7370: DATE: Read/Write to parquet.

This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer. CREATE TABLE LIKE PARQUET
statements associated with data files that contain dates are also
supported.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/bit-packing.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.parquet
A testdata/data/out_of_range_date.parquet
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
D testdata/workloads/functional-query/queries/QueryTest/date-text-only-support.test
A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date.test
A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/common/impala_connection.py
M tests/custom_cluster/test_parquet_page_index.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_scanners.py
29 files changed, 465 insertions(+), 148 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/13189/6
-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4157/


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 19:12:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................

IMPALA-7370: DATE: Read/Write to parquet.

This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/bit-packing.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.parquet
A testdata/data/out_of_range_date.parquet
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
D testdata/workloads/functional-query/queries/QueryTest/date-text-only-support.test
A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/common/impala_connection.py
M tests/custom_cluster/test_parquet_page_index.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_scanners.py
28 files changed, 431 insertions(+), 148 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/13189/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4160/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 19:17:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................

IMPALA-7370: DATE: Read/Write to parquet.

This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer. CREATE TABLE LIKE PARQUET
statements associated with data files that contain dates are also
supported.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Reviewed-on: http://gerrit.cloudera.org:8080/13189
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/bit-packing.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.parquet
A testdata/data/out_of_range_date.parquet
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
D testdata/workloads/functional-query/queries/QueryTest/date-text-only-support.test
A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date.test
A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/common/impala_connection.py
M tests/custom_cluster/test_parquet_page_index.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_scanners.py
29 files changed, 465 insertions(+), 148 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................

IMPALA-7370: DATE: Read/Write to parquet.

This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer.

Parquet uses DATE logical type for dates. DATE logical type annotates
an INT32 that stores the number of days from the Unix epoch, 1 January
1970.

Change-Id: I67da03754531660bc8de3b6935580d46deae1814
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/bit-packing.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
D testdata/workloads/functional-query/queries/QueryTest/date-text-only-support.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
M tests/common/impala_connection.py
M tests/custom_cluster/test_parquet_page_index.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_insert_parquet.py
21 files changed, 326 insertions(+), 128 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/13189/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 2:

> Uploaded patch set 2.

Rebased


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 14:25:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13189/1/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/13189/1/tests/common/impala_connection.py@294
PS1, Line 294: e
flake8: E722 do not use bare except'


http://gerrit.cloudera.org:8080/#/c/13189/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/13189/1/tests/query_test/test_insert_parquet.py@74
PS1, Line 74: ;
flake8: E703 statement ends with a semicolon


http://gerrit.cloudera.org:8080/#/c/13189/1/tests/query_test/test_insert_parquet.py@666
PS1, Line 666: d
flake8: E303 too many blank lines (2)



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 14:14:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3000/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 14:56:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 6:

Added the missing hive2-pre-gregorian-date.test file.


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 19:16:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3082/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 10:16:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 14:31:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3003/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 15:23:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG@12
PS2, Line 12: Parquet uses DATE logical type for dates. DATE logical type annotates
            : an INT32 that stores the number of days from the Unix epoch, 1 January
            : 1970.
Gregorian stuff + Hive could be mentioned here, and I would also prefer to have a test file written by Hive and a test that demonstrates this issue. I do not want to add special handling in this patch (and it can turn out that this doesn't cause problems for users), but a follow up Jira could be created for possible solutions (e.g. using flags + info about the writer to adjust old dates).


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc@957
PS2, Line 957:   // The range was already checked during the int32_t->DateValue conversion, which
             :   // sets the date to invalid if it was out of range.
optional:
This effectively means that the value is checked twice. It is possible that doing memcpy to the slot regardless of validity and checking it afterwards would be a bit faster.

This doesn't have to be done in this patch, we can dig deeper into performance later. I have some ideas for making the whole validation stuff faster, e.g. check for the existence of any invalid values in the first pass and copy them unchanged, and do a second pass that sets invalid ones to NULL only if needed.

(on a second thought, the performance of plain decoding DATEs may not be too important, as the file needs to cover a very large range to not fit into the dictionary).


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h@171
PS2, Line 171:   DateValue* result = reinterpret_cast<DateValue*>(slot);
             :   int size = buffer.size();
             :   const uint8_t* data = reinterpret_cast<const uint8_t*>(buffer.data());
             :   if (ParquetPlainEncoder::Decode<DateValue, parquet::Type::INT32>(data, data + size,
             :       size, result) == -1) {
             :     return false;
             :   }
This looks very similar to line 148-155. It would be nice to extract it into a template function (if it is straightforward, maybe I missed something).


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h@370
PS2, Line 370: i
nit: "i" suggests a loop variable to me, so I would prefer a name like tmp or val (the latter is used in the next function EncodeToInt32())


http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py@408
PS2, Line 408:   ("PARQUET_DATE_OUT_OF_RANGE", 134,
             :    "Parquet file '$0' column '$1' contains an out of range date. "
             :    "The valid date range is 0000-01-01..9999-12-31."),
Can you create file with a corrupt value to cover related paths?


http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1341
PS2, Line 1341: ff != HdfsFileFormat.TEXT && ff != HdfsFileFormat.PARQUET
A function like "IsTypeSupported()" or "GetSupportedType()" could be added to fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java

This would make it easier to add new file format + SQL type pairs, as we wouldn't need to hunt down every related part in the code.


http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@236
PS2, Line 236: # Note: dictionary filtering currently does not work on dates
IMPALA-4994 could be mentioned here, it would make it easier to cleanup related comments once it is implemented.


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py@667
PS2, Line 667:     values.
nit: Date could be mentioned in this comment + the reason for handling it separately from other types.



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 16:01:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 19:17:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )

Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 4: Code-Review+2

(1 comment)

I'm fine with the changes. Normally I would give only a +1 on this as I don't have deep knowledge around this code, but since Csaba already gave a +1 I think this is free to go.

http://gerrit.cloudera.org:8080/#/c/13189/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/13189/3/tests/query_test/test_scanners.py@359
PS3, Line 359:     """
nit: close quoute at the end of the previous line.



-- 
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 May 2019 12:33:38 +0000
Gerrit-HasComments: Yes