You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2019/01/21 17:11:41 UTC

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Csaba Ringhofer has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12247


Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 10000 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 512 insertions(+), 66 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/2
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Reviewed-on: http://gerrit.cloudera.org:8080/12247
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Csaba Ringhofer: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
> Does Parquet-MR also write INT64 timestamps un-normalized?
Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka Instant and LocalDateTime semantics.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Feb 2019 17:17:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2573/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Mar 2019 17:46:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#13).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 570 insertions(+), 107 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/13
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728:         if (iequals(value, "INT96_NANO")) {
> Shouldn't this typo have been caught by a test?
You are right, I think that I messed something up like running the tests with an older version of impala running (my last change was adding "S" to the end of query options). I will return soon when the tests pass.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 16:30:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#10).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 585 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/10
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 18:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3998/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Apr 2019 00:28:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3994/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Apr 2019 16:46:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#15).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/15
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1862/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 21:18:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 14:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2612/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Apr 2019 15:53:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#6).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (tha way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 539 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/6
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalize
The main problem with normalizing to UTC from Impala's standpoint is the performance cost of the UTC->localtime conversion when reading it back. Doing the timezone conversion for all timestamps in all rows is very costly compared to other tasks during Parquet scanning.


http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: tha
> nit: the
Done


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
> What about deleting the member 'result_', and only have it here as a local 
result_ shouldn't be a local variable, as we return a pointer to it. This is the expected interface by BaseColumnWriter::AppendRow() - it has no template for the type of the column, so it handles values with a void* that points to the value that should be inserted to the current row, and is expected to live until we step to the next row.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60
PS8, Line 60: Return
> nit: Returns
Done


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
> Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka 
As far as I know Parquet-MR does not do any timezone conversion and leaves this task to the caller, e.g. Hive.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: kudu::int128_t nanos128 =
             :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
             :     + time_.fractional_seconds();
             : 
             :   if (nanos128 <  std::numeric_limits<int64_t>::min()
             :       || nanos128 >  std::numeric_limits<int64_t>::max()) return false;
> I think we can still avoid using int128_t.
I created a ticket for benchmarking and optimizing these new functions: IMPALA-8268  Performance is not too important at the moment as parquet_timestamp_type is a development query option and should be mainly used to test whether the new timestamps can be read by other Hadoop components.

I think that the most costly things in this function are time_.fractional_seconds() and UtcToUnixTime()'s time_.total_seconds(), as these need int64 integer division. These could be avoided by not using UtcToUnixTime and converting to nanoseconds from day_ + time_ directly, but this has to be done carefully near the edge values.


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102: ---- QUERY
             : create table int96_nanos (ts timestamp) stored as parquet;
             : ====
             : ---- QUERY
             : # Insert edge values as "normal" int96 timestamps that can represent all values.
             : set parquet_timestamp_type=INT96_NANOS;
             : insert into int96_nanos values
> nit: you dont't need to start a new QUERY block for each query when you don
Thanks for the tip!



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 13:35:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#4).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 10000 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 516 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/4
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
> As far as I know Parquet-MR does not do any timezone conversion and leaves 
That's true, but it has a metadata field to tell the caller the desired semantics, which can be either UTC-normalized or not.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 14:11:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#7).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (tha way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 540 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/7
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not use logical types
> Then I think you should change the name of the function.
Personally I'm fine with the current name, it sets the converted type when it makes sense. I don't know how this could be described without making the function name overly long.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728:         if (iequals(value, "INT96_NANO")) {
> It should be "INT96_NANOS"
Shouldn't this typo have been caught by a test?



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 15:44:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not use logical types
> I mean I don't see that this function sets the converted type anywhere.
Ah, yes, good point, I haven't noticed we are only dealing with LocalDateTime semantics here and regardless of the precision, there is no converted type for that.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 16:01:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 12:

Patch 12 fixes the clang tidy error in timestamp-test + rebases and resolves conflicts related to query options.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Mar 2019 13:57:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#8).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (tha way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 542 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/8
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
Whouldn't it be better to convert to UTC? You write later that old readers also assume that the data is written in UTC.


http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: tha
nit: the


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
What about deleting the member 'result_', and only have it here as a local variable?

Now it looks strange that we invoke a member function here and pass a member variable to it. Also, this way we will have some aliasing inside ConvertTimestamp() that might prevent some compiler optimizations, but I'm not really sure honestly.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60
PS8, Line 60: Return
nit: Returns


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
Does Parquet-MR also write INT64 timestamps un-normalized?


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: kudu::int128_t nanos128 =
             :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
             :     + time_.fractional_seconds();
             : 
             :   if (nanos128 <  std::numeric_limits<int64_t>::min()
             :       || nanos128 >  std::numeric_limits<int64_t>::max()) return false;
I think we can still avoid using int128_t.


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102: ---- QUERY
             : create table int96_nanos (ts timestamp) stored as parquet;
             : ====
             : ---- QUERY
             : # Insert edge values as "normal" int96 timestamps that can represent all values.
             : set parquet_timestamp_type=INT96_NANOS;
             : insert into int96_nanos values
nit: you dont't need to start a new QUERY block for each query when you don't check the results.
E.g. it could be:

 ====
 ---- QUERY
 create table...
 set ...
 insert into ...
 create table ...
 ====



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Feb 2019 15:35:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 16: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3994/


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Apr 2019 21:53:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2316/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 13:46:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#11).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 584 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/11
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1837/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Jan 2019 18:45:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> Whouldn't it be better to convert to UTC? You write later that old readers 
No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalized to UTC, only the TIMESTAMP WITH LOCAL TIME ZONE type shall be. Please see this document: https://goo.gl/VV88c5



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Feb 2019 17:16:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 10:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2558/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Mar 2019 15:10:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 15:

Patch set 15 fixes a copy-paste error. I guess that I ran timestamp-test without rebuilding before uploading that last change, as it was broken before this fix.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Apr 2019 14:58:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#9).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 537 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/9
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 12:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2572/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Mar 2019 14:22:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 15: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Apr 2019 15:25:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py@180
PS10, Line 180: o
flake8: E225 missing whitespace around operator



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Mar 2019 14:28:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 9:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@17
PS9, Line 17: INT96_NANOS
I think the NANOS suffix is somewhat misleading as it suggests that this is just ns precision with a larger range. Since we also want to discourage use of this I suggest to name it INT96_DEPRECATED (or INT96_IMPALA since this is where it came from).


http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@42
PS9, Line 42: net
typo


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h@61
PS9, Line 61: that the type
nit: that the input type..


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc@150
PS9, Line 150:   parquet::TimestampType timestamp_type;
nit: declare these before using them?


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@713
PS9, Line 713:   int64_t tm_min_micros, tm_min_millis;
I'd declare that when you use it


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@721
PS9, Line 721:   // Add 250ns and check the value is rounded down
Maybe prefix each of these comments with a comment or assert of the current value of min_time. Otherwise one has to count how many times we added 250ns all the time.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@737
PS9, Line 737:   EXPECT_TRUE(min_date.FloorUtcToUnixTimeMicros(&tm_min_micros));
This test does not correspond to the comment above, does it? I.e., it is rounding towards -?


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: 250us
The switch between us and ns confused me for a bit, not sure if there's a good way to make it stand out more.

Also, some of these blocks add to the current value of min_date, while other blocks reset it. Maybe you can make more clear by not writing to it and using new variables instead (here and in the other blocks)


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@239
PS9, Line 239: 2262
The famous Impala Year-2262 problem being conceived and I get to be part of it. :)


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@375
PS9, Line 375: 1970.01.01
nit: 1970-01-01


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@98
PS9, Line 98: 24 * 60 * 60
While you're here, this could probably be a constant, too, given it's used in other places.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@133
PS9, Line 133: 
Nit: I think some of these newlines could go.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@156
PS9, Line 156:     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
nit: indent 4 spaces


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@160
PS9, Line 160:       || nanos128 >  std::numeric_limits<int64_t>::max()) return false;
nit: If it doesn't fit on one line, it should have braces {}


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py@848
PS9, Line 848:       ColumnStats('ts', -1001001, 1001001, 0)
Can you add a test for the case that values don't fit into 64bit nanos?


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180
PS9, Line 180: normpath
Why is "normpath" needed here?


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180
PS9, Line 180:   table_dir = tmp_dir + "/" + os.path.basename(os.path.normpath(hdfs_path))
use os.path.join



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Mar 2019 17:40:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 6:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2097/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 18:01:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1853/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 15:20:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2101/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 19:18:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 14:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2609/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Apr 2019 13:00:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(1 comment)

About the last 3 patch sets:

patch set 6 changes nanos->millis/micros conversions to round towards minus infinity instead of the nearest integer

patch set 7 fixes a comment by Zoltan Borok-Nagy and a copy-paste error in patch 6

patch set 8 is a rebase + conflict resolution. Rebasing was necessary because of Impala-lzo changes broke the build.


About checking the written Parquet files with Parquet tool:
thanks for the tool - I used it to check the written files manually, but I would prefer to integrate it into Impala in a separate commit. A jira to track this: IMPALA-8197

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: successful and return true. If the timestamp is invalid 
> It seemed premature optimization to me, so I did some measurements and look
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 19:18:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 7:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2100/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 18:32:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(7 comments)

About 'parquet-tools dump': it would be great to use it during Impala EE tests, but adding it as a dependency would also mean +1 way Impala builds can break, so I am not sure at the moment.

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400
PS3, Line 400: !=
> Shouldn't it be '=='?
It is intentionally !=. -1 means variable length, and plain_encoded_value_size_ should remain -1 in that case.

Actually this code led to DCHECK error during the writing of some decimals, which was not caught by the tests I ran locally. After running to issues on jenkins, I moved this logic Int64TimestampColumnWriterBase.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: Overrides of this function are expected to set 'result_'
> I think an output parameter would be more conventional.
I prefer the current solution - my guess is that adding an additional argument to a non-inline functions will make it slower, while writing to result_ is cheaper as 'this' is used during the function call anyway.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not use logical types
> Ah, yes, good point, I haven't noticed we are only dealing with LocalDateTi
Done


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@135
PS3, Line 135: (NANOS_PER_MILLI / 2)
> Might be worth to add a short comment about that we do the rounding here.
The whole rounding logic is under discussion - Impala uses rounding to microsec when writing Kudu, but I think that truncating towards negative infinity would be better. I sent an email to dev@impala.apache.org about this topic, I plan to wait for an agreement there before changing these functions.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@151
PS3, Line 151:   kudu::int128_t nanos128 =
             :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
             :     + time_.fractional_seconds();
             : 
             :   if (nanos128 <  std::numeric_limits<int64_t>::min()
             :       || nanos128 >  std::numeric_limits<int64_t>::max()) return false;
> I think we can avoid using int128_t here. We now what are the min and max u
I agree, but I am waiting for a decision in rounding vs truncating, see line 135.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728:         if (iequals(value, "INT96_NANO")) {
> You are right, I think that I messed something up like running the tests wi
Done


http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift@1
PS3, Line 1: 
> nit: accidental new line
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 20:50:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 17:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2694/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Apr 2019 00:51:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 16: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Apr 2019 16:46:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 18: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Apr 2019 05:27:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: Overrides of this function are expected to set 'result_'
> I prefer the current solution - my guess is that adding an additional argum
It seemed premature optimization to me, so I did some measurements and looked at the generated assembly.

Both 'this' and the output parameter are passed in via registers.

In the current case you need less registers, but you need to access 'result_' via 'this' plus an offset.

In the other case, you need one more register to pass the output parameter, but you don't need to add an offset to 'this'.

I couldn't measure any performance difference between the two approaches.

So if performance was the only reason to do this, I'd rather choose the conventional way.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 24 Jan 2019 17:13:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 11:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@17
PS9, Line 17: INT96_NANOS
> I think the NANOS suffix is somewhat misleading as it suggests that this is
I am open to a new name, but I want to consult with Hive / Parquet people first. This is a development query option at the moment, so I think that it is not a big issue to change the names in the future.

About discouraging people from using INT96_NANOS: currently this is the only timestamp representation that can be read by other Hadoop components. I would wait for a Parquet-MR release with logical type support + some cross-component testing / benchmarking before advising anyone to use the new types.


http://gerrit.cloudera.org:8080/#/c/12247/9//COMMIT_MSG@42
PS9, Line 42: not
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.h@61
PS9, Line 61: that the inpu
> nit: that the input type..
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/exec/parquet/parquet-metadata-utils.cc@150
PS9, Line 150:   parquet::TimeUnit time_unit;
> nit: declare these before using them?
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@713
PS9, Line 713:     // Test padding on double digits
> I'd declare that when you use it
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@721
PS9, Line 721:     ASSERT_TRUE(ParseFormatTokens(&dt_ctx))  << "TC: " << i;
> Maybe prefix each of these comments with a comment or assert of the current
I have rewritten min_date assignments to use explicit timestamps.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@737
PS9, Line 737:   TimestampValue min_date = TimestampValue::Parse("1400-01-01");
> This test does not correspond to the comment above, does it? I.e., it is ro
I think that "floor" in the name should make it clear that it rounds toward -. Note that ToUnixTime() also rounds towards -.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: EQ(MI
> The switch between us and ns confused me for a bit, not sure if there's a g
I have rewritten the code to always reset min_date. I think that it is easier to understand now, but I can create new variables if you think that it would further improve readability.


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@239
PS9, Line 239: 2262
> The famous Impala Year-2262 problem being conceived and I get to be part of
:D


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.h@375
PS9, Line 375: 
> nit: 1970-01-01
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@98
PS9, Line 98: SECONDS_PER_
> While you're here, this could probably be a constant, too, given it's used 
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@133
PS9, Line 133:   return true;
> Nit: I think some of these newlines could go.
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@156
PS9, Line 156: 
> nit: indent 4 spaces
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-value.inline.h@160
PS9, Line 160:   }
> nit: If it doesn't fit on one line, it should have braces {}
Done


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/query_test/test_insert_parquet.py@848
PS9, Line 848:     insert_stmt = """insert into {0} values
> Can you add a test for the case that values don't fit into 64bit nanos?
Done


http://gerrit.cloudera.org:8080/#/c/12247/10/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/query_test/test_insert_parquet.py@824
PS10, Line 824:         ("1969-12-31 23:59:59.999999999"),
This was broken when rounding towards nearest was replaced with flooring.


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180
PS9, Line 180: .path.no
> Why is "normpath" needed here?
It removes the trailing "/"


http://gerrit.cloudera.org:8080/#/c/12247/9/tests/util/get_parquet_metadata.py@180
PS9, Line 180:   table_dir = os.path.join(tmp_dir, os.path.basename(os.path.normpath(hdfs_path)))
> use os.path.join
Done


http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/12247/10/tests/util/get_parquet_metadata.py@180
PS10, Line 180:  
> flake8: E225 missing whitespace around operator
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Mar 2019 14:58:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#12).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 570 insertions(+), 107 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/12
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2614/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Apr 2019 19:15:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#17).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/17
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 5:

Some update:
There are two design decisions under discussion at the moment, one is the type of rounding to use when some precision is lost, and the other is whether to convert out-of-range int64 nanos to NULL silently or to abort with an error in this case.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 30 Jan 2019 15:20:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2672/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Apr 2019 15:36:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/1836/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 21 Jan 2019 17:54:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not use logical types
> Personally I'm fine with the current name, it sets the converted type when 
I mean I don't see that this function sets the converted type anywhere.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 15:56:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#3).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 10000 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 512 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/3
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: EQ(MI
> I feel I'm slightly in favor of adding proper distinct variables instead of
All the names I came up with were awkward, so I reused the same name but put all the tests to different blocks.


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc@743
PS13, Line 743: Add 250ns
> The comments should now reflect that you're setting it explicitly instead o
Done


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h@196
PS13, Line 196: 1'000'000'000LL
> We have NANOS_PER_SEC in timestamp-value-inline.h, use here, too, and elsew
NANOS_PER_DAY comes from walltime.h, which is included in most places, but I did not want to include it in a frequently used header like this.



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Apr 2019 12:18:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 14:

> Build Failed
 > 
 > https://jenkins.impala.io/job/gerrit-code-review-checks/2609/ :
 > Initial code review checks failed. See linked job for details on
 > the failure.

The error during code review check seems to be unrelated:
E: Failed to fetch http://us-west-2.ec2.archive.ubuntu.com/ubuntu/pool/universe/o/objenesis/libobjenesis-java_2.2-1_all.deb  504  Gateway Time-out [IP: 34.210.25.51 80]


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Apr 2019 15:16:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#5).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded to
  the nearest ms/us during writing. Values that would be rounded
  to year 10000 are rounded down to avoid converting valid values
  to invalid ones.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (as Impala always
wrote timestamps), and this information is  expressed in the new
Parquet logical types by setting isAdjustedToUTC to false. The old
logical type (converted_type) is net set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 515 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/5
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 18: Code-Review+2

Carry +2


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Apr 2019 08:53:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

Would it be possible to run `parquet-tools dump` as a part of the tests to check whether the timestamp values can be read back correctly by parquet-mr?


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 13:00:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 9: Code-Review+1

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> The main problem with normalizing to UTC from Impala's standpoint is the pe
OK, thanks for the explanation for both of you!


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
> result_ shouldn't be a local variable, as we return a pointer to it. This i
Ah, I completely missed that.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
> That's true, but it has a metadata field to tell the caller the desired sem
Sorry, I was really thinking about Hive and Spark.

But it doesn't really matter, since now we have this 'isAdjustedToUTC' field. And Zoltan also said in one of his comments that "'pure' TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalized to UTC"


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: // TODO: consider optimizing this (IMPALA-8268)
             :   kudu::int128_t nanos128 =
             :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
             :     + time_.fractional_seconds();
             : 
             :   if (nanos128 <  std::numeric_limits<int64_t>::min()
> I created a ticket for benchmarking and optimizing these new functions: IMP
OK, thanks!


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102: ---- QUERY
             : create table int96_nanos (ts timestamp) stored as parquet;
             : ====
             : ---- QUERY
             : # Insert edge values as "normal" int96 timestamps that can represent all values.
             : set parquet_timestamp_type=INT96_NANOS;
             : insert into int96_nanos values
> Thanks for the tip!
You're welcome!



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 15:43:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Ivanfi (Code Review)" <ge...@cloudera.org>.
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 5:

> About 'parquet-tools dump': it would be great to use it during
 > Impala EE tests, but adding it as a dependency would also mean +1
 > way Impala builds can break, so I am not sure at the moment.

I am very confident that we should have this extra check.

Adding parquet-tools to the build toolchain is not that difficult as it does not have to be installed. I created a small Maven project that runs it without installing it (in the conventional sense at least). The whole project consists of a single POM file and nothing else. Feel free to add it to Implala and modify it to fit your needs: https://github.com/zivanfi/parquet-tools-executor


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Thu, 24 Jan 2019 14:04:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400
PS3, Line 400: !=
Shouldn't it be '=='?


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@583
PS3, Line 583: Overrides of this function are expected to set 'result_'
I think an output parameter would be more conventional.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/exec/parquet/parquet-metadata-utils.cc@158
PS3, Line 158:   // converted_type is not set because older readers that do not use logical types
Then I think you should change the name of the function.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@135
PS3, Line 135: (NANOS_PER_MILLI / 2)
Might be worth to add a short comment about that we do the rounding here.


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/runtime/timestamp-value.inline.h@151
PS3, Line 151:   kudu::int128_t nanos128 =
             :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
             :     + time_.fractional_seconds();
             : 
             :   if (nanos128 <  std::numeric_limits<int64_t>::min()
             :       || nanos128 >  std::numeric_limits<int64_t>::max()) return false;
I think we can avoid using int128_t here. We now what are the min and max unix_timeseconds that can be represented in the limited range.

And if we are at the "boundary" (e.g. 2262-04-11 23:47:16), we now what are the limits of the fractional_seconds (e.g. 854775807).


http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12247/3/be/src/service/query-options.cc@728
PS3, Line 728:         if (iequals(value, "INT96_NANO")) {
It should be "INT96_NANOS"


http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12247/3/common/thrift/ImpalaInternalService.thrift@1
PS3, Line 1: 
nit: accidental new line



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Jan 2019 15:00:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 11:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2559/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Mar 2019 15:30:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#14).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added as "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior, can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the fronted by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I choose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will be still useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.

Testing:
- added unit tests for new TimestampValue->int64 functions
- add EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 611 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/14
-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 13: Code-Review+2

(3 comments)

Only had some nits, otherwise this looks good to me.

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/9/be/src/runtime/timestamp-test.cc@740
PS9, Line 740: EQ(MI
> I have rewritten the code to always reset min_date. I think that it is easi
I feel I'm slightly in favor of adding proper distinct variables instead of re-using min_date.


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-test.cc@743
PS13, Line 743: Add 250ns
The comments should now reflect that you're setting it explicitly instead of adding. Also check the one for max* below


http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/12247/13/be/src/runtime/timestamp-value.h@196
PS13, Line 196: 1'000'000'000LL
We have NANOS_PER_SEC in timestamp-value-inline.h, use here, too, and elsewhere in this file?



-- 
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Mar 2019 23:37:45 +0000
Gerrit-HasComments: Yes