Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2019/03/01 13:03:35 UTC

[Impala-ASF-CR] IMPALA-5051: Add INT64 timestamp write support in Parquet

Hello Zoltan Borok-Nagy, Zoltan Ivanfi, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12247

to look at the new patch set (#9).

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason
the query option is added at the "development" level, and the
default behavior is unchanged.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior and can represent any
  timestamp that can be handled by Impala.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.
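
The rounding and range rules above can be illustrated with a small
Python sketch. It is not the backend code from this change: the
helpers to_int64 and nanos_to_text and the (seconds, nanos) input
representation are hypothetical and chosen only for illustration.
It floors towards minus infinity, returns None (NULL) when the
value does not fit into an int64, and prints the bounds of the
INT64_NANOS range.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1)

# Nanoseconds per unit for each INT64 option named in the commit message.
NANOS_PER_UNIT = {
    "INT64_MILLIS": 1_000_000,
    "INT64_MICROS": 1_000,
    "INT64_NANOS": 1,
}


def to_int64(seconds_since_epoch, nanos_of_second, option):
    """Hypothetical helper: return the int64 to write, or None for NULL."""
    total_nanos = seconds_since_epoch * 1_000_000_000 + nanos_of_second
    # Python's // floors, so negative values round towards minus infinity.
    value = total_nanos // NANOS_PER_UNIT[option]
    if not INT64_MIN <= value <= INT64_MAX:
        return None  # does not fit into int64
    return value


def nanos_to_text(value):
    """Hypothetical helper: format an INT64_NANOS value as text."""
    # divmod floors too, so nanos is always in [0, 1_000_000_000).
    seconds, nanos = divmod(value, 1_000_000_000)
    moment = EPOCH + timedelta(seconds=seconds)
    return f"{moment.isoformat(sep=' ')}.{nanos:09d}"


if __name__ == "__main__":
    print(nanos_to_text(INT64_MIN))  # lower bound of INT64_NANOS
    print(nanos_to_text(INT64_MAX))  # upper bound of INT64_NANOS
    # One nanosecond before the epoch floors to -1 microsecond, not 0.
    print(to_int64(-1, 999_999_999, "INT64_MICROS"))
```

The two printed bounds match the range quoted for INT64_NANOS,
which is simply the set of nanosecond offsets from 1970-01-01
that an int64 can hold.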

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the frontend by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add the possibility
of per-column setting in the future. I chose the current design
because it seemed much simpler and there are no clear plans for the
per-column setting. Most of the code will still be useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.
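
As a rough summary of the metadata described above, the following
Python sketch maps each option value to the column annotations it
implies. The class TimestampColumnMetadata and the function
metadata_for are hypothetical names used only for illustration;
they do not mirror the structures in parquet-metadata-utils.cc.
The INT96 row assumes the old behavior carries no logical type
annotation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimestampColumnMetadata:
    """Hypothetical summary of the annotations of one TIMESTAMP column."""
    physical_type: str                  # Parquet physical type
    logical_unit: Optional[str]         # unit of the TIMESTAMP logical type
    is_adjusted_to_utc: Optional[bool]  # only present with a logical type
    converted_type: Optional[str]       # old annotation, never set here


def metadata_for(option):
    """Return the assumed metadata for a parquet_timestamp_type value."""
    if option == "INT96_NANOS":
        # Assumption: the old INT96 encoding has no logical type.
        return TimestampColumnMetadata("INT96", None, None, None)
    units = {
        "INT64_MILLIS": "MILLIS",
        "INT64_MICROS": "MICROS",
        "INT64_NANOS": "NANOS",
    }
    # Values are not converted to UTC, so isAdjustedToUTC is false.
    # converted_type stays unset so that old readers do not treat
    # the values as UTC-normalized.
    return TimestampColumnMetadata("INT64", units[option], False, None)


if __name__ == "__main__":
    for name in ("INT96_NANOS", "INT64_MILLIS", "INT64_MICROS", "INT64_NANOS"):
        print(name, metadata_for(name))
```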

Testing:
- added unit tests for new TimestampValue->int64 functions
- added EE tests for checking values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_metadata.py
18 files changed, 537 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/12247/9
-- 
To view, visit http://gerrit.cloudera.org:8080/12247

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi...@cloudera.com>