Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/04/09 15:49:00 UTC

[jira] [Commented] (IMPALA-5051) Add support to write INT64 timestamps to the parquet writer

    [ https://issues.apache.org/jira/browse/IMPALA-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813556#comment-16813556 ] 

ASF subversion and git services commented on IMPALA-5051:
---------------------------------------------------------

Commit 39413a18117acde1822d9f084ab30c748ce837bc in impala's branch refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=39413a1 ]

IMPALA-5051: Add INT64 timestamp write support in Parquet

Add query option "parquet_timestamp_type" that chooses the
Parquet type used when writing TIMESTAMP columns. This is an
experimental feature at the moment, because these types are not
widely adopted in other Hadoop components yet. For this reason,
the query option is added at "development" level, and the default
behavior is not changed.

The following options can be used:
INT96_NANOS (default):
  This is the same as the old behavior; it can represent any
  timestamp that Impala can handle.
INT64_MILLIS, INT64_MICROS:
  Can encode the whole [1400..10000) range handled by Impala
  at the cost of reduced precision. Values are rounded towards
  minus infinity during writing.
INT64_NANOS:
  Can encode a reduced range without losing nanosecond precision:
  [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  Values outside this range are converted to NULLs without warning.
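
The following minimal sketch (hypothetical code, not Impala's actual
TimestampValue implementation) illustrates the rounding and range rules
above. It assumes the timestamp is available as nanoseconds since the
Unix epoch in a 128-bit integer, which is wide enough for the whole
[1400..10000) year range; the function names are made up for illustration.

  #include <cstdint>
  #include <optional>

  using int128 = __int128;  // GCC/Clang extension; an assumption of this sketch.

  // Floor division: rounds towards minus infinity, matching the behavior
  // described for INT64_MILLIS and INT64_MICROS.
  static int64_t FloorDiv(int128 a, int64_t b) {
    int128 q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0))) --q;
    return static_cast<int64_t>(q);
  }

  int64_t ToInt64Millis(int128 nanos_since_epoch) {
    return FloorDiv(nanos_since_epoch, 1000 * 1000);
  }

  int64_t ToInt64Micros(int128 nanos_since_epoch) {
    return FloorDiv(nanos_since_epoch, 1000);
  }

  // INT64_NANOS cannot represent the full range: values outside
  // [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807]
  // do not fit in an int64 and are written as NULL.
  std::optional<int64_t> ToInt64Nanos(int128 nanos_since_epoch) {
    if (nanos_since_epoch < INT64_MIN || nanos_since_epoch > INT64_MAX) {
      return std::nullopt;
    }
    return static_cast<int64_t>(nanos_since_epoch);
  }

At the SQL level the behavior is selected through the normal query option
mechanism, e.g. "SET PARQUET_TIMESTAMP_TYPE=INT64_MICROS;" before the
INSERT that writes the Parquet data.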

The change was done completely in the backend and all TIMESTAMP
columns are written using the type set in the query option.
An alternative design would have been to implement some parts
in the frontend by adding TIMESTAMP->BIGINT conversion functions
to the query plan, which would make it easier to add a
per-column setting in the future. I chose the current design
because it seemed much simpler and there are no clear plans for a
per-column setting. Most of the code will still be useful if we
decide to go the other way in the future.

All types are written without conversion to UTC (the way Impala
always wrote timestamps), and this information is expressed in the
new Parquet logical types by setting isAdjustedToUTC to false. The
old logical type (converted_type) is not set, because old readers do
not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are written in UTC. These readers can still read
int64 timestamp columns as INT_64.
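
As a rough illustration of the metadata described above, the following
hypothetical sketch shows how such an annotation could be attached to a
Parquet SchemaElement using Thrift-generated C++ classes from
parquet.thrift (the header path and exact generated names are assumptions
and may differ from Impala's actual generated code):

  #include "gen-cpp/parquet_types.h"  // assumed Thrift-generated header

  void AnnotateInt64MicrosTimestamp(parquet::SchemaElement* col) {
    col->__set_type(parquet::Type::INT64);

    parquet::TimestampType ts_type;
    ts_type.__set_isAdjustedToUTC(false);  // values are not adjusted to UTC
    parquet::TimeUnit unit;
    unit.__set_MICROS(parquet::MicroSeconds());
    ts_type.__set_unit(unit);

    parquet::LogicalType logical_type;
    logical_type.__set_TIMESTAMP(ts_type);
    col->__set_logicalType(logical_type);

    // converted_type is deliberately left unset: old readers would assume
    // TIMESTAMP_MICROS means UTC, but they can still read the column as
    // plain INT_64.
  }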

Testing:
- added unit tests for new TimestampValue->int64 functions
- added EE tests checking the values / min-max stats / metadata
  written for int64 Parquet timestamps
- ran core tests

Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Reviewed-on: http://gerrit.cloudera.org:8080/12247
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>


> Add support to write INT64 timestamps to the parquet writer
> -----------------------------------------------------------
>
>                 Key: IMPALA-5051
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5051
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Lars Volker
>            Assignee: Csaba Ringhofer
>            Priority: Major
>
> This requires updating parquet.thrift to a version that includes the TIMESTAMP_MICROS logical type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org