Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/10/26 20:14:00 UTC

[jira] [Commented] (IMPALA-10215) Implement INSERT INTO for non-partitioned Iceberg tables (Parquet)

    [ https://issues.apache.org/jira/browse/IMPALA-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220968#comment-17220968 ] 

ASF subversion and git services commented on IMPALA-10215:
----------------------------------------------------------

Commit 981ef104654e187ae18b25f31df2fd324e687643 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=981ef10 ]

IMPALA-10215: Implement INSERT INTO for non-partitioned Iceberg tables (Parquet)

This commit adds support for INSERT INTO statements against Iceberg
tables when the table is non-partitioned and the underlying file format
is Parquet.
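
As an illustration of the new capability, here is a minimal sketch
that issues the statement through JDBC. The host, port 21050
(Impala's HiveServer2 endpoint), the table name, and the STORED AS
ICEBERG DDL are assumptions for the example, not part of this patch;
it also assumes the Hive JDBC driver is on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class IcebergInsertExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/default;auth=noSasl");
                 Statement stmt = conn.createStatement()) {
                // Hypothetical non-partitioned Iceberg table backed
                // by Parquet data files.
                stmt.execute("CREATE TABLE ice_t (i INT, s STRING, "
                           + "ts TIMESTAMP) STORED AS ICEBERG");
                // The statement this commit adds support for:
                stmt.execute("INSERT INTO ice_t VALUES (1, 'a', now())");
            }
        }
    }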

We still use Impala's HdfsParquetTableWriter to write the data files,
though it needed some modifications to conform to the Iceberg spec,
namely:
 * write the Iceberg/Parquet 'field_id' for each column
 * TIMESTAMPs are encoded as INT64 micros (without time zone); see
   the sketch after this list
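
As a sketch of the second rule above: a TIMESTAMP value is stored as
an INT64 count of microseconds since the Unix epoch, with no time
zone attached. Plain JDK code; the class and method names are
illustrative only.

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;

    public class TimestampMicros {
        // Encode an instant the way the writer stores TIMESTAMP values:
        // INT64 microseconds since 1970-01-01T00:00:00Z, no time zone.
        static long toMicros(Instant ts) {
            return ChronoUnit.MICROS.between(Instant.EPOCH, ts);
        }

        public static void main(String[] args) {
            System.out.println(toMicros(Instant.parse("1970-01-01T00:00:01Z")));
            // prints 1000000
        }
    }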

We use DmlExecState to transfer information about the written data
files from the table sink operators to the coordinator, then
updateCatalog() invokes Iceberg's AppendFiles API to add the files
atomically (sketched below). DmlExecState is encoded in Protobuf,
while communication with the Frontend uses Thrift. Therefore, to
avoid defining the Iceberg DataFile structure in both IDLs, the data
files are stored as FlatBuffers.
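
For reference, this is roughly what the atomic append looks like
through the Iceberg Java API (the coordinator performs the
equivalent inside updateCatalog()). The table handle, file path, and
file statistics below are placeholders.

    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;

    public class AppendSketch {
        static void appendParquetFile(Table table) {
            // Describe one Parquet data file written by the table sink.
            DataFile file = DataFiles.builder(table.spec())
                .withPath("/warehouse/ice_t/data/00000-0-data.parquet")
                .withFormat(FileFormat.PARQUET)
                .withFileSizeInBytes(1024L)
                .withRecordCount(100L)
                .build();
            // One append operation; commit() atomically produces a
            // new table snapshot containing the added file.
            AppendFiles append = table.newAppend();
            append.appendFile(file);
            append.commit();
        }
    }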

The commit also corrects some of the Impala type <-> Iceberg type
mappings (see the sketch after this list):
 * Impala TIMESTAMP is Iceberg TIMESTAMP (without time zone)
 * Impala CHAR is Iceberg FIXED
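
Expressed in the Iceberg Java type model, the corrected mappings
look like the following sketch; the schema and field names are made
up for illustration.

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    public class TypeMappingSketch {
        static Schema example() {
            return new Schema(
                // Impala TIMESTAMP -> Iceberg timestamp (without time zone)
                Types.NestedField.required(1, "ts",
                    Types.TimestampType.withoutZone()),
                // Impala CHAR(10) -> Iceberg fixed(10)
                Types.NestedField.optional(2, "c",
                    Types.FixedType.ofLength(10)));
        }
    }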

Testing:
 * Added INSERT tests to iceberg-insert.test
 * Added negative tests to iceberg-negative.test
 * I also did some manual testing with Spark; a read-side sketch
   follows this list. Spark is able to read Iceberg tables written by
   Impala unless we use TIMESTAMPs. In that case Spark rejects the
   data files because it only accepts TIMESTAMPs with time zone.
 * Added concurrent INSERT tests to test_insert_stress.py
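
The Spark read-side check could look like the sketch below, assuming
the Iceberg Spark runtime is on the classpath; the table location
and app name are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkReadCheck {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("iceberg-read-check")
                .getOrCreate();
            // Load the Iceberg table written by Impala by its location.
            Dataset<Row> df = spark.read().format("iceberg")
                .load("/warehouse/ice_t");
            df.show();
        }
    }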

Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Reviewed-on: http://gerrit.cloudera.org:8080/16545
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Implement INSERT INTO for non-partitioned Iceberg tables (Parquet)
> ------------------------------------------------------------------
>
>                 Key: IMPALA-10215
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10215
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Impala should be able to insert into non-partitioned Iceberg tables when the underlying data file format is Parquet.
> INSERT OVERWRITE and CTAS are out of scope for this sub-task.


