Posted to reviews@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2017/03/01 02:26:13 UTC
[Impala-ASF-CR] PREVIEW: IMPALA-3742: partitions INSERTs into Kudu tables

Thomas Tauber-Marshall has uploaded a new patch set (#2).

Change subject: PREVIEW: IMPALA-3742: partitions INSERTs into Kudu tables
......................................................................

PREVIEW: IMPALA-3742: partitions INSERTs into Kudu tables

Bulk inserts into Kudu are currently slow because we send
rows to Kudu in arbitrary order. Kudu partitions and sorts
data before writing it, so unpartitioned, unsorted input
creates a lot of extra work on the Kudu side.

We can alleviate this by sending the rows to Kudu already
partitioned and sorted. This patch partitions the rows to
insert according to Kudu's partitioning scheme. A followup
patch will deal with sorting.

It accomplishes this by inserting an exchange node into the
plan before the insert and then passing down an Expr to the
DataStreamSender that calls into the Kudu client to determine
the partition for each row.

This has the added benefit of creating a general interface to
pass arbitrary partitioning functions to DataStreamSender
as Exprs.
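
As a rough illustration of the idea above (not the code in this
patch), the following C++ sketch shows a sender that routes each row
to a channel chosen by a pluggable partitioning function. Everything
in it is hypothetical: Row, PartitionFn, SketchSender and
MakeRangePartitioner are made-up names, and the range lookup merely
stands in for the call into the Kudu client, whose real API is not
shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row: one integer partition-key column plus a payload.
struct Row {
  int64_t key;
  std::string payload;
};

// Stand-in for the partitioning Expr: maps a row to a partition index.
using PartitionFn = std::function<int(const Row&)>;

// Stand-in for a sender with one outgoing channel per partition.
class SketchSender {
 public:
  SketchSender(int num_channels, PartitionFn fn)
      : channels_(num_channels), partition_fn_(std::move(fn)) {}

  // Evaluates the partitioning function and buffers the row on that channel.
  void Send(const Row& row) {
    int idx = partition_fn_(row);
    if (idx < 0 || idx >= static_cast<int>(channels_.size())) {
      throw std::out_of_range("partition index out of range");
    }
    channels_[idx].push_back(row);
  }

  const std::vector<Row>& channel(int i) const { return channels_.at(i); }

 private:
  std::vector<std::vector<Row>> channels_;
  PartitionFn partition_fn_;
};

// Assumed placeholder for the Kudu client lookup: range partitioning over
// sorted split points. N split points yield N + 1 partitions; a key equal
// to a split point falls into the partition above it.
PartitionFn MakeRangePartitioner(std::vector<int64_t> split_points) {
  return [bounds = std::move(split_points)](const Row& row) {
    auto it = std::upper_bound(bounds.begin(), bounds.end(), row.key);
    return static_cast<int>(it - bounds.begin());
  };
}
```

Because the sender only sees a function from row to index, any other
partitioning scheme could be plugged in the same way, which is the
generality the paragraph above refers to.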

This patch is a PREVIEW so we can decide if we're happy with
the partitioning API Kudu has proposed and get that in on
the Kudu side. It does not have any tests, and has not been
tested for performance.

It also currently only works for tables with a single partition
column, due to difficulties with passing arguments into the
partitioning Expr. Some potential solutions:
1) Stamp out versions of the KuduPartitioning functions for
   different numbers of partitioning columns up to a limit.
   This would be simple but would place a hard limit on the number
   of partitioning columns in tables we can apply this optimization to.
2) Use the UDF varargs support. This would require casting all of
   the partitioning columns up to a common type.
3) Add a significant new feature to our UDF API that could be used
   here, e.g. add support for complex types such as Arrays, make
   it possible to do varargs with a type of AnyVal, introduce a
   BINARY type, etc. These would all be a significant amount of
   work, but potentially useful outside of this project.
4) Abandon the idea of passing a partitioning Expr, e.g. something
   like the first version of this review, but cleaned up.
5) Something else entirely.
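
To make the trade-off between options 1) and 2) concrete, here is a
small C++ sketch. It is only an illustration under assumptions: the
function names are hypothetical, the hash is a toy and is not Kudu's
partitioning function, and int64_t is picked arbitrarily as the
common type. The fixed-arity functions stop at two columns, which is
the hard limit option 1) would impose; the varargs-style function
takes any number of columns but needs every value cast up to the
common type first.

```cpp
#include <cstdint>
#include <initializer_list>

// Toy hash step for illustration only; NOT Kudu's partitioning function.
constexpr uint64_t kSeed = 1469598103934665603ULL;
inline uint64_t Mix(uint64_t h, int64_t v) {
  return (h * 1099511628211ULL) ^ static_cast<uint64_t>(v);
}

// Option 1 (hypothetical): one stamped-out function per column count.
// Only one and two columns are provided, so three or more are unsupported.
inline int PartitionCols1(int num_partitions, int64_t c0) {
  return static_cast<int>(Mix(kSeed, c0) %
                          static_cast<uint64_t>(num_partitions));
}
inline int PartitionCols2(int num_partitions, int64_t c0, int64_t c1) {
  return static_cast<int>(Mix(Mix(kSeed, c0), c1) %
                          static_cast<uint64_t>(num_partitions));
}

// Option 2 (hypothetical): variable number of columns, all of one type.
inline int PartitionVarArgs(int num_partitions,
                            std::initializer_list<int64_t> cols) {
  uint64_t h = kSeed;
  for (int64_t c : cols) h = Mix(h, c);
  return static_cast<int>(h % static_cast<uint64_t>(num_partitions));
}
```

With an int32_t column a and an int8_t column b, the varargs form
would be called as
PartitionVarArgs(16, {static_cast<int64_t>(a), static_cast<int64_t>(b)}),
making the cast to the common type explicit at the call site.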

Change-Id: Ic10b3295159354888efcde3df76b0edb24161515
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr.cc
A be/src/exprs/partitioning-functions.cc
A be/src/exprs/partitioning-functions.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M bin/impala-config.sh
M common/thrift/Partitions.thrift
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/KuduTable.java
M fe/src/main/java/org/apache/impala/planner/DataPartition.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
15 files changed, 281 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/37/6037/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6037

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic10b3295159354888efcde3df76b0edb24161515
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>