Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2018/04/20 20:28:17 UTC

[Impala-ASF-CR] IMPALA-6899: Optimize the HDFS commands used in dataload

Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/10120

to look at the new patch set (#3).

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................

IMPALA-6899: Optimize the HDFS commands used in dataload

HDFS command-line calls can be expensive due to JVM
startup and other costs. Since most HDFS commands
accept multiple paths, one way to reduce execution
time is to consolidate multiple HDFS commands into a
single HDFS call. HDFS put commands also follow
symbolic links and can copy recursively, which allows
further consolidation: create the full directory
structure locally, using symbolic links where needed,
then copy it in a single HDFS call.
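
The two consolidation ideas above could look roughly like the
following. This is a minimal sketch, not the code from the patch: the
function names make_dirs and stage_and_put, and every path shown, are
hypothetical. It relies only on `hdfs dfs -mkdir -p` and
`hdfs dfs -put -f` accepting multiple paths.

```shell
# Sketch only. Function names and all paths are hypothetical.

# Create several HDFS directories with one JVM startup instead of one
# per directory; `hdfs dfs -mkdir` accepts multiple paths.
make_dirs() {
  hdfs dfs -mkdir -p "$@"
}

# Stage local files in a temporary directory via symlinks, then upload
# them all with a single put. Usage: stage_and_put <hdfs-dest> <file>...
stage_and_put() {
  local dest="$1"
  shift
  local staging f abs rc=0
  staging="$(mktemp -d)"
  for f in "$@"; do
    # Resolve to an absolute path so the symlink stays valid.
    abs="$(cd "$(dirname "$f")" && pwd)/$(basename "$f")"
    ln -s "$abs" "${staging}/$(basename "$f")"
  done
  # One HDFS call for all staged files.
  hdfs dfs -put -f "${staging}"/* "${dest}" || rc=$?
  rm -rf "${staging}"
  return "$rc"
}
```

For example, `make_dirs /test-warehouse/a /test-warehouse/b` replaces
two separate mkdir calls, and
`stage_and_put /test-warehouse/udfs build/a.jar build/b.jar` replaces
one put per file (both sets of paths are made up for illustration).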

This change applies several of these optimizations
throughout the dataload code path. Each affected step
gets 15 to 40 seconds faster:
  Loading Hive builtins:  1:10 -> 0:30
  Loading custom schemas: 0:35 -> 0:20
  Loading Hive UDFs:      0:45 -> 0:25

Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
---
M testdata/bin/copy-udfs-udas.sh
M testdata/bin/create-load-data.sh
M testdata/bin/load-hive-builtins.sh
3 files changed, 112 insertions(+), 106 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/10120/3
-- 
To view, visit http://gerrit.cloudera.org:8080/10120

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10120
Gerrit-PatchSet: 3
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>