Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2018/04/23 23:38:17 UTC

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Hello Philip Zeyliger, Impala Public Jenkins,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/10167

to review the following change.


Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................

IMPALA-6899: Optimize the HDFS commands used in dataload

HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands will follow symbolic links and can copy
recursively, this can allow for further consolidation
by creating the full directory structure and
copying it in a single HDFS call.
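
As a rough illustration of the two consolidations described above (not the
actual patch contents), the sketch below passes several paths to one mkdir
call, then stages a directory tree of symlinks locally and copies it with one
recursive put. The paths, the hdfs_dfs wrapper, and the DRY_RUN switch are
hypothetical names introduced for this example only.

```shell
#!/bin/sh
# Sketch only: paths, hdfs_dfs and DRY_RUN are illustrative assumptions,
# not taken from the patch.
set -eu

CALLS=0
LAST_CMD=""

# Run an HDFS shell command, or only print it when DRY_RUN=1 (the default
# here, so the sketch can be tried without a cluster).
hdfs_dfs() {
  CALLS=$((CALLS + 1))
  LAST_CMD="hdfs dfs $*"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$LAST_CMD"
  else
    hdfs dfs "$@"
  fi
}

# Before: one JVM startup per directory.
#   hdfs dfs -mkdir -p /test-warehouse/a
#   hdfs dfs -mkdir -p /test-warehouse/b
# After: a single call that takes both paths.
hdfs_dfs -mkdir -p /test-warehouse/a /test-warehouse/b

# Build the desired layout locally with symlinks, then copy the whole
# tree with one recursive put instead of one put per file.
STAGING=$(mktemp -d)
SRC=$(mktemp)   # stand-in for a real build artifact
trap 'rm -rf "$STAGING" "$SRC"' EXIT
mkdir -p "$STAGING/udfs"
ln -s "$SRC" "$STAGING/udfs/example.jar"
hdfs_dfs -put -f "$STAGING/udfs" /test-warehouse/
```

With DRY_RUN=1 the script only prints the two commands it would run; setting
DRY_RUN=0 on a machine with a configured hdfs client would execute them.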

This change applies several of these optimizations
throughout the dataload codepath, shortening the
following steps:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25

Conflicts:
testdata/bin/copy-udfs-udas.sh - conflict due to
"Loosen hive-exec.jar glob pattern..."

Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
(cherry picked from commit da363a99a4b1afff91600c71650e26932be9350a)
---
M testdata/bin/copy-udfs-udas.sh
M testdata/bin/create-load-data.sh
M testdata/bin/load-hive-builtins.sh
3 files changed, 131 insertions(+), 122 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/10167/1
-- 
To view, visit http://gerrit.cloudera.org:8080/10167
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1: Verified+1


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Apr 2018 01:22:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2354/


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Apr 2018 20:22:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2356/


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Apr 2018 21:28:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1:

There was a conflict in the hive-exec.jar part of copy-udfs-udas.sh. I am running some tests for 2.x.


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Apr 2018 23:40:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................

IMPALA-6899: Optimize the HDFS commands used in dataload

HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands will follow symbolic links and can copy
recursively, this can allow for further consolidation
by creating the full directory structure and
copying it in a single HDFS call.

This change applies several of these optimizations
throughout the dataload codepath, shortening the
following steps:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25

Conflicts:
testdata/bin/copy-udfs-udas.sh - conflict due to
"Loosen hive-exec.jar glob pattern..."

Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
(cherry picked from commit da363a99a4b1afff91600c71650e26932be9350a)
Reviewed-on: http://gerrit.cloudera.org:8080/10167
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
---
M testdata/bin/copy-udfs-udas.sh
M testdata/bin/create-load-data.sh
M testdata/bin/load-hive-builtins.sh
3 files changed, 131 insertions(+), 122 deletions(-)

Approvals:
  Joe McDonnell: Looks good to me, approved
  Impala Public Jenkins: Verified

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: merged
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2354/


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Apr 2018 16:22:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1:

The test failure looks like IMPALA-6740. My previous gerrit-verify-dryrun-external run didn't hit this issue, so I am retrying.


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Apr 2018 21:28:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-6899: Optimize the HDFS commands used in dataload

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/10167 )

Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload
......................................................................


Patch Set 1: Code-Review+2

Tests ran without issue; moving forward with the backport.


Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Gerrit-Change-Number: 10167
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Apr 2018 16:22:03 +0000
Gerrit-HasComments: No