Posted to reviews@impala.apache.org by "Philip Zeyliger (Code Review)" <ge...@cloudera.org> on 2017/10/18 21:56:35 UTC

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Philip Zeyliger has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8320 )


Change subject: IMPALA-6070: Parallel data load.
......................................................................

IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H data, and TPC-DS data in parallel. In
parallel, these take about 37 minutes, dominated by functional-query. Serially,
these take about 30 minutes more, namely the 13 minutes of tpch and 16
minuites of tpcds. This works out nicely because CPU usage during data load is
very low in aggregate. (We don't sustain more than 1 CPU of load, whereas build
machines are likely to have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a backgroundable
task, and support waiting for all tasks.
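
As a rough illustration of the idea (not the actual contents of run-step.sh), backgrounding and waiting could be sketched as below. The function and array names follow those quoted in the review comments; the MSG/LOG/CMD argument order is inferred from the run-step calls quoted there, and RUN_STEP_LOG_DIR is a hypothetical variable introduced only for this sketch.

```bash
#!/usr/bin/env bash
# Sketch only: NOT the real testdata/bin/run-step.sh.
# RUN_STEP_LOG_DIR is a hypothetical variable used just for this example.
RUN_STEP_LOG_DIR="${RUN_STEP_LOG_DIR:-${TMPDIR:-/tmp}}"

RUN_STEP_PIDS=()   # PIDs of steps started in the background
RUN_STEP_MSGS=()   # description of each step, in the same order as the PIDs

# Usage: run-step-backgroundable MSG LOG_FILE_NAME CMD [ARGS...]
# Starts CMD in the background, sends its output to the log file, and
# records its PID and description for run-step-wait-all.
function run-step-backgroundable {
  local msg="$1"
  local log="$2"
  shift 2
  "$@" > "${RUN_STEP_LOG_DIR}/${log}" 2>&1 &
  local pid=$!
  echo "    Started ${msg} in background; pid ${pid}."
  RUN_STEP_PIDS+=("${pid}")
  RUN_STEP_MSGS+=("${msg}")
}

# Usage: run-step-wait-all
# Waits for every recorded step and returns non-zero if any of them failed.
function run-step-wait-all {
  local failed=0
  local i
  for i in "${!RUN_STEP_PIDS[@]}"; do
    if ! wait "${RUN_STEP_PIDS[$i]}"; then
      echo "FAILED: ${RUN_STEP_MSGS[$i]}" >&2
      failed=1
    fi
  done
  # Reset both arrays so that a later batch of steps starts clean.
  RUN_STEP_PIDS=()
  RUN_STEP_MSGS=()
  return "${failed}"
}

# Example use (commands are placeholders):
#   run-step-backgroundable "Loading TPC-H data" load-tpch.log sleep 1
#   run-step-backgroundable "Loading TPC-DS data" load-tpcds.log sleep 1
#   run-step-wait-all
```

Both functions must be called directly in the same shell (not in a subshell or command substitution), since `wait` can only wait for children of the current shell and the arrays would otherwise not persist.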

I also increased the heapsize of our HiveServer2 server. When datasets were
being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the timestamps):

15:58:04      Started Loading functional-query data in background; pid 8105.
15:58:04      Started Loading TPC-H data in background; pid 8106.
15:58:04  Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04      Started Loading TPC-DS data in background; pid 8107.
15:58:04  Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04  Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31    Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08    Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
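
The "about 30 minutes" figure can be checked against the "Took:" values in the log output above. The following is a small worked calculation, assuming (as a simplification) that each load would take the same time when run alone as it did when run concurrently; the helper names are made up for this example.

```bash
#!/usr/bin/env bash
# Durations are copied from the "Took:" fields in the log output above.

# to_seconds MIN SEC: prints the total number of seconds.
to_seconds() { echo $(( $1 * 60 + $2 )); }

# fmt_duration SECONDS: prints "M min S sec".
fmt_duration() { printf '%d min %d sec\n' $(( $1 / 60 )) $(( $1 % 60 )); }

tpch=$(to_seconds 13 27)
tpcds=$(to_seconds 16 29)
functional=$(to_seconds 37 4)

# Serial estimate: one load after another.
serial=$(( tpch + tpcds + functional ))

# Parallel wall-clock time: the slowest of the three loads.
parallel=$functional
if (( tpch > parallel )); then parallel=$tpch; fi
if (( tpcds > parallel )); then parallel=$tpcds; fi

echo "serial:   $(fmt_duration "$serial")"                    # 67 min 0 sec
echo "parallel: $(fmt_duration "$parallel")"                  # 37 min 4 sec
echo "saved:    $(fmt_duration $(( serial - parallel )))"     # 29 min 56 sec
```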

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 40 insertions(+), 5 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/8320/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8320
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Hello Jim Apple, Joe McDonnell, Alex Behm, Zach Amsden, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8320

to look at the new patch set (#2).

Change subject: IMPALA-6070: Parallel data load.
......................................................................

IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.

I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04  Started Loading functional-query data in background; pid 8105.
15:58:04  Started Loading TPC-H data in background; pid 8106.
15:58:04  Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04  Started Loading TPC-DS data in background; pid 8107.
15:58:04  Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04  Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31    Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08    Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)

I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 44 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/8320/2

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Michael Brown (Code Review)" <ge...@cloudera.org>.
Michael Brown has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 2: Code-Review+1



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 15:34:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................

IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.

I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04  Started Loading functional-query data in background; pid 8105.
15:58:04  Started Loading TPC-H data in background; pid 8106.
15:58:04  Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04  Started Loading TPC-DS data in background; pid 8107.
15:58:04  Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04  Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31    Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08    Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)

I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <jb...@apache.org>
Reviewed-by: Michael Brown <mi...@cloudera.com>
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 44 insertions(+), 5 deletions(-)

Approvals:
  Jim Apple: Looks good to me, but someone else must approve
  Michael Brown: Looks good to me, but someone else must approve
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 2: Code-Review+2



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 16:41:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75:   HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> People like Alex are those whom I was most concerned about, as I know he us
:). I'm still using that good-old machine, mem should be fine (fingers crossed).




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Oct 2017 00:47:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1378/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Oct 2017 20:25:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 1:

(2 comments)

Changes like these tend to be slow and painful to test, so I'm in favor of not parallelizing additional things in this patch. Additional steps can be improved later.

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@33
PS1, Line 33: 
What testing did you do? Does the data load still run on a non-beefy local machine?


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75:   HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> This looks like it will also increase HADOOP_HEAPSIZE when not doing a para
I'd prefer to keep this change. Our Hive server tends to OOM pretty easily when doing anything non-trivial with Hive on our mini cluster.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Oct 2017 00:07:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 2: Verified+1



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Oct 2017 00:00:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 2:

(9 comments)

Thanks for the reviews!

I watched memory usage during the run, and on my 32GB machine I always had ~20GB available.

I agree with Alex on adding in more things: there are similar changes that can continue to help here, but I'm doing them one at a time.

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@9
PS1, Line 9: This commit loads functional-query, TPC-H data, and TPC-DS data in
> nit: Can you wrap this at the red line provided by gerrit? I think it is 72
Done. "gqip" does it in vi. It looks like it's 72 chars.


http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@12
PS1, Line 12: 13 minut
> nit: minutes
Done


http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@33
PS1, Line 33: 16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
> What testing did you do? Does the data load still run on a non-beefy local 
Define non-beefy?

My desktop is 32 GB and 8 cores. This ran fine.


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@480
PS1, Line 480:   # Run some steps in parallel, with run-step-backgroundable / run-step-wait-all.
> Could add a comment about what you decided to background and what you decid
Done.


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@492
PS1, Line 492:     LOAD_NESTED_ARGS="--cm-host $CM_HOST"
> I don't see any reason this also couldn't run in parallel.
Yes, but I've not tested this one.


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@505
PS1, Line 505:       load-data "functional-query" "core" "hbase/none"
             : fi
             : 
             : if $KUDU_IS_SUPPORTED; then
             :   # Tests depend on the kudu data being clean, so load
> It should be possible to do the same thing for these. That will only save a
Yes. I am testing this one, but I'll do a separate patch for it.


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75:   HADOOP_HEAPSIZE="512" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> :). I'm still using that good-old machine, mem should be fine (fingers cros
512 works, so that's what I've changed it to. I'm not investigating using -Xms -Xmx to give this more flexibility (but even less predictability).


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh
File testdata/bin/run-step.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@53
PS1, Line 53: 
> nit: only one empty line, to match context
Done


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@84
PS1, Line 84:   RUN_STEP_MSGS=()
> Do you want to reset MSGS, too?
Good catch. Done.
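
To illustrate why both arrays need resetting, here is a toy example. The array names come from the review; the PIDs, the step descriptions' pairing, and the helper functions are made up for this sketch. The two arrays are parallel (entry i of one describes entry i of the other), so clearing only the PIDs leaves stale messages that get paired with the wrong process in the next batch.

```bash
#!/usr/bin/env bash
# Toy illustration; helper names and PID values are hypothetical.

# record_step PID MSG: appends one entry to each parallel array.
record_step() { RUN_STEP_PIDS+=("$1"); RUN_STEP_MSGS+=("$2"); }

# describe_first: prints "PID: MSG" for the entry at index 0.
describe_first() { echo "${RUN_STEP_PIDS[0]}: ${RUN_STEP_MSGS[0]}"; }

# Incomplete reset: clears the PIDs but leaves old messages behind.
reset_pids_only() { RUN_STEP_PIDS=(); }

# Complete reset: clears both arrays together.
reset_all() { RUN_STEP_PIDS=(); RUN_STEP_MSGS=(); }

# First batch.
RUN_STEP_PIDS=()
RUN_STEP_MSGS=()
record_step 101 "Loading TPC-H data"
record_step 102 "Loading TPC-DS data"

# Second batch after an incomplete reset: the new PID is paired with a
# stale message from the first batch.
reset_pids_only
record_step 201 "Loading Kudu functional"
stale=$(describe_first)    # "201: Loading TPC-H data"

# Second batch after a complete reset: PID and message line up again.
reset_all
record_step 201 "Loading Kudu functional"
fixed=$(describe_first)    # "201: Loading Kudu functional"

echo "pids-only reset: ${stale}"
echo "full reset:      ${fixed}"
```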




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Sat, 21 Oct 2017 21:32:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@492
PS1, Line 492:   run-step "Loading auxiliary workloads" load-aux-workloads.log load-aux-workloads
I don't see any reason this also couldn't run in parallel.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:44:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75:   HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> I'd prefer to keep this change. Our Hive server tends to OOM pretty easily 
People like Alex are those whom I was most concerned about, as I know he used to develop Impala on a machine without much memory.

If Alex is OK with this, I am, too.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Oct 2017 00:09:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 2: Code-Review+1

LGTM. Not +2ing so others have a chance to weigh in as to whether you have addressed their comments.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-Comment-Date: Sat, 21 Oct 2017 22:23:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@505
PS1, Line 505:   # Tests depend on the kudu data being clean, so load the data from scratch.
             :   run-step "Loading Kudu functional" load-kudu.log \
             :         load-data "functional-query" "core" "kudu/none/none" force
             :   run-step "Loading Kudu TPCH" load-kudu-tpch.log \
             :         load-data "tpch" "core" "kudu/none/none" force
It should be possible to do the same thing for these. That will only save about 4 minutes, but this runs even when loading from a snapshot.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:16:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel data load.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@9
PS1, Line 9: This commit loads functional-query, TPC-H data, and TPC-DS data in parallel. In
nit: Can you wrap this at the red line provided by gerrit? I think it is 72 characters. Emacs will wrap it for you at the right space with M-q (fill-paragraph), if you choose.


http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@12
PS1, Line 12: minuites
nit: minutes


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@480
PS1, Line 480:   run-step-backgroundable "Loading functional-query data" load-functional-query.log \
Could add a comment about what you decided to background and what you decided not to, and why?


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75:   HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> I'm currently testing to see if 512 is enough.
This looks like it will also increase HADOOP_HEAPSIZE when not doing a parallel load, which is a shame. Do you see a way around that?
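
One possible way around this, offered only as a sketch and not something this patch does, would be to keep a smaller default and let callers that load in parallel ask for more. HIVE_SERVER2_HEAPSIZE_MB and hs2_heap_mb are hypothetical names that do not exist in Impala's scripts, and the default of 256 is assumed from the commit message's note that parallel loads ran out of memory at 256MB.

```bash
#!/usr/bin/env bash
# Sketch only. HIVE_SERVER2_HEAPSIZE_MB is a hypothetical override variable;
# 256 is assumed to be the previous heap size, per the commit message.
DEFAULT_HS2_HEAP_MB=256

# Prints the heap size (in MB) to use for HiveServer2: the override if it is
# set and non-empty, otherwise the default.
hs2_heap_mb() {
  echo "${HIVE_SERVER2_HEAPSIZE_MB:-${DEFAULT_HS2_HEAP_MB}}"
}

# In run-hive-server.sh, the launch line would then read roughly:
#   HADOOP_HEAPSIZE="$(hs2_heap_mb)" hive --service hiveserver2 \
#     > ${LOGDIR}/hive-server2.out 2>&1 &
#
# and a parallel data load would export the larger value beforehand:
#   export HIVE_SERVER2_HEAPSIZE_MB=1024
```

The trade-off is that the heap size then depends on how the server was started, which makes out-of-memory failures harder to reproduce than with a single fixed value.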


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh
File testdata/bin/run-step.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@53
PS1, Line 53: 
nit: only one empty line, to match context


http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@84
PS1, Line 84:   RUN_STEP_PIDS=()
Do you want to reset MSGS, too?




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:17:45 +0000
Gerrit-HasComments: Yes