You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Philip Zeyliger (Code Review)" <ge...@cloudera.org> on 2017/10/22 21:34:30 UTC

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Philip Zeyliger has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8354


Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................

IMPALA-6070: Parallel compute_table_stats.py

Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.

This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:

old, cold: 7m44s
new, cold: 1m42s

old, warm: 1m23s
new, warm:   48s

The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.

Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
---
M testdata/bin/compute-table-stats.sh
M tests/util/compute_table_stats.py
2 files changed, 58 insertions(+), 29 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/8354/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.
David Knupp has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@116
PS2, Line 116:   def client_factory():
Would it be possible to decorate client_factory as a @contextmanager, something like:

  from contextlib import contextmanager
  
  @contextmanager 
  def client_factory():
    impala_client = ImpalaBeeswaxClient(options.impalad,
        use_kerberos=options.use_kerberos, use_ssl=options.use_ssl)
    impala_client.connect()
    yield impala_client
    impala_client.close_connection()

...such that usage becomes:

  with client_factory() as impala_client:
    # etc...

Then you could then do away with the try/finally for closing the connection.



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 17:50:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 2: Code-Review+1

Thanks for adding the logging. Let's MikeB do the final +2.


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 17:46:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.
David Knupp has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py@37
PS1, Line 37:     statement = "compute stats %s" % (db_table,)
> They do. At first I thought I didn't care, but you're right, and I've fixed
Thank you! I wish we could get rid of the spurious print statements in our code. If we ever have to be python3 compatible, those will cause us a lot of grief.



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 17:52:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.
David Knupp has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@44
PS2, Line 44:       if not continue_on_error: raise e
Two nits: even though we tend to do this a lot, PEP 8 frowns on inline if statements.

Also, I believe that if you're not handling an exception, but rather just logging something and re-raising, then a bare raise statement is preferred. I'm not even sure you need to reference the base Exception class in this case, since you're catching all exceptions. E.g.

  try:
    fail()
  except:
    logging.exception("oops!")
    raise


http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@48
PS2, Line 48: 
I'm curious -- why is it necessary to define the client factory inside of the if __name__ == "__main__" block, and then pass a reference to it here? Wouldn't defining the method in the top-level module namespace be just as effective? It might be obvious, but I'm missing it.


http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@71
PS2, Line 71: _factory, db, table, continue_on_error,)
It seems to me like you don't need the intersection. You already have either set(all_tables) or set(table_names). Why not just:

  for table in selected_tables:

?



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 18:32:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@71
PS2, Line 71: end(pool.apply_async(compute_stats_table
> Thanks for the clarification. I actually missed that you do the same thing 
This script ignores errors by default, and that's not including silent issues like the one you mention. For its purposes: computing table stats on everything, it seems to do the trick. If I were shipping a "stats computation script", I wouldn't ship this one. (For one, it computes stats even if they've been computed.)

In fact, the current invocation tries to compute stats on a view in Kudu and fails, but the failure is ignored.



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 19:28:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Hello Michael Brown, David Knupp, Alex Behm, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8354

to look at the new patch set (#4).

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................

IMPALA-6070: Parallel compute_table_stats.py

Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.

This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:

old, cold: 7m44s
new, cold: 1m42s

old, warm: 1m23s
new, warm:   48s

The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.

Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
---
M testdata/bin/compute-table-stats.sh
M tests/util/compute_table_stats.py
2 files changed, 59 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/8354/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Hello Alex Behm, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8354

to look at the new patch set (#2).

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................

IMPALA-6070: Parallel compute_table_stats.py

Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.

This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:

old, cold: 7m44s
new, cold: 1m42s

old, warm: 1m23s
new, warm:   48s

The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.

Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
---
M testdata/bin/compute-table-stats.sh
M tests/util/compute_table_stats.py
2 files changed, 62 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/8354/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 2:

> Patch Set 2:
> 
> (1 comment)

Sure. I did the @contextmanager trick. I'm a little bit ambivalent about it, because it's kind of "fancy python", but I think it looks ok.


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 18:08:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py@37
PS1, Line 37:     statement = "compute stats %s" % (db_table,)
> Do these prints come out clean or all garbled due to the parallelism?
They do. At first I thought I didn't care, but you're right, and I've fixed it.

I switched to using python logging, which has its own logging to prevent interleaving. As a bonus, we get timestamps if we want to debug how long things take, and thread ids to make it obvious that there's parallelism.


http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py@119
PS1, Line 119:     impala_client.connect()
> please add newline to separate more cleanly
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 17:29:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 2:

BTW,

https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/481/consoleFull

did fine. It had both data load parallelism and table stats parallelism. Ran in 3h17m. A baseline is, say, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/479/, which ran for 3h52m. So, between these two changes, I've picked up about 35 minutes.

22:51:43 Computing table stats (logging to /home/ubuntu/Impala/logs/data_loading/compute-table-stats.log)... 
22:53:28   Computing table stats OK (Took: 1 min 45 sec)


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 2
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 17:34:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Hello Michael Brown, David Knupp, Alex Behm, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8354

to look at the new patch set (#3).

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................

IMPALA-6070: Parallel compute_table_stats.py

Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.

This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:

old, cold: 7m44s
new, cold: 1m42s

old, warm: 1m23s
new, warm:   48s

The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.

Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
---
M testdata/bin/compute-table-stats.sh
M tests/util/compute_table_stats.py
2 files changed, 58 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/8354/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1379/


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Oct 2017 20:28:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................

IMPALA-6070: Parallel compute_table_stats.py

Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.

This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:

old, cold: 7m44s
new, cold: 1m42s

old, warm: 1m23s
new, warm:   48s

The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.

Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Reviewed-on: http://gerrit.cloudera.org:8080/8354
Reviewed-by: David Knupp <dk...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M testdata/bin/compute-table-stats.sh
M tests/util/compute_table_stats.py
2 files changed, 59 insertions(+), 30 deletions(-)

Approvals:
  David Knupp: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 5
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 4: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Oct 2017 23:54:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@44
PS2, Line 44:       if not continue_on_error: raise e
> Two nits: even though we tend to do this a lot, PEP 8 frowns on inline if s
Sure; fixed.

(This was pre-existing.)


http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@48
PS2, Line 48: 
> I'm curious -- why is it necessary to define the client factory inside of t
It's a closure. Doing it inline in the main block lets the closure access options.impalad, and so on. Obviously, you could create a class, which takes as init parameters all the options, and then make this a class method. Or you could use globals.

I've not changed anything here.


http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@71
PS2, Line 71: _factory, db, table, continue_on_error,)
> It seems to me like you don't need the intersection. You already have eithe
Let's say all_tables = ["a", "b", "c"] and selected_tables = ["b", "c", "d"]. This code is trying to avoid running compute stats on "d", which has been selected but isn't in all tables.


http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@116
PS2, Line 116:     yield impala_client
> Would it be possible to decorate client_factory as a @contextmanager, somet
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 18:42:01 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1377/


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Oct 2017 20:24:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 1:

(2 comments)

Nice

http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py@37
PS1, Line 37:     print 'Executing: %s' % statement
Do these prints come out clean or all garbled due to the parallelism?


http://gerrit.cloudera.org:8080/#/c/8354/1/tests/util/compute_table_stats.py@119
PS1, Line 119:   compute_stats(client_factory, db_names=db_names,
please add newline to separate more cleanly



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 17:09:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.
David Knupp has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 21:09:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6070: Parallel compute table stats.py

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.
David Knupp has posted comments on this change. ( http://gerrit.cloudera.org:8080/8354 )

Change subject: IMPALA-6070: Parallel compute_table_stats.py
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py
File tests/util/compute_table_stats.py:

http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@48
PS2, Line 48: ror=False, par
> It's a closure. Doing it inline in the main block lets the closure access o
Thanks for the clarification.


http://gerrit.cloudera.org:8080/#/c/8354/2/tests/util/compute_table_stats.py@71
PS2, Line 71: end(pool.apply_async(compute_stats_table
> Let's say all_tables = ["a", "b", "c"] and selected_tables = ["b", "c", "d"
Thanks for the clarification. I actually missed that you do the same thing for dbs on L67. I wasn't considering the fact that selected_tables might contain a table that's not in all_tables, or that selected_dbs might contain a db that's not in all_dbs. So one last probably naive question. Does screening out values that don't exist in all_* represent a possible silent failure -- say, if someone invokes the script with --db_names followed by a misspelled value?



-- 
To view, visit http://gerrit.cloudera.org:8080/8354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Gerrit-Change-Number: 8354
Gerrit-PatchSet: 4
Gerrit-Owner: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Oct 2017 19:14:29 +0000
Gerrit-HasComments: Yes