You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2019/03/25 16:46:10 UTC

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Hello Lars Volker,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/12845

to review the following change.


Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................

Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp

This re-lands commit 6e8c330f40da087ca0d8ba844cd9d97a8e60ff67 which was
reverted in d3428a58d8f54d1a64d5aeb1af3f76b7ffcb53d0. The revert was due
to an assumption that this commit depended on the new version of re2
(which was correctly reverted due to a toolchain issue). In fact this
commit does not depend on any toolchain changes.

Original commit message follows
--------------------------------

This changes the built-in regexp-related UDFs to use THREAD_LOCAL
re2::RE instances instead of FRAGMENT_LOCAL.

Although re2::RE is thread-safe, it achieves that thread safety through
a certain amount of locking. Using thread-local regexps improves
performance substantially.

I ran a simple test query:

select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0

on a table with three underlying parquet files (thus getting 3 scanner
threads). Prior to this change, the query took ~60 seconds and burned
2m16sec CPU time. With this change, it took ~19sec and 43s CPU time. For
a query with more scanner threads, the improvement should be even more
dramatic.

The only potential downside of this change is slightly increased memory
consumption by having one RE instance per thread, but the REs themselves
should be small relative to all of the other per-scanner-thread memory.

Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
---
M be/src/exprs/string-functions-ir.cc
1 file changed, 6 insertions(+), 6 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/12845/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2533/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Mar 2019 17:28:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12845/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12845/1//COMMIT_MSG@9
PS1, Line 9: This re-lands commit 6e8c330f40da087ca0d8ba844cd9d97a8e60ff67 which was
Thanks for re-submitting this, and apologies for the unnecessary revert. As a minor note, we generally seem ok with just reverting the revert and submit that, including the "Revert: Revert: IMPALA-..." commit message.



-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Mar 2019 17:02:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3953/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 25 Mar 2019 20:02:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12845/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12845/1//COMMIT_MSG@9
PS1, Line 9: This re-lands commit 6e8c330f40da087ca0d8ba844cd9d97a8e60ff67 which was
> Thanks for re-submitting this, and apologies for the unnecessary revert. As
k, just wanted to give context on why it was re-landed. In the past I've seen reverts and resubmissions and been confused about the context when looking at the git history a year later.



-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 25 Mar 2019 20:02:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1:

Yeah FWIW I think Lars did the right thing in erring on the side of reverting in this case.


-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 25 Mar 2019 20:39:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................

Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp

This re-lands commit 6e8c330f40da087ca0d8ba844cd9d97a8e60ff67 which was
reverted in d3428a58d8f54d1a64d5aeb1af3f76b7ffcb53d0. The revert was due
to an assumption that this commit depended on the new version of re2
(which was correctly reverted due to a toolchain issue). In fact this
commit does not depend on any toolchain changes.

Original commit message follows
--------------------------------

This changes the built-in regexp-related UDFs to use THREAD_LOCAL
re2::RE instances instead of FRAGMENT_LOCAL.

Although re2::RE is thread-safe, it achieves that thread safety through
a certain amount of locking. Using thread-local regexps improves
performance substantially.

I ran a simple test query:

select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0

on a table with three underlying parquet files (thus getting 3 scanner
threads). Prior to this change, the query took ~60 seconds and burned
2m16sec CPU time. With this change, it took ~19sec and 43s CPU time. For
a query with more scanner threads, the improvement should be even more
dramatic.

The only potential downside of this change is slightly increased memory
consumption by having one RE instance per thread, but the REs themselves
should be small relative to all of the other per-scanner-thread memory.

Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Reviewed-on: http://gerrit.cloudera.org:8080/12845
Reviewed-by: Lars Volker <lv...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exprs/string-functions-ir.cc
1 file changed, 6 insertions(+), 6 deletions(-)

Approvals:
  Lars Volker: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> Yeah FWIW I think Lars did the right thing in erring on the side of reverting in this case.

yep, when in doubt, revert to keep builds green for sure


-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 25 Mar 2019 20:49:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] Re-land IMPALA-5393. Use THREAD LOCAL state for regexp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12845 )

Change subject: Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
......................................................................


Patch Set 1: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Gerrit-Change-Number: 12845
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 26 Mar 2019 00:52:20 +0000
Gerrit-HasComments: No