Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/03/19 03:29:00 UTC

[jira] [Commented] (IMPALA-5393) Regexp should use THREAD_LOCAL context rather than FRAGMENT_LOCAL

    [ https://issues.apache.org/jira/browse/IMPALA-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795611#comment-16795611 ] 

ASF subversion and git services commented on IMPALA-5393:
---------------------------------------------------------

Commit 6e8c330f40da087ca0d8ba844cd9d97a8e60ff67 in impala's branch refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=6e8c330 ]

IMPALA-5393. Use THREAD_LOCAL state for regexp

This changes the built-in regexp-related UDFs to use THREAD_LOCAL
re2::RE2 instances instead of FRAGMENT_LOCAL ones.

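Although re2::RE2 is thread-safe, it achieves that thread safety through
a certain amount of internal locking, which becomes a point of contention
when several threads share one instance. Giving each thread its own regexp
avoids that contention and improves performance substantially.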
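
As a generic illustration of the thread-local idea (not the code from this
commit), the sketch below keeps one compiled regexp per thread so that
matching never touches state shared with other threads. It makes several
assumptions: std::regex stands in for RE2, C++ thread_local storage stands in
for Impala's THREAD_LOCAL function state, and the helper names
ThreadLocalRegex and MatchesAnywhere are hypothetical.

```cpp
#include <regex>
#include <string>
#include <unordered_map>

// Hypothetical helper, not Impala code: returns a compiled regex that is
// owned by the calling thread. Each thread gets its own cache, so no lock
// is needed and no compiled regex is ever shared between threads.
const std::regex& ThreadLocalRegex(const std::string& pattern) {
  // One cache per thread, created on the thread's first call.
  thread_local std::unordered_map<std::string, std::regex> cache;
  auto it = cache.find(pattern);
  if (it == cache.end()) {
    // Compile once per thread; std::regex throws std::regex_error on a
    // malformed pattern, in which case nothing is cached.
    it = cache.emplace(pattern, std::regex(pattern)).first;
  }
  return it->second;
}

// Hypothetical helper: true if `pattern` matches anywhere in `input`,
// using the calling thread's private compiled regex.
bool MatchesAnywhere(const std::string& input, const std::string& pattern) {
  return std::regex_search(input, ThreadLocalRegex(pattern));
}
```

The trade-off is the one the commit message describes: every thread pays for
its own compiled regexp in memory, in exchange for never contending with
other threads while matching.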

I ran a simple test query:

select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0

on a table with three underlying Parquet files (and therefore three scanner
threads). Prior to this change, the query took ~60 seconds and used 2m16s
of CPU time. With this change, it took ~19 seconds and 43s of CPU time,
roughly a 3x reduction in both. For a query with more scanner threads, the
improvement should be even more dramatic.

The only potential downside of this change is slightly increased memory
consumption from having one RE2 instance per thread, but the instances
themselves should be small relative to all of the other per-scanner-thread
memory.

Change-Id: Ibc331151a302e755701cb08adb3e6f289d54c3a6
Reviewed-on: http://gerrit.cloudera.org:8080/12772
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Todd Lipcon <to...@apache.org>


> Regexp should use THREAD_LOCAL context rather than FRAGMENT_LOCAL
> -----------------------------------------------------------------
>
>                 Key: IMPALA-5393
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5393
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Doug Cameron
>            Assignee: Todd Lipcon
>            Priority: Minor
>             Fix For: Impala 3.3.0
>
>
> The RE2 library uses mutex locking around some internal state structures. This causes severe lock contention and prevents the regexp string functions from scaling across CPUs.
> Switching to THREAD_LOCAL context will remove the contention.
> We could add a query option to select between FRAGMENT_LOCAL and THREAD_LOCAL, but that seems like overkill as the context is not huge.
>  


