You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Padma Penumarthy (JIRA)" <ji...@apache.org> on 2017/08/10 21:53:00 UTC
[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

    [ https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 9:52 PM:
------------------------------------------------------------------

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, we directly read the byte as character from PlatformDependent. Else, we decode UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
1m 6 sec

For all ASCII, since getByte is doing a bounds check every time we call it, I want to see if getting the bytes  in one shot is better. That did not help much with performance. In fact, it made it worse for 'a%' type of  match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)

Next, I tried building charBuffer always (even if it is all ASCII) and use String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘a%’
46  (vs 9.794s baseline)

I tried Google RE2 library. Got much worse numbers than what we are getting with Java Regex Library.

Finally, I implemented simple character by character comparison functions for each of the special cases 
and got pretty good numbers.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, we directly read the byte as character from PlatformDependent. Else, we decode UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
1m 6 sec

For all ASCII, since getByte is doing a bounds check every time we call it, I want to see if getting the bytes  in one shot is better. That did not help much with performance. In fact, it made it worse for 'a%' type of  match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)

Next, I tried building charBuffer always (even if it is all ASCII) and use String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘a%’
46  (vs 9.794s baseline)

I tried Google RE2 library. Got much worse numbers than what we are getting with Java Regex Library.

Finally, I implemented simple character by character comparison functions for each of the special cases 
and got pretty good numbers.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
11.34s (vs. 1m 6s baseline)













> Improve performance of filter operator for pattern matching
> -----------------------------------------------------------
>
>                 Key: DRILL-5697
>                 URL: https://issues.apache.org/jira/browse/DRILL-5697
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Flow
>    Affects Versions: 1.11.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>
> Queries using filter with sql like operator use Java regex library for pattern matching. However, for cases like %abc (ends with abc), abc% (starts with abc), %abc% (contains abc), it is observed that implementing these cases with simple code instead of using regex library provides good performance boost (4-6x). Idea is to use special case code for simple, common cases and fall back to Java regex library for complicated ones. That will provide good performance benefit for most common cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)