Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/18 19:09:00 UTC
[jira] [Commented] (DRILL-5879) Optimize "Like" operator

    [ https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209835#comment-16209835 ] 

ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1001

    JIRA DRILL-5879: Like operator performance improvements

    - Custom code was recently added to handle the common LIKE search patterns:
      Contains, Starts With, and Ends With
    - The custom code is much faster than the generic regex-based implementation
    - This pull request further improves the Contains pattern, since it is more CPU intensive than the other patterns
    
    Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
    Improvement Opportunities
    - Avoid repeating the isAscii computation (a full scan of the input string), since both predicates read the same column
    - Optimize the "contains" for-loop
    Implementation Details
    1)
    - Added a new integer variable "asciiMode" to the VarCharHolder class
    - The default value is -1, which indicates that the mode is not yet known
    - Otherwise the value is set to 1 or 0, depending on whether the string is ASCII or Unicode
    - The execution plan already shares the same VarCharHolder instance across all evaluations of the same column value
    - asciiMode is set during the first LIKE evaluation and reused by subsequent LIKE evaluations
    2)
    - The "Contains" LIKE operation is expensive because the code must scan the input string and compare it character by character
    - Created 4 versions of the same for-loop to a) make the loop simpler to optimize (vectorization) and b) minimize comparisons
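
The asciiMode caching described in item 1 can be illustrated with a minimal sketch. This is not Drill's code: the `Holder` class is a simplified, hypothetical stand-in for VarCharHolder, the `fullScans` counter exists only to make the effect visible, and the mapping 1 = ASCII, 0 = non-ASCII is an assumption read from the order of the wording above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not Drill's actual VarCharHolder or LIKE code.
final class AsciiModeSketch {
    static final int UNKNOWN = -1;   // default: mode has not been computed yet
    static final int NON_ASCII = 0;  // assumed value for "Unicode"
    static final int ASCII = 1;      // assumed value for "ASCII"

    /** Simplified stand-in for a holder shared by all LIKE evaluations of one value. */
    static final class Holder {
        final byte[] bytes;
        int asciiMode = UNKNOWN;

        Holder(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }
    }

    static int fullScans = 0; // instrumentation for this example only

    /** Scans the bytes at most once per holder; later calls reuse the cached mode. */
    static boolean isAscii(Holder h) {
        if (h.asciiMode == UNKNOWN) {
            fullScans++;
            int mode = ASCII;
            for (byte b : h.bytes) {
                if (b < 0) {          // high bit set, so not 7-bit ASCII
                    mode = NON_ASCII;
                    break;
                }
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == ASCII;
    }

    public static void main(String[] args) {
        Holder h = new Holder("quickly final deposits");
        isAscii(h); // first LIKE evaluation performs the scan
        isAscii(h); // second LIKE evaluation reuses the cached result
        System.out.println("full scans: " + fullScans);
    }
}
```

The tri-state integer lets "not yet computed" be distinguished from both computed outcomes without a second field, which is presumably why an int rather than a boolean was chosen.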
    Benchmarks
    - Lineitem table 100GB
    - Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
    - Before changes: 33sec
    - After changes: 27sec
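
For readers wondering what "4 versions of the same for-loop" in item 2 could look like, here is one possible sketch. The description does not say how the four variants differ; this example assumes they are specialized by pattern length (1, 2, 3, and a general case), and every class and method name below is hypothetical. It compares raw bytes, which assumes both the input and the pattern use the same byte encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "contains" matcher specialized by pattern length.
final class ContainsSketch {

    /** Dispatches once per call so each loop body stays small and branch-light. */
    static boolean contains(byte[] in, byte[] pat) {
        switch (pat.length) {
            case 0:  return true; // an empty pattern matches any input
            case 1:  return contains1(in, pat[0]);
            case 2:  return contains2(in, pat[0], pat[1]);
            case 3:  return contains3(in, pat[0], pat[1], pat[2]);
            default: return containsN(in, pat);
        }
    }

    // Variant 1: single-byte pattern, one comparison per position.
    static boolean contains1(byte[] in, byte p0) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == p0) return true;
        }
        return false;
    }

    // Variant 2: two-byte pattern, no inner loop.
    static boolean contains2(byte[] in, byte p0, byte p1) {
        for (int i = 0; i + 1 < in.length; i++) {
            if (in[i] == p0 && in[i + 1] == p1) return true;
        }
        return false;
    }

    // Variant 3: three-byte pattern, no inner loop.
    static boolean contains3(byte[] in, byte p0, byte p1, byte p2) {
        for (int i = 0; i + 2 < in.length; i++) {
            if (in[i] == p0 && in[i + 1] == p1 && in[i + 2] == p2) return true;
        }
        return false;
    }

    // Variant 4: general case, naive scan with an inner comparison loop.
    static boolean containsN(byte[] in, byte[] pat) {
        final int last = in.length - pat.length;
        outer:
        for (int i = 0; i <= last; i++) {
            for (int j = 0; j < pat.length; j++) {
                if (in[i + j] != pat[j]) continue outer;
            }
            return true;
        }
        return false;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] comment = utf8("carefully final the deposits");
        System.out.println(contains(comment, utf8("a")));
        System.out.println(contains(comment, utf8("the")));
    }
}
```

Unrolling the short-pattern cases removes the inner loop entirely, which keeps the hot loop simple for the JIT compiler; whether it is actually vectorized depends on the JVM and is not guaranteed by this structure.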


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill yodlee-cherry-pick

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1001
    
----
commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche <sa...@gmail.com>
Date:   2017-10-18T18:40:18Z

    JIRA 5879: Like operator performance improvements

----


> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: * 
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
> # Optimize the "contains" for-loop 
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
> 2) 
> * The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes    : 27sec



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)