Posted to issues@drill.apache.org by "salim achouche (JIRA)" <ji...@apache.org> on 2017/10/16 21:08:00 UTC

[jira] [Created] (DRILL-5879) Optimize "Like" operator

salim achouche created DRILL-5879:
-------------------------------------

             Summary: Optimize "Like" operator
                 Key: DRILL-5879
                 URL: https://issues.apache.org/jira/browse/DRILL-5879
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Relational Operators
         Environment: * 
            Reporter: salim achouche
            Assignee: salim achouche
            Priority: Minor
             Fix For: 1.12.0


Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';

Improvement Opportunity
# Avoid recomputing isAscii (which requires a full scan of the input string) when the same column is evaluated by more than one LIKE predicate
# Optimize the "contains" for-loop

Implementation Detail
1)
* Added a new integer variable "asciiMode" to the VarCharHolder class
* The default value is -1, which indicates that this information is not known
* Otherwise the value is set to either 1 or 0
* The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
* The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
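
The caching described in 1) could look roughly like the sketch below. It is an illustration, not the actual patch: HolderSketch is a hypothetical stand-in for VarCharHolder, the constant names and the "scans" counter are invented for the example, and the mapping 1 = ASCII / 0 = not ASCII is an assumption (the description above only says the value is 1 or 0).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Drill's VarCharHolder; only what the sketch needs.
final class HolderSketch {
    static final int ASCII_UNKNOWN = -1; // default: not yet computed
    static final int ASCII_NO = 0;       // assumed meaning of 0
    static final int ASCII_YES = 1;      // assumed meaning of 1

    final byte[] buffer;
    final int start;
    final int end;
    int asciiMode = ASCII_UNKNOWN;
    int scans = 0; // illustration only: counts how often the input was scanned

    HolderSketch(String value) {
        this.buffer = value.getBytes(StandardCharsets.UTF_8);
        this.start = 0;
        this.end = buffer.length;
    }
}

final class AsciiModeSketch {
    // Scans the input only if asciiMode is still unknown, then caches the
    // result on the holder so later LIKE evaluations can reuse it.
    static boolean isAscii(HolderSketch h) {
        if (h.asciiMode == HolderSketch.ASCII_UNKNOWN) {
            h.scans++;
            int mode = HolderSketch.ASCII_YES;
            for (int i = h.start; i < h.end; i++) {
                if (h.buffer[i] < 0) { // high bit set: byte of a multi-byte UTF-8 sequence
                    mode = HolderSketch.ASCII_NO;
                    break;
                }
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == HolderSketch.ASCII_YES;
    }
}
```

With a query such as colA like '%a%' or colA like '%xyz%', both predicates would call isAscii on the same holder, and only the first call would touch the input bytes.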

2) 
* The "Contains" LIKE operation is quite expensive because the code must scan the input string and perform character-based comparisons
* Created 4 versions of the same for-loop to a) make the loop simpler to optimize (vectorization) and b) minimize comparisons
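
The description above does not say how the four loop versions differ. The sketch below assumes one plausible split, by pattern length (1, 2, 3 bytes, and a general case), to show how specialized loops keep each loop body small and cut comparisons. The class and method names are hypothetical, and matching is done on UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of length-specialized "contains" loops.
final class ContainsSketch {
    // Dispatches once per pattern, so the hot loop has no length checks.
    static boolean contains(byte[] in, byte[] pat) {
        switch (pat.length) {
            case 0:  return true; // '%%' matches any non-null input
            case 1:  return contains1(in, pat[0]);
            case 2:  return contains2(in, pat[0], pat[1]);
            case 3:  return contains3(in, pat[0], pat[1], pat[2]);
            default: return containsN(in, pat);
        }
    }

    static boolean contains1(byte[] in, byte p0) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == p0) return true;
        }
        return false;
    }

    static boolean contains2(byte[] in, byte p0, byte p1) {
        final int last = in.length - 2;
        for (int i = 0; i <= last; i++) {
            if (in[i] == p0 && in[i + 1] == p1) return true;
        }
        return false;
    }

    static boolean contains3(byte[] in, byte p0, byte p1, byte p2) {
        final int last = in.length - 3;
        for (int i = 0; i <= last; i++) {
            if (in[i] == p0 && in[i + 1] == p1 && in[i + 2] == p2) return true;
        }
        return false;
    }

    // General case: test the first byte before entering the inner loop.
    static boolean containsN(byte[] in, byte[] pat) {
        final int last = in.length - pat.length;
        final byte p0 = pat[0];
        for (int i = 0; i <= last; i++) {
            if (in[i] != p0) continue;
            int j = 1;
            while (j < pat.length && in[i + j] == pat[j]) j++;
            if (j == pat.length) return true;
        }
        return false;
    }

    // Convenience overload for trying the sketch with strings.
    static boolean contains(String in, String pat) {
        return contains(in.getBytes(StandardCharsets.UTF_8),
                        pat.getBytes(StandardCharsets.UTF_8));
    }
}
```

For the benchmark predicates, '%a%' would take the one-byte loop and '%the%' the three-byte loop under this assumed split.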

Benchmarks
* Lineitem table 100GB
* Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
* Before changes: 33sec
* After changes: 27sec


