Posted to issues@drill.apache.org by "salim achouche (JIRA)" <ji...@apache.org> on 2017/10/16 21:08:00 UTC
[jira] [Created] (DRILL-5879) Optimize "Like" operator
salim achouche created DRILL-5879:
-------------------------------------
Summary: Optimize "Like" operator
Key: DRILL-5879
URL: https://issues.apache.org/jira/browse/DRILL-5879
Project: Apache Drill
Issue Type: Improvement
Components: Execution - Relational Operators
Environment: *
Reporter: salim achouche
Assignee: salim achouche
Priority: Minor
Fix For: 1.12.0
Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
Improvement Opportunity
# Avoid recomputing isAscii (which requires a full scan of the input string) when the same column value is evaluated by more than one LIKE predicate
# Optimize the "contains" for-loop
Implementation Detail
1)
* Added a new integer variable "asciiMode" to the VarCharHolder class
* The default value is -1, which indicates that this information is not yet known
* Otherwise this value will be set to either 1 or 0
* The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
* The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
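The caching idea described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not Drill's code: CachedAsciiHolder is a hypothetical stand-in for VarCharHolder, and the constant names and the scans counter are assumptions added only to make the behavior visible.

```java
// Hypothetical stand-in for a value holder; NOT Drill's actual VarCharHolder.
final class CachedAsciiHolder {
    static final int UNKNOWN = -1;   // not computed yet (the default)
    static final int NOT_ASCII = 0;  // at least one byte has the high bit set
    static final int ASCII = 1;      // every byte is 7-bit ASCII

    final byte[] bytes;
    int asciiMode = UNKNOWN;
    int scans = 0; // illustration only: counts how many full scans were performed

    CachedAsciiHolder(byte[] bytes) {
        this.bytes = bytes;
    }

    // Scans the value on the first call only; later calls reuse the cached result.
    boolean isAscii() {
        if (asciiMode == UNKNOWN) {
            scans++;
            asciiMode = ASCII;
            for (byte b : bytes) {
                if (b < 0) { // high bit set, so the byte is outside 7-bit ASCII
                    asciiMode = NOT_ASCII;
                    break;
                }
            }
        }
        return asciiMode == ASCII;
    }
}
```

With this shape, a second LIKE predicate that receives the same holder instance would call isAscii() without touching the bytes again, which is the saving item 1 aims for.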
2)
* The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
* Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
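The issue does not say how the four loop variants differ. One plausible split, assumed here purely for illustration, is by pattern length, so that short patterns get a tight loop with a fixed number of comparisons per position. ContainsSketch and its method names are hypothetical, and the sketch matches on raw bytes rather than using Drill's own buffers.

```java
// Hypothetical sketch; the split by pattern length is an assumption, not taken from the issue.
final class ContainsSketch {
    // Dispatch once per pattern so each loop body stays small and branch-light.
    static boolean contains(byte[] in, byte[] pat) {
        switch (pat.length) {
            case 0:  return true; // '%%' matches any input
            case 1:  return contains1(in, pat[0]);
            case 2:  return contains2(in, pat[0], pat[1]);
            default: return containsN(in, pat);
        }
    }

    // Single-byte pattern: one comparison per input position.
    static boolean contains1(byte[] in, byte p0) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == p0) return true;
        }
        return false;
    }

    // Two-byte pattern: no inner loop, two comparisons at most per position.
    static boolean contains2(byte[] in, byte p0, byte p1) {
        for (int i = 0; i + 1 < in.length; i++) {
            if (in[i] == p0 && in[i + 1] == p1) return true;
        }
        return false;
    }

    // General case: check the first byte before entering the inner comparison.
    static boolean containsN(byte[] in, byte[] pat) {
        final byte first = pat[0];
        final int last = in.length - pat.length;
        for (int i = 0; i <= last; i++) {
            if (in[i] != first) continue;
            int j = 1;
            while (j < pat.length && in[i + j] == pat[j]) j++;
            if (j == pat.length) return true;
        }
        return false;
    }
}
```

Whether the JIT actually vectorizes loops like these depends on the JVM and hardware, so any gain from this kind of specialization would need to be measured rather than assumed.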
Benchmarks
* Lineitem table 100GB
* Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
* Before changes: 33 sec
* After changes: 27 sec
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)