Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/18 19:09:00 UTC
[jira] [Commented] (DRILL-5879) Optimize "Like" operator
[ https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209835#comment-16209835 ]
ASF GitHub Bot commented on DRILL-5879:
---------------------------------------
GitHub user sachouche opened a pull request:
https://github.com/apache/drill/pull/1001
JIRA DRILL-5879: Like operator performance improvements
- Custom code was recently added to handle the common LIKE search patterns: Contains, Starts With, and Ends With
- This custom code is much faster than the generic RegEx-based implementation
- This pull request further improves the Contains pattern, since it is more CPU-intensive than the other two
Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
Improvement Opportunities
Avoid recomputing isAscii (which requires a full scan of the input string), since the query evaluates the same column twice
Optimize the "contains" for-loop
Implementation Details
1)
Added a new integer variable "asciiMode" to the VarCharHolder class
The default value is -1, which indicates that the encoding is not yet known
Otherwise the value is set to either 1 or 0, depending on whether the string is ASCII or Unicode
The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
asciiMode is set during the first LIKE evaluation and reused by the other LIKE evaluations of that value
2)
The "Contains" LIKE operation is expensive because the code must read the input string to perform character-based comparisons
Created four versions of the same for-loop to a) make each loop simpler to optimize (vectorization) and b) minimize comparisons
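The two changes described above can be sketched roughly as follows. This is a minimal illustration, not the code in the pull request: VarCharHolderSketch and LikeContainsSketch are hypothetical stand-ins for Drill's classes, the mapping 1 = ASCII and 0 = Unicode is assumed, and so is the idea that the loop variants are specialized by pattern length (the description does not say how the four versions differ). The asciiScans counter exists only to make the caching observable.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a VarChar value holder; not Drill's actual class. */
final class VarCharHolderSketch {
    static final int UNKNOWN = -1;   // default: encoding not yet determined
    static final int NON_ASCII = 0;  // assumed mapping, see lead-in
    static final int ASCII = 1;

    final byte[] bytes;              // UTF-8 encoded column value
    int asciiMode = UNKNOWN;         // cached result of the ASCII scan

    VarCharHolderSketch(String value) {
        this.bytes = value.getBytes(StandardCharsets.UTF_8);
    }
}

final class LikeContainsSketch {
    /** Counts full ASCII scans so the effect of caching can be observed. */
    static int asciiScans = 0;

    /** Scans the value at most once per holder, then reuses the cached flag. */
    static boolean isAscii(VarCharHolderSketch h) {
        if (h.asciiMode == VarCharHolderSketch.UNKNOWN) {
            asciiScans++;
            int mode = VarCharHolderSketch.ASCII;
            for (byte b : h.bytes) {
                if (b < 0) {         // high bit set: not a 7-bit ASCII byte
                    mode = VarCharHolderSketch.NON_ASCII;
                    break;
                }
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == VarCharHolderSketch.ASCII;
    }

    /** Evaluates value LIKE '%pattern%' for a literal pattern without wildcards. */
    static boolean contains(VarCharHolderSketch h, String pattern) {
        if (!isAscii(h)) {
            // Non-ASCII input: decode and fall back to the generic search.
            return new String(h.bytes, StandardCharsets.UTF_8).contains(pattern);
        }
        byte[] p = pattern.getBytes(StandardCharsets.UTF_8);
        byte[] s = h.bytes;
        switch (p.length) {          // assumed specialization by pattern length
            case 0:  return true;
            case 1:  return contains1(s, p[0]);
            case 2:  return contains2(s, p[0], p[1]);
            case 3:  return contains3(s, p[0], p[1], p[2]);
            default: return containsN(s, p);
        }
    }

    private static boolean contains1(byte[] s, byte p0) {
        for (int i = 0; i < s.length; i++) {
            if (s[i] == p0) return true;
        }
        return false;
    }

    private static boolean contains2(byte[] s, byte p0, byte p1) {
        for (int i = 0; i < s.length - 1; i++) {
            if (s[i] == p0 && s[i + 1] == p1) return true;
        }
        return false;
    }

    private static boolean contains3(byte[] s, byte p0, byte p1, byte p2) {
        for (int i = 0; i < s.length - 2; i++) {
            if (s[i] == p0 && s[i + 1] == p1 && s[i + 2] == p2) return true;
        }
        return false;
    }

    private static boolean containsN(byte[] s, byte[] p) {
        final byte first = p[0];
        for (int i = 0; i <= s.length - p.length; i++) {
            if (s[i] != first) continue;   // cheap reject on the first byte
            int j = 1;
            while (j < p.length && s[i + j] == p[j]) j++;
            if (j == p.length) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        VarCharHolderSketch h = new VarCharHolderSketch("carefully final deposits");
        boolean first = contains(h, "xyz");   // triggers the one ASCII scan
        boolean second = contains(h, "a");    // reuses the cached asciiMode
        System.out.println((first || second) + ", scans=" + asciiScans);
    }
}
```

Keeping each short-pattern loop free of an inner loop is what makes it a plausible candidate for compiler optimization; whether a given JVM actually vectorizes it is not something this sketch demonstrates.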
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33 sec
After changes: 27 sec
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachouche/drill yodlee-cherry-pick
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1001.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1001
----
commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche <sa...@gmail.com>
Date: 2017-10-18T18:40:18Z
JIRA 5879: Like operator performance improvements
----
> Optimize "Like" operator
> ------------------------
>
> Key: DRILL-5879
> URL: https://issues.apache.org/jira/browse/DRILL-5879
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators
> Environment: *
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Minor
> Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
> # Optimize the "contains" for-loop
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
> 2)
> * The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes : 27sec