You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by "Zhuang Tianyi (Jira)" <ji...@apache.org> on 2020/05/03 21:04:00 UTC

[jira] [Created] (ARROW-8681) Improve like/nlike performance

Zhuang Tianyi created ARROW-8681:
------------------------------------

             Summary: Improve like/nlike performance
                 Key: ARROW-8681
                 URL: https://issues.apache.org/jira/browse/ARROW-8681
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust, Rust - DataFusion
    Affects Versions: 0.17.0
            Reporter: Zhuang Tianyi


Currently, the implementation of  `like_utf8` and `nlike_utf8` is based on regex, which is simple and readable, but poor at the performance.

 

I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this repo, I compare three like algorithm.
 # `like`(includes partial_like): this is the first naive version, using the recursive approach, which will cause terrible performance on special attack input, such as `a%a%a%a%a%a%a%a%b`.
 # `like_to_regex`: which is almost the same as the current implementation in arrow.
 # `like_optimize`: the like problem is similar to glob in shell, so a perfect solution is proposed in [https://research.swtch.com/glob] . The code in the research is written golang but I translate it to rust.

 

In my benchmark result, the recursive solution can be ignored due to bad time complexity lower bound.

the regex solution will cost about 1000x time including regex compiling, and about 4x time without regex compiling then solution 3. And It seems that the code complexity is acceptable.

Everyone can reproduce the benchmark result using this repo with a few code.

 

I have submitted a PR to TiKV to optimize the like performance ([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)], without UTF-8 support), and add collation support in [https://github.com/tikv/tikv/pull/6592], which can be simply port to data-fusion.

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)