You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2021/04/26 12:42:04 UTC

[jira] [Reopened] (ARROW-8681) [Rust][DataFusion] Improve like/nlike performance

     [ https://issues.apache.org/jira/browse/ARROW-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Lamb reopened ARROW-8681:
--------------------------------

> [Rust][DataFusion] Improve like/nlike performance
> -------------------------------------------------
>
>                 Key: ARROW-8681
>                 URL: https://issues.apache.org/jira/browse/ARROW-8681
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>    Affects Versions: 0.17.0
>            Reporter: Zhuang Tianyi
>            Priority: Minor
>              Labels: performance
>
> Currently, the implementation of  `like_utf8` and `nlike_utf8` is based on regex, which is simple and readable, but poor at the performance.
>  
> I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this repo, I compare three like algorithm.
>  # `like`(includes partial_like): this is the first naive version, using the recursive approach, which will cause terrible performance on special attack input, such as `a%a%a%a%a%a%a%a%b`.
>  # `like_to_regex`: which is almost the same as the current implementation in arrow.
>  # `like_optimize`: the like problem is similar to glob in shell, so a perfect solution is proposed in [https://research.swtch.com/glob] . The code in the research is written golang but I translate it to rust.
>  
> In my benchmark result, the recursive solution can be ignored due to bad time complexity lower bound.
> the regex solution will cost about 1000x time including regex compiling, and about 4x time without regex compiling then solution 3. And It seems that the code complexity of solution 3 is acceptable.
> Everyone can reproduce the benchmark result using this repo with a few codes.
>  
> I have submitted a PR to TiKV to optimize the like performance ([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)], without UTF-8 support), and add collation support in [https://github.com/tikv/tikv/pull/6592], which can be simply port to data-fusion.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)