Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/06/10 18:49:38 UTC

[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing

Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152641627

   @gortiz Thanks for the detailed analysis. This is awesome, and very interesting!
   
   > ## We cannot use this optimization with Lucene
   > The cost is too high when the expression gets complex. It even fails with `java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states.` when the expression has 10 branches.
   > 
   
   Curious how the native FST handles it, and why it is so much faster than the Lucene FST. cc @atris
   
   > ## In general, a `COL like 'whatever'` expression is 2x-3x slower than `regexp_like(COL, 'similar expression')`
   > Reading the code, it looks to me like Pinot automatically translates one to the other, but that transformation is not applied when `BaseQueriesTest.getBrokerResponse` is called (which is great, because we can test the difference).
   
   I think the reason is that it is not an apples-to-apples comparison: `col LIKE 'abc'` is actually equivalent to `regexp_like(col, '^abc$')`.
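
   To make the difference concrete, here is a minimal sketch using plain `java.util.regex`. It assumes `regexp_like` does a partial (`Matcher.find()`-style) match, which is worth double-checking against the predicate evaluator; the class and method names are made up for illustration.

   ```java
   import java.util.regex.Pattern;

   public class LikeVsRegexpLike {
       // Partial match: true if the regex matches anywhere in the value.
       // Assumption: this is how regexp_like(col, regex) behaves.
       static boolean partialMatch(String regex, String value) {
           return Pattern.compile(regex).matcher(value).find();
       }

       public static void main(String[] args) {
           // Unanchored pattern: matches a value that merely contains "abc".
           System.out.println(partialMatch("abc", "xabcx"));
           // Anchored pattern, the counterpart of LIKE 'abc': the whole value must be "abc".
           System.out.println(partialMatch("^abc$", "xabcx"));
           System.out.println(partialMatch("^abc$", "abc"));
       }
   }
   ```

   So a fair benchmark would compare `LIKE 'abc'` against the anchored regex, or `LIKE '%abc%'` against the unanchored one.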
   
   > ## Pinot regex executes wildcards faster than single values
   > It is quite clear in the optimal cases:
   > 
   > ```
   > -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   > BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
   > -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
   > BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op
   > ```
   > 
   > At first I thought it was an error in my code, but it is easy to reproduce using the github_events dataset.
   > 
   > ```
   > select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1
   > ```
   > 
   > Responds considerably faster than
   > 
   > ```
   > select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1
   > ```
   > 
   > The difference is not the 2000x that the benchmark shows, but it is consistently between 2x and 3x.
   > 
   > I'm guessing that the issue actually comes from the Java regex engine, but it is just a hunch; I didn't actually test it.
   
   Let's redo the benchmark. I don't believe there can be a 2000x difference. Also, this is very counter-intuitive, so I suggest we figure out whether it actually comes from the Java regex engine or from our code.
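
   One way to separate the two is to time the same patterns against `java.util.regex` alone, with no Pinot code involved. The following is only a rough sketch: the data shape and all names are assumptions, and a hand-rolled timing loop is far less reliable than JMH, so treat the numbers as a hint at most.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class RegexEngineCheck {
       // Keeps results observable so the JIT cannot drop the matching loop.
       static volatile int sink;

       // Synthetic values; the real benchmark data may be shaped differently.
       static List<String> values(int n) {
           List<String> out = new ArrayList<>(n);
           for (int i = 0; i < n; i++) {
               out.add("www.domain" + (i % 10) + ".example/" + i);
           }
           return out;
       }

       // Counts values containing a match, reusing a single Matcher.
       static int countMatches(String regex, List<String> values) {
           Matcher matcher = Pattern.compile(regex).matcher("");
           int count = 0;
           for (String value : values) {
               if (matcher.reset(value).find()) {
                   count++;
               }
           }
           return count;
       }

       // Best wall-clock time over several rounds, in nanoseconds.
       static long bestNanos(String regex, List<String> values, int rounds) {
           long best = Long.MAX_VALUE;
           for (int r = 0; r < rounds; r++) {
               long start = System.nanoTime();
               sink = countMatches(regex, values);
               best = Math.min(best, System.nanoTime() - start);
           }
           return best;
       }

       public static void main(String[] args) {
           List<String> data = values(1_000_000);
           System.out.println("domain\\d: " + bestNanos("domain\\d", data, 10) + " ns");
           System.out.println("domain0:  " + bestNanos("domain0", data, 10) + " ns");
       }
   }
   ```

   If the two patterns take similar time here, the gap most likely comes from our code rather than from the regex engine.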
   
   > ### When there are several regexps in the actual expression, indexes are very useful.
   > Examples:
   > 
   > ```
   > BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
   > BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
   > BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op
   > 
   > BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
   > BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
   > BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op
   > ```
   
   I feel the inefficiency comes from `regexp_like(DOMAIN_NAMES, 'domain0')` (exact match), which takes much longer than expected. We should check what is causing the huge performance difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

