You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/06/02 10:34:31 UTC

[GitHub] [pinot] gortiz opened a new pull request, #8818: text_match fusing

gortiz opened a new pull request, #8818:
URL: https://github.com/apache/pinot/pull/8818

   When executing queries like:
   ```sql
   select col1, col2 from Table where text_match(col1, '/r1/') or text_match(col1, '/r2/')
   ```
   Pinot has to scan the referred column twice. This PR creates an optimization that tries to fuse boolean algebra and `text_match` predicates. Specifically, as indicated in the Javadoc:
   - Queries `where text_match(col1, '/r1/') and text_match(col1, '/r2/')` will be translated to `where text_match(col1, '/(?=r1)(?=r2)/')`
   - Queries `where text_match(col1, '/r1/') or text_match(col1, '/r2/')` will be translated to `where text_match(col1, '/(?:r1)|(?:r2)/')`
   - Queries `where not text_match(col1, '/r1/')` will be translated to `where text_match(col1, '/(?!r1)/')`
   
   There are some tests that apply the optimization to more advanced cases.
   
   Regex can be quite complex and it isn't clear to me how expressive is the regex language we (and lucene) support. By doing some analysis I'm sure that this optimization will break some regex like the ones that use backreferences.
   
   To know if the optimization can be applied or not, it would be necessary to analyze the regex, which would require to parse the regex into a AST. As far as I know Pinot doesn't have that, so this optimization is disabled by default and can be enabled by activating a new query option.
   
   Future improvements may include to translate some `text_match` predicated that are not regex to regex, so we would be able to merge things like `where text_match(col1, '/r1/') or text_match(col1, '"literal1" and "literal2"')` into `where text_match(col1, '/(?:r1)|(?=(?:literal1)|(?:literal2))/')` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145782478

   ORMs are not going to generate them, but a user may try to do something like `regexp_like(col1, '(.+)=\1')`, which will be positive in texts like `a=a` but false in `b=a`. I think this is very bizarre, but possible. 
   
   What I'm going to do is to try to detect the usage of backreferences in the regexp. To do so, I'm going to discard any `regexp_like` whose predicate matches with `.*(?:(?:\\[0-9]+)|(?:\\k\<\w+\>)).*`. The function may have false positives (for example the regexp`\\1` will be a false positive) and false negatives (where is more difficult to find examples, maybe playing with utf8?), but it may be good enough to activate the query option by default in a future release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154392815

   Just circling back here having taken a quick look at `BenchmarkFuseRegexp.decreasing9Fusing` - I got similar results to Gonzalo:
   
   ```
   Benchmark                              (_fstType)  Mode  Cnt   Score   Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing      LUCENE  avgt    5  32.525 ± 2.733  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing      NATIVE  avgt    5   0.487 ± 0.123  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing        null  avgt    5   0.845 ± 0.204  ms/op
   ```
   
    Lucene spends virtually all of its time in _minimization_ (which I mistakenly referred to as _determinization_ earlier):
   
   <img width="936" alt="Screenshot 2022-06-13 at 21 10 51" src="https://user-images.githubusercontent.com/16439049/173436813-a13a5830-ad7b-4a8c-8e5f-3cf8003640df.png">
    
   Now if I enable minimization in the native implementation by reverting #8237, the scores change drastically:
   
   ```
   Benchmark                              (_fstType)  Mode  Cnt    Score    Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing      LUCENE  avgt    5   32.399 ±  3.005  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing      NATIVE  avgt    5  366.520 ± 40.574  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing        null  avgt    5    0.819 ±  0.172  ms/op
   ```
   
   As do the profiles. Without minimization, there is no clear single bottleneck:
   <img width="879" alt="Screenshot 2022-06-13 at 21 14 47" src="https://user-images.githubusercontent.com/16439049/173437464-980859ce-e7c5-4e8f-b43e-ba609ab39396.png">
   
   With minimization, there is a single bottleneck, and it's in minimization :
   
   <img width="931" alt="Screenshot 2022-06-13 at 21 23 10" src="https://user-images.githubusercontent.com/16439049/173438712-5f8efc36-1f8d-4ea5-8bb6-b135d21ff4fc.png">
   
   
   As to whether this is cause for concern is beside the point; this is the root cause of the difference observed, rather than a quirk of the benchmark which needs to be analysed and removed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145792924

   > ORMs are not going to generate them
   
   On the contrary, this exact pattern has been observed generated by customer systems querying Pinot 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155165270

   I was able to invest more time in this topic today and I have to say that my initial benchmark was completely useless. Apart from some typos in the queries (which may have some impact), there were two problems that completely fooled me:
   - The implicit limit introduced by Pinot. As only the first 10 elements were returned, queries that actually found elements could finish faster.
   - Something is wrong with Lucene and Native indexes. I guess I'm not configuring them correctly, but they simply return 0 results when the regexp is a little bit complex (like having groups).
   
   Therefore I've changed all the tests to return a count and each iteration is actually verifying whether the expected value is returned or not, in which case the benchmark is stopped. As a result, all fusing tests fail whenever there is an index. I hope it is something related to my test (ie I didn't configure the index properly) and not some semantic discrepancy these indexes have in relation how each engine evaluates regexp, as that could mean that there are customers receiving false results.
   
   These are the results I've got. Note that some combinations are not shown. For example, decreasing9Fusing and LUCENE or NATIVE. That means that these combinations did not return what it was expected (in all these particular cases it means that they return 0 rows). Note that apart from the fusing case, benchmarks like `optimal10`, which is `regexp_like(DOMAIN_NAMES, 'domain\d')` did also failed.
   
   ```
   Benchmark                                  (_fstType)  Mode  Cnt     Score      Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing            null  avgt    5  1438.185 ±   73.963  ms/op
   BenchmarkFuseRegexp.decreasing9Like            LUCENE  avgt    5    41.927 ±    0.984  ms/op
   BenchmarkFuseRegexp.decreasing9Like            NATIVE  avgt    5    40.232 ±    5.597  ms/op
   BenchmarkFuseRegexp.decreasing9Like              null  avgt    5  3957.902 ±  154.228  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           LUCENE  avgt    5    40.744 ±    2.059  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           NATIVE  avgt    5    61.443 ±    1.301  ms/op
   BenchmarkFuseRegexp.decreasing9Regex             null  avgt    5  2116.747 ±  602.570  ms/op
   BenchmarkFuseRegexp.increasing10Fusing           null  avgt    5  1320.869 ±   45.509  ms/op
   BenchmarkFuseRegexp.increasing10Like           LUCENE  avgt    5    42.253 ±    0.695  ms/op
   BenchmarkFuseRegexp.increasing10Like           NATIVE  avgt    5    39.049 ±    1.829  ms/op
   BenchmarkFuseRegexp.increasing10Like             null  avgt    5  5071.207 ± 3134.887  ms/op
   BenchmarkFuseRegexp.increasing10Regex          LUCENE  avgt    5    44.084 ±    2.737  ms/op
   BenchmarkFuseRegexp.increasing10Regex          NATIVE  avgt    5    39.650 ±    1.567  ms/op
   BenchmarkFuseRegexp.increasing10Regex            null  avgt    5  2318.479 ±   75.632  ms/op
   BenchmarkFuseRegexp.optimal10                    null  avgt    5   185.757 ±    3.426  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          LUCENE  avgt    5     0.212 ±    0.136  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          NATIVE  avgt    5     0.201 ±    0.140  ms/op
   BenchmarkFuseRegexp.optimal10NotFound            null  avgt    5   248.525 ±   60.819  ms/op
   BenchmarkFuseRegexp.optimal1Like               LUCENE  avgt    5    16.431 ±    3.533  ms/op
   BenchmarkFuseRegexp.optimal1Like               NATIVE  avgt    5    16.586 ±    0.828  ms/op
   BenchmarkFuseRegexp.optimal1Like                 null  avgt    5   468.731 ±  185.623  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       LUCENE  avgt    5     0.197 ±    0.062  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       NATIVE  avgt    5     0.170 ±    0.072  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound         null  avgt    5   477.176 ±  129.445  ms/op
   BenchmarkFuseRegexp.optimal1Regex              LUCENE  avgt    5    15.485 ±    0.517  ms/op
   BenchmarkFuseRegexp.optimal1Regex              NATIVE  avgt    5    16.711 ±    0.857  ms/op
   BenchmarkFuseRegexp.optimal1Regex                null  avgt    5   205.676 ±    6.921  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      LUCENE  avgt    5     0.212 ±    0.123  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      NATIVE  avgt    5     0.186 ±    0.029  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound        null  avgt    5   239.459 ±   13.846  ms/op
   BenchmarkFuseRegexp.selective2Fusing             null  avgt    5   541.530 ±  168.173  ms/op
   BenchmarkFuseRegexp.selective2Like             LUCENE  avgt    5    19.961 ±    0.937  ms/op
   BenchmarkFuseRegexp.selective2Like             NATIVE  avgt    5    20.387 ±    1.953  ms/op
   BenchmarkFuseRegexp.selective2Like               null  avgt    5   916.620 ±   96.702  ms/op
   BenchmarkFuseRegexp.selective2Regex            LUCENE  avgt    5    23.366 ±    0.795  ms/op
   BenchmarkFuseRegexp.selective2Regex            NATIVE  avgt    5    20.037 ±    0.601  ms/op
   BenchmarkFuseRegexp.selective2Regex              null  avgt    5   389.965 ±    7.284  ms/op
   ```
   
   I've also changed `RegexpPatternConverterUtils` (and its relative test) to do not introduce useless `^.*` at the beginning of the expression nor `.*$` at the end. I've repeated the benchmark with that change and these are the results I've got. As you can see, most benchmarks didn't change that much, but the ones that use like and have no index are twice as fast, so they achieve the same numbers than their equivalent `optimalXRegex`, as expected.
   
   ```
   Benchmark                                  (_fstType)  Mode  Cnt     Score     Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing            null  avgt    5  1365.403 ±  91.822  ms/op
   BenchmarkFuseRegexp.decreasing9Like            LUCENE  avgt    5    40.553 ±   1.394  ms/op
   BenchmarkFuseRegexp.decreasing9Like            NATIVE  avgt    5    39.057 ±   0.905  ms/op
   BenchmarkFuseRegexp.decreasing9Like              null  avgt    5  2046.090 ±  40.428  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           LUCENE  avgt    5    42.584 ±   2.387  ms/op
   BenchmarkFuseRegexp.decreasing9Regex           NATIVE  avgt    5    61.583 ±   1.893  ms/op
   BenchmarkFuseRegexp.decreasing9Regex             null  avgt    5  2054.523 ±  75.452  ms/op
   BenchmarkFuseRegexp.increasing10Fusing           null  avgt    5  1305.249 ±  65.576  ms/op
   BenchmarkFuseRegexp.increasing10Like           LUCENE  avgt    5    43.328 ±   2.040  ms/op
   BenchmarkFuseRegexp.increasing10Like           NATIVE  avgt    5    61.580 ±   1.149  ms/op
   BenchmarkFuseRegexp.increasing10Like             null  avgt    5  2272.112 ±  99.393  ms/op
   BenchmarkFuseRegexp.increasing10Regex          LUCENE  avgt    5    54.319 ±   2.048  ms/op
   BenchmarkFuseRegexp.increasing10Regex          NATIVE  avgt    5    39.080 ±   1.817  ms/op
   BenchmarkFuseRegexp.increasing10Regex            null  avgt    5  2267.062 ±  49.525  ms/op
   BenchmarkFuseRegexp.optimal10                    null  avgt    5   187.012 ±  13.054  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          LUCENE  avgt    5     0.210 ±   0.103  ms/op
   BenchmarkFuseRegexp.optimal10NotFound          NATIVE  avgt    5     0.201 ±   0.066  ms/op
   BenchmarkFuseRegexp.optimal10NotFound            null  avgt    5   242.423 ±  17.612  ms/op
   BenchmarkFuseRegexp.optimal1Like               LUCENE  avgt    5    16.395 ±   0.884  ms/op
   BenchmarkFuseRegexp.optimal1Like               NATIVE  avgt    5    15.506 ±   0.591  ms/op
   BenchmarkFuseRegexp.optimal1Like                 null  avgt    5   209.389 ±  11.456  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       LUCENE  avgt    5     0.209 ±   0.129  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound       NATIVE  avgt    5     0.172 ±   0.021  ms/op
   BenchmarkFuseRegexp.optimal1LikeNotFound         null  avgt    5   245.542 ±  35.323  ms/op
   BenchmarkFuseRegexp.optimal1Regex              LUCENE  avgt    5    15.421 ±   0.804  ms/op
   BenchmarkFuseRegexp.optimal1Regex              NATIVE  avgt    5    15.908 ±   0.491  ms/op
   BenchmarkFuseRegexp.optimal1Regex                null  avgt    5   204.505 ±  13.134  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      LUCENE  avgt    5     0.210 ±   0.089  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound      NATIVE  avgt    5     0.203 ±   0.107  ms/op
   BenchmarkFuseRegexp.optimal1RegexNotFound        null  avgt    5   235.262 ±   8.533  ms/op
   BenchmarkFuseRegexp.selective2Fusing             null  avgt    5   517.545 ± 193.977  ms/op
   BenchmarkFuseRegexp.selective2Like             LUCENE  avgt    5    23.622 ±   1.178  ms/op
   BenchmarkFuseRegexp.selective2Like             NATIVE  avgt    5    20.547 ±   0.778  ms/op
   BenchmarkFuseRegexp.selective2Like               null  avgt    5   399.638 ±  54.746  ms/op
   BenchmarkFuseRegexp.selective2Regex            LUCENE  avgt    5    23.883 ±   0.802  ms/op
   BenchmarkFuseRegexp.selective2Regex            NATIVE  avgt    5    20.113 ±   0.858  ms/op
   BenchmarkFuseRegexp.selective2Regex              null  avgt    5   416.987 ±  28.789  ms/op
   ```
   
   With this results, the optimization introduced in this PR doesn't seem to be amazing, but to be honest I'm quite worried about what is going on with regexp and Lucene/Native indexes. As said above, I hope that the benchmark is not correctly configuring the indexes, but even with that, it seems very dangerous that an index can be configured in such a way that it affects the semantics of the queries.
   
   @Jackie-Jiang @atris I would really appreciate if any of you can take a look at the benchmark trying to find what is going on.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145708838

   I don’t think it’s possible to express backtracking behaviour in a LIKE statement, this optimisation would target long unions of LIKE statements generated by ORMs, and should be most effective in reducing the impact on a cluster when there is no text or FST index in place. I expect this would also help FST based regular expression evaluation, so getting some numbers to justify extending to FST regex evaluators would be great.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154813143

   > Lastly, why do you say that indices don't help for single regexes? Unless I am misreading the benchmark result, even for a single regex case, native index is pretty fast?
   
   My benchmark may not be the best and it is only measuring a few regexp cases, but I wouldn't say that native gives us a significant advantage:
   
   ```
   -- this actually has a single, quite complex regex expression
   BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op <-- here native is faster by 1.8x
   BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   -- this actually has a single, quite complex regex expression
   BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op <-- here native is slower by 1.4x, although the difference is lower than the error margin
   BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    0.066  ms/op
   BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    0.011  ms/op  <-- here native is faster by 1.1x, although the difference is lower than the error margin
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
   ```
   
   So we have one case where native is 1.8x faster and some others were it is equivalent to the java regex engine, which is known by its lack of performance. I mean, it is not like the native index is atrocious, but I would expect more from an index than a situational 1.8x performance increase. 
   
   To be clear, as I already said, this is not a benchmark on the FST itself and it is very narrow, so I don't want to get conclusions about whether we should or we should not use these indexes. I think this benchmark proves that there are very specific cases where these indexes are not very useful and therefore it would be nice to invest time to design proper tests with different kind of regex to verify whether FST indexes are in general useful evaluating regex or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156032082

   > @gortiz another thing to remember is that neither Lucene nor native FST index follow the java regex pattern language.
   > 
   > If you look up the Lucene regex documentation, and look at the FST index and LIKE tests in Pinot, you will notice the difference.
   > 
   > Unfortunately, since we do not actually parse the regex engine until deep inside the FST territory, we have no way of reporting unknown characters. That may be a factor leading to 0 results for your queries
   
   Wow, that is shocking and I would see that as a huge problem to use Pinot in a big team. In cases like that I would expect to have a new operation that does the search using the Lucene expression language or whatever we want to use and fail if there is no index. What you are saying is that if I have a column that is not indexed and suddenly a colleague of mine decides to index that column, queries I had will start to return different (and incorrect) results. 
   
   Anyway, if that is what was decided, that is what we have. I'm going to close this PR, as it doesn't make sense to apply this optimization if it can be produce false results when someone adds an index.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on a diff in pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on code in PR #8818:
URL: https://github.com/apache/pinot/pull/8818#discussion_r888712252


##########
pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java:
##########
@@ -255,6 +255,9 @@ private BaseFilterOperator constructPhysicalOperator(FilterContext filter, int n
                   || textIndexReader instanceof NativeMutableTextIndex) {
                 throw new UnsupportedOperationException("TEXT_MATCH is not supported on native text index");
               }
+              if (textIndexReader == null) {
+                throw new UnsupportedOperationException("TEXT_MATCH is not supported when there is no text index");

Review Comment:
   This moves the correctness check from execution to parsing time, which should be better, but makes `ExplainPlanQueriesTest` fail.
   
   As indicated by its name, this tests does execute `explain plan` instead of the actual query. This is why the test didn't fail before. As this PR shouldn't be related to `text_match` but `regexp_like`, I'm going to remove this change for simplicity 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145814759

   I don’t believe the native FST supports backreferences, and I don’t think it would be easy to add support either. Not sure about the lucene FST index. As mentioned above, LIKE statements (e.g. LIKE ‘%foo%’) can’t express all regular expression features, which should reduce the scope of the edge cases to be considered.
   
   This definitely needs benchmarking for at least the following scenarios;
   * Single like statement
   * N like statements combined with OR
   
   for each of:
   * no FST index - both raw and dictionarized
   * Lucene FST index
   * Native FST index


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156125315

   Yeah, I completely understand that we should maintain the Lucene language when Natife FST was created. My concern is with the fact that when Lucene index was added, we decided to accept a different language in `regexp_match`. IMHO that is a consistency problem that is not easy to explain to Pinot customers. Pinot doesn't include documentation about this operation and how it changes depending on whether there is an index or not and I feel sorry for whoever has to write that doc, because it seems quite difficult to describe.
   
   IMHO a better solution would be to do not optimize `regexp_match` with indexes and to have another operation whose string argument was a Lucene expression.
   
   It is too late to do that, so let's focus on useful things. This PR is closed, but I'm going to cherry pick the like to regex optimization to add it in another PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145160683

   Wondering how much performance gain are we expecting here? `text_match` will use the Lucene text index to get the matching docs for the request. Essentially the comparison is between 2 simple requests vs 1 combined request. Some perf number would be helpful to justify the change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154279956

   @Jackie-Jiang one thing is that the native index aggressively tries to prune branches, both while building the query automaton and during the traversal.
   
    That is also what causes general suffix and prefix queries to be much faster. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154272773

   I will check out the branch and give some comments tomorrow 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz closed pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz closed pull request #8818: regexp_like fusing
URL: https://github.com/apache/pinot/pull/8818


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155183347

   @gortiz are you creating a FST index or a text index? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152679541

   Yeah, the REGEXP_LIKE translation from LIKE is not equivalent.
   
   Also, remember that any FST traversal will benefit most from a selective expression rather than a full term match.
   
   Lastly, why do you say that indices don't help for single regexes? Unless I am misreading the benchmark result, even for a single regex case, native index is pretty fast? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145648604

   > I thought this was supposed to be targeting regexp_like
   
   My bad. I looked for regexp in [the docs](https://docs.pinot.apache.org/configuration-reference/functions) and given that only found regexpExact, I thought it was text_match the operation we want to optimize.
   
   Anyway, the idea is the same, so most of the code can be reused 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] codecov-commenter commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1144753999

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#8818](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1a54b02) into [master](https://codecov.io/gh/apache/pinot/commit/b597226592e0b0cc26b423c95d9479cc39cef6ad?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b597226) will **decrease** coverage by `42.74%`.
   > The diff coverage is `22.18%`.
   
   > :exclamation: Current head 1a54b02 differs from pull request most recent head f6d6db3. Consider uploading reports for the commit f6d6db3 to get more accurate results
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #8818       +/-   ##
   =============================================
   - Coverage     69.72%   26.97%   -42.75%     
   + Complexity     4621        1     -4620     
   =============================================
     Files          1735     1725       -10     
     Lines         91320    91055      -265     
     Branches      13644    13616       -28     
   =============================================
   - Hits          63670    24560    -39110     
   - Misses        23228    64157    +40929     
   + Partials       4422     2338     -2084     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `26.97% <22.18%> (-0.01%)` | :arrow_down: |
   | integration2 | `?` | |
   | unittests1 | `?` | |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...e/pinot/common/utils/FileUploadDownloadClient.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvRmlsZVVwbG9hZERvd25sb2FkQ2xpZW50LmphdmE=) | `46.28% <ø> (-13.15%)` | :arrow_down: |
   | [...ler/api/resources/TableConfigsRestletResource.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9hcGkvcmVzb3VyY2VzL1RhYmxlQ29uZmlnc1Jlc3RsZXRSZXNvdXJjZS5qYXZh) | `0.00% <0.00%> (-75.44%)` | :arrow_down: |
   | [...ava/org/apache/pinot/core/plan/FilterPlanNode.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9wbGFuL0ZpbHRlclBsYW5Ob2RlLmphdmE=) | `48.09% <0.00%> (-34.08%)` | :arrow_down: |
   | [...t/core/query/optimizer/filter/FilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL0ZpbHRlck9wdGltaXplci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../optimizer/filter/FlattenAndOrFilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL0ZsYXR0ZW5BbmRPckZpbHRlck9wdGltaXplci5qYXZh) | `88.88% <ø> (ø)` | |
   | [...ery/optimizer/filter/MergeEqInFilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL01lcmdlRXFJbkZpbHRlck9wdGltaXplci5qYXZh) | `72.83% <ø> (-19.76%)` | :arrow_down: |
   | [...optimizer/filter/TimePredicateFilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL1RpbWVQcmVkaWNhdGVGaWx0ZXJPcHRpbWl6ZXIuamF2YQ==) | `40.66% <ø> (-41.34%)` | :arrow_down: |
   | [...nt/creator/impl/inv/BitmapInvertedIndexWriter.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9pbnYvQml0bWFwSW52ZXJ0ZWRJbmRleFdyaXRlci5qYXZh) | `0.00% <0.00%> (-90.00%)` | :arrow_down: |
   | [...gment/index/readers/BitmapInvertedIndexReader.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3JlYWRlcnMvQml0bWFwSW52ZXJ0ZWRJbmRleFJlYWRlci5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...ache/pinot/segment/local/utils/IngestionUtils.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9Jbmdlc3Rpb25VdGlscy5qYXZh) | `0.00% <0.00%> (-28.24%)` | :arrow_down: |
   | ... and [1261 more](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [b597226...f6d6db3](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145804872

   I've changed the implementation and the description of the PR to be focused on `regexp_like` instead of `text_match`. 
   
   Anyway, I would like to have some benchmarks before this feature is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on code in PR #8818:
URL: https://github.com/apache/pinot/pull/8818#discussion_r894797921


##########
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkFuseRegexp.java:
##########
@@ -0,0 +1,343 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.perf;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.TimeUnit;
+import org.apache.commons.io.FileUtils;
+import org.apache.pinot.queries.BaseQueriesTest;
+import org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader;
+import org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl;
+import org.apache.pinot.segment.local.segment.index.loader.IndexLoadingConfig;
+import org.apache.pinot.segment.local.segment.readers.GenericRowRecordReader;
+import org.apache.pinot.segment.spi.IndexSegment;
+import org.apache.pinot.segment.spi.creator.SegmentGeneratorConfig;
+import org.apache.pinot.spi.config.table.FSTType;
+import org.apache.pinot.spi.config.table.FieldConfig;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.TableType;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.data.readers.RecordReader;
+import org.apache.pinot.spi.utils.builder.TableConfigBuilder;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.CompilerControl;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+import org.openjdk.jmh.runner.Runner;
+import org.openjdk.jmh.runner.options.OptionsBuilder;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@Fork(1)
+@Warmup(iterations = 5, time = 1)
+@Measurement(iterations = 5, time = 1)
+@State(Scope.Benchmark)
+public class BenchmarkFuseRegexp extends BaseQueriesTest {
+
+  private static final File INDEX_DIR = new File(FileUtils.getTempDirectory(), "BenchmarkFuseRegexp");
+  private static final String TABLE_NAME = "MyTable";
+  private static final String SEGMENT_NAME = "testSegment";
+  private static final String DOMAIN_NAMES_COL = "DOMAIN_NAMES";
+  private static final String INT_COL_NAME = "INT_COL";
+  private static final String NO_INDEX_STRING_COL_NAME = "NO_INDEX_COL";
+
+  @Param({"LUCENE", "NATIVE", "null"})
+  private String _fstType;
+  //@Param({"DOMAIN_NAMES LIKE '%domain<i>%'"})
+  String _predicate = "DOMAIN_NAMES LIKE '%domain<i>%'";
+  //@Param({"and", "or"})
+  String _conjunction = "or";
+//  @Param("2500000")
+  int _numRows = 2500000;
+//  @Param("1000")
+  int _intBaseValue = 1000;
+//  @Param({"100"})
+  int _numBlocks = 100;
+
+  private IndexSegment _indexSegment;
+  private Schema _schema;
+  private TableConfig _tableConfig;
+
+  @Setup(Level.Trial)
+  public void setUp()
+      throws Exception {
+    FileUtils.deleteQuietly(INDEX_DIR);
+    FSTType fstType = _fstType.equals("null") ? null : FSTType.valueOf(_fstType);
+    buildSegment(fstType);
+    IndexLoadingConfig indexLoadingConfig = new IndexLoadingConfig();
+    if (fstType != null) {
+      Set<String> fstIndexCols = new HashSet<>();
+      fstIndexCols.add(DOMAIN_NAMES_COL);
+      indexLoadingConfig.setFSTIndexColumns(fstIndexCols);
+      indexLoadingConfig.setFSTIndexType(fstType);
+    }
+    _indexSegment = ImmutableSegmentLoader.load(new File(INDEX_DIR, SEGMENT_NAME), indexLoadingConfig);
+
+    checkCorrectness();
+  }
+
+  @TearDown(Level.Trial)
+  public void tearDown() {
+    _indexSegment.destroy();
+    FileUtils.deleteQuietly(INDEX_DIR);
+    EXECUTOR_SERVICE.shutdownNow();
+  }
+
+  private List<GenericRow> createTestData(int numRows) {
+    List<GenericRow> rows = new ArrayList<>(numRows);
+    String[] domainNames = new String[]{
+        "www.domain1.com", "www.domain1.co.ab", "www.domain1.co.bc",
+        "www.domain1.co.cd", "www.sd.domain1.com", "www.sd.domain1.co.ab", "www.sd.domain1.co.bc",
+        "www.sd.domain1.co.cd", "www.domain2.com", "www.domain2.co.ab", "www.domain2.co.bc", "www.domain2.co.cd",
+        "www.sd.domain2.com", "www.sd.domain2.co.ab", "www.sd.domain2.co.bc", "www.sd.domain2.co.cd"
+    };
+    String[] noIndexData = new String[]{"test1", "test2", "test3", "test4", "test5"};
+    for (int i = 0; i < numRows; i++) {
+      String domain = domainNames[i % domainNames.length];
+      GenericRow row = new GenericRow();
+      row.putField(INT_COL_NAME, _intBaseValue + i);
+      row.putField(NO_INDEX_STRING_COL_NAME, noIndexData[i % noIndexData.length]);
+      row.putField(DOMAIN_NAMES_COL, domain);
+      rows.add(row);
+    }
+    return rows;
+  }
+
+  private void buildSegment(FSTType fstType)
+      throws Exception {
+    List<GenericRow> rows = createTestData(_numRows);
+    List<FieldConfig> fieldConfigs = new ArrayList<>();
+    fieldConfigs.add(
+        new FieldConfig(DOMAIN_NAMES_COL, FieldConfig.EncodingType.DICTIONARY, FieldConfig.IndexType.FST, null, null));
+    _tableConfig = new TableConfigBuilder(TableType.OFFLINE).setTableName(TABLE_NAME)
+        .setInvertedIndexColumns(Collections.singletonList(DOMAIN_NAMES_COL)).setFieldConfigList(fieldConfigs).build();
+    _schema = new Schema.SchemaBuilder().setSchemaName(TABLE_NAME)
+        .addSingleValueDimension(DOMAIN_NAMES_COL, FieldSpec.DataType.STRING)
+        .addSingleValueDimension(NO_INDEX_STRING_COL_NAME, FieldSpec.DataType.STRING)
+        .addMetric(INT_COL_NAME, FieldSpec.DataType.INT).build();
+    SegmentGeneratorConfig config = new SegmentGeneratorConfig(_tableConfig, _schema);
+    config.setOutDir(INDEX_DIR.getPath());
+    config.setTableName(TABLE_NAME);
+    config.setSegmentName(SEGMENT_NAME);
+    if (fstType != null) {
+      config.setFSTIndexType(fstType);
+    }
+    SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl();
+    try (RecordReader recordReader = new GenericRowRecordReader(rows)) {
+      driver.init(config, recordReader);
+      driver.build();
+    }
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void optimal1Like(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain0%'";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void optimal1Regex(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain0')";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void optimal10(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain\\d')";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void increasing10Like(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE "
+        + "DOMAIN_NAMES LIKE '%domain0%' or "
+        + "DOMAIN_NAMES LIKE '%domain1%' or "
+        + "DOMAIN_NAMES LIKE '%domain2%' or "
+        + "DOMAIN_NAMES LIKE '%domain3%' or "
+        + "DOMAIN_NAMES LIKE '%domain4%' or "
+        + "DOMAIN_NAMES LIKE '%domain5%' or "
+        + "DOMAIN_NAMES LIKE '%domain6%' or "
+        + "DOMAIN_NAMES LIKE '%domain7%' or "
+        + "DOMAIN_NAMES LIKE '%domain8%' or "
+        + "DOMAIN_NAMES LIKE '%domain9%'";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void increasing10Regex(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE "
+        + "regexp_like(DOMAIN_NAMES, '^.domain0.*$') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain1.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain2.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain3.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain4.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain5.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain6.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain7.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain8.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain9.*')";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void increasing10Fusing(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE "
+        + "regexp_like(DOMAIN_NAMES, '"
+        + "(?:^.*domain0.*$)|"
+        + "(?:^.*domain1.*$)|"
+        + "(?:^.*domain2.*$)|"
+        + "(?:^.*domain3.*$)|"
+        + "(?:^.*domain4.*$)|"
+        + "(?:^.*domain5.*$)|"
+        + "(?:^.*domain6.*$)|"
+        + "(?:^.*domain7.*$)|"
+        + "(?:^.*domain8.*$)|"
+        + "(?:^.*domain9.*$)')";
+
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void decreasing9Like(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE "
+        + "DOMAIN_NAMES LIKE '%domain9%' or "
+        + "DOMAIN_NAMES LIKE '%domain8%' or "
+        + "DOMAIN_NAMES LIKE '%domain7%' or "
+        + "DOMAIN_NAMES LIKE '%domain6%' or "
+        + "DOMAIN_NAMES LIKE '%domain5%' or "
+        + "DOMAIN_NAMES LIKE '%domain4%' or "
+        + "DOMAIN_NAMES LIKE '%domain3%' or "
+        + "DOMAIN_NAMES LIKE '%domain2%' or "
+        + "DOMAIN_NAMES LIKE '%domain1%'";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void decreasing9Regex(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE "
+        + "regexp_like(DOMAIN_NAMES, '^.domain9.*$') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain8.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain7.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain6.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain5.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain4.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain3.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain2.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain2.*')";
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Benchmark
+  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+  public void decreasing9Fusing(Blackhole bh) {
+    String query = "SELECT INT_COL FROM MyTable WHERE "
+        + "regexp_like(DOMAIN_NAMES, '"
+        + "(?:^.*domain9.*$)|"
+        + "(?:^.*domain8.*$)|"
+        + "(?:^.*domain7.*$)|"
+        + "(?:^.*domain6.*$)|"
+        + "(?:^.*domain5.*$)|"
+        + "(?:^.*domain4.*$)|"
+        + "(?:^.*domain3.*$)|"
+        + "(?:^.*domain2.*$)|"
+        + "(?:^.*domain1.*$)')";
+
+    bh.consume(getBrokerResponse(query));
+  }
+
+  @Override
+  protected String getFilter() {
+    return null;
+  }
+
+  @Override
+  protected IndexSegment getIndexSegment() {
+    return _indexSegment;
+  }
+
+  @Override
+  protected List<IndexSegment> getIndexSegments() {
+    return Collections.singletonList(_indexSegment);
+  }
+
+  public static void main(String[] args)
+      throws Exception {
+    new Runner(new OptionsBuilder().include(BenchmarkFuseRegexp.class.getSimpleName())
+//        .addProfiler(JavaFlightRecorderProfiler.class)
+        .build()).run();
+  }
+
+  private void checkCorrectness() {
+    String query1 = "SELECT INT_COL FROM MyTable WHERE "
+        + "regexp_like(DOMAIN_NAMES, '^.domain9.*$') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain8.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain7.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain6.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain5.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain4.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain3.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain2.*') or "
+        + "regexp_like(DOMAIN_NAMES, '^.domain2.*')";
+    String query2 = "SELECT INT_COL FROM MyTable WHERE "
+        + "regexp_like(DOMAIN_NAMES, '"
+        + "(?:^.*domain9.*$)|"
+        + "(?:^.*domain8.*$)|"
+        + "(?:^.*domain7.*$)|"
+        + "(?:^.*domain6.*$)|"
+        + "(?:^.*domain5.*$)|"
+        + "(?:^.*domain4.*$)|"
+        + "(?:^.*domain3.*$)|"
+        + "(?:^.*domain2.*$)|"
+        + "(?:^.*domain1.*$)')";
+
+    long total1 = getBrokerResponse(query1).getTotalDocs();
+    long total2 = getBrokerResponse(query2).getTotalDocs();
+
+    if (total1 != total2) {

Review Comment:
   This is not the correct check. This returns the total documents of the table, instead of the matching docs. `getNumDocsScanned` will return the documents matched



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154277210

   > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
   > 
   > ```
   > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
   > BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   > BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   > BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   > 
   > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
   > BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   > BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   > BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   > ```
   > 
   > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
   
   The difference will be due to determinization of the automaton, which the native implementation doesn’t do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156850809

@gortiz @atris Thanks for taking time looking into this issue. Even though we might not move forward with the query rewrite, we got several valuable insights from the benchmark.

With regard to the query syntax, +100 on we should keep the query behavior consistent with or without the FST index. Currently we try to convert java regex to Lucene search query using `RegexpPatternConverterUtils.regexpLikeToLuceneRegExp()`. The support is very limited though, and might only be able to cover the basic `LIKE` predicate.

We want to utilize the FST to accelerate the regex matching (`LIKE` and `REGEXP_LIKE`). There are 2 approaches:
- Add a new predicate type, e.g. `FST_LIKE` and take the Lucene dialect
- Make FST work on java regex

I personally prefer the second approach because it can improve the existing query performance, but it requires changes to the native FST, and we cannot support Lucene FST anymore. Or we can support both, and make the `FST_LIKE` only work with Lucene FST.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152641627

   @gortiz Thanks for the detailed analysis. This is awesome, and very interesting!
   
   > ## We cannot use this optimization with Lucene
   > The cost is too high when the expression gets complex. It even with ` java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states.` when the expression have 10 branches.
   > 
   
   Curious how native FST handles it, and why is it so much faster than the Lucene FST. cc @atris 
   
   > ## In general `COL like 'whatever'` expression is 2x-3x worse than `regexp_like(COL, 'similar expression')`
   > Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when `BaseQueriesTest.getBrokerResponse` is called (which is great, because we can test the difference).
   
   I think the reason is that it is not apple-to-apple comparison. `col LIKE 'abc'` is actually equivalent to `regexp_like(col, '^abc$')`
   
   > ## Pinot regex execute faster wildcards than a single values
   > It is quite clear in the optimal cases:
   > 
   > ```
   > -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   > BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
   > -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
   > BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op
   > ```
   > 
   > At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset.
   > 
   > ```
   > select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1
   > ```
   > 
   > Responds quite faster than
   > 
   > ```
   > select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1
   > ```
   > 
   > The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x.
   > 
   > I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it.
   
   Let's re-do the benchmark. I don't believe there can be 2000x difference. Also, this is very counter-intuitive, so I'd suggest let's figure out if it is actually from the java regex engine, or because of our code.
   
   > ### When there are several regexp in the actual expression, indexes are very useful.
   > Examples:
   > 
   > ```
   > BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
   > BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
   > BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op
   > 
   > BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
   > BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
   > BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op
   > ```
   
   I feel the inefficiency is from `regexp_like(DOMAIN_NAMES, 'domain0')` (exact match) which takes much longer than expected. We should check what is causing the huge performance difference


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154271827

   > 
   
   Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
   
   ```
   SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
   BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   
   SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
   BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   ```
   
   We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156103702

@gortiz There has been a long comment trail on this PR so I am summarising my responses in this comment:

1. Performance -- Like any other index, FST indices also have their worst case. FST is an automaton, and in an extreme scenario where a high number of branches are traversed for a query, the performance of the index will approach the brute force approximation.

In this case, the query you tried had an abnormally large number of branches, to the extent that Lucene decided to not even try and process the query. The average user rarely fires such queries, and in the corner cases where they do, they should be ok with a brute force latency, as long as their 99th percentile queries keep performing significantly better, which is the case as is evident from the benchmarks that test the FST index.

2. Language Semantics -- Lucene has, for long, published its regex dialect clearly. Engines based on Lucene, such as ElasticSearch and Apache Solr, follow the same dialect (https://help.oncrawl.com/en/articles/1685982-lucene-regex-cheat-sheet)

It is arguable whether this is correct or not, but we have to keep historical perspective in mind, since many Pinot users migrate from one of these engines.

Please note that, from native FST index's perspective, it is pretty simple to support java regex language. The reason I did not do that was because the existing FST index was based on Lucene and thus supported Lucene's regex dialect. For back compatibility, native FST index supports the same dialect.

3. Parsing And Reporting Invalid Regexes: I agree that in an ideal scenario, we should be parsing regexes and reporting characters that we do not accept. However, given the complexity of an average regex, the time complexity of doing this will be non trivial.

4. Specific To This PR: I looked into the PR and while the effort is appreciative, I do not see significant performance improvement. Also, the API surface changes required for this change are high, which makes me a bit uneasy.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154278541

   > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
   > > 
   > > ```
   > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
   > > BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   > > BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   > > BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   > > 
   > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
   > > BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   > > BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   > > BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   > > ```
   > > 
   > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
   > 
   > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
   
   I doubt determination alone can cause such a massive difference.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155193112

   @gortiz another thing to remember is that neither Lucene nor native FST index follow the java regex pattern language.
   
    If you look up the Lucene regex documentation, and look at the FST index and LIKE tests in Pinot, you will notice the difference.
   
   Unfortunately, since we do not actually parse the regex engine until deep inside the FST territory, we have no way of reporting unknown characters. That may be a factor leading to 0 results for your queries


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: text_match fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145276586

   I thought this was supposed to be targeting regexp_like 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145835768

   I didn't know LIKE expressions were evaluated as REGEX_LIKE. Reading the code, it seems that the transformation is made in `RequestContextUtils`, which is evaluated after the filter optimization is done. This means that (in case I don't miss something else) the optimizations included here will not affect LIKE expressions.
   
   I can modify the optimizer to add the ability to transform LIKE expressions into REGEX_LIKE the same way `RequestContextUtils` does right now. But I'm going to focus on the benchmark that tests the scenarios @richardstartin mentioned, but explicitly using REGEXP_LIKE for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152506922

   My last commit includes a JMH benchmark. I found several interesting things. Please take a look, some of my conclusions are so shocking that it wouldn't surprise me if another pair of eyes discovered that I did something wrong.
   
   TL;DR: I think we should apply this optimization only when executed in columns without indexes. This may make the implementation more difficult, but the performance gain may be very large.
   
   Benchmarks are:
   - optimal1Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'domain0'`
   - optimal1Regex: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain0')`
   - optimal10: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain\d')`
   - increasing10Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain0%' or DOMAIN_NAMES LIKE '%domain1%' or ... or DOMAIN_NAMES LIKE '%domain9%'`
   - increasing10Regex: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES LIKE, '^.domain0.*$') or ... or regexp_like(DOMAIN_NAMES LIKE, '^.domain9.*$')`
   - increasing10Fusing: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')`
   - decreasing9Like, decreasing9Regex and decreasing9Fusing: Are like `increasing10Like`, `increasing10Regex` and `increasing10Fusing` but going from 9 to 1
      - Note: they have in fact 9 elements. This a mistake. Initially queries where created dynamically and I did it wrong. But it was a lucky mistake that shows that Lucene fails when there are 10 fused expression but it doesn't fail when there are 9.
   
   Last results I´ve got were:
   ```
   Benchmark                               (_fstType)  Mode  Cnt     Score      Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   BenchmarkFuseRegexp.decreasing9Like         LUCENE  avgt    5     0.615 ±    0.056  ms/op
   BenchmarkFuseRegexp.decreasing9Like         NATIVE  avgt    5     0.515 ±    0.111  ms/op
   BenchmarkFuseRegexp.decreasing9Like           null  avgt    5  3596.864 ± 1335.035  ms/op
   BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
   BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
   BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op
   BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   BenchmarkFuseRegexp.increasing10Like        LUCENE  avgt    5     0.696 ±    0.121  ms/op
   BenchmarkFuseRegexp.increasing10Like        NATIVE  avgt    5     0.540 ±    0.049  ms/op
   BenchmarkFuseRegexp.increasing10Like          null  avgt    5  3543.339 ±  183.168  ms/op
   BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
   BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
   BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op
   BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    0.066  ms/op
   BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    0.011  ms/op
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
   BenchmarkFuseRegexp.optimal1Like            LUCENE  avgt    5     0.179 ±    0.066  ms/op
   BenchmarkFuseRegexp.optimal1Like            NATIVE  avgt    5     0.134 ±    0.029  ms/op
   BenchmarkFuseRegexp.optimal1Like              null  avgt    5   470.387 ±  187.235  ms/op
   BenchmarkFuseRegexp.optimal1Regex           LUCENE  avgt    5     0.172 ±    0.029  ms/op
   BenchmarkFuseRegexp.optimal1Regex           NATIVE  avgt    5     0.141 ±    0.027  ms/op
   BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op
   ```
   
   After analyzing it, there are some points that are quite clear:
   
   ## We cannot use this optimization with Lucene
   
   The cost is too high when the expression gets complex. It even with `
   java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states.` when the expression have 10 branches.
   
   ## In general `COL like 'whatever'` expression is 2x-3x worse than `regexp_like(COL, 'similar expression')`
   
   Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when `BaseQueriesTest.getBrokerResponse` is called (which is great, because we can test the difference).
   
   I've also tried the controller UI interface and this issue is repeated there. Using the github_events dataset, 
   ```
   select * from githubEvents where regexp_like(actor_url, '^.*victor\d.*$') limit 1
   ```
   is quite faster than
   ```
   select * from githubEvents where actor_url LIKE '%victor7%' limit 1
   ```
   
   ## Pinot regex execute faster wildcards than a single values
   
   It is quite clear in the optimal cases:
   ```
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
   BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op
   ```
   
   At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset.
   ```
   select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1
   ```
   Responds quite faster than
    
   ```
   select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1
   ```
   
   The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x.
   
   I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it.
   
   ## Native or Lucene indexes don't actually increase the performance of this regexp
   
   This benchmark wasn't focused on that and more studies should be done, but it seems clear that:
   
   ### When there are several regexp in the actual expression, indexes are very useful.
   
   Examples:
   ```
   BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
   BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
   BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op
   
   BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
   BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
   BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op
   ```
   
   - When there is only one regexp in the actual expressions, indexes are not very useful.
   For example:
   ```
   -- this actually has a single, quite complex regex expression
   BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   -- this actually has a single, quite complex regex expression
   BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    0.066  ms/op
   BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    0.011  ms/op
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
   BenchmarkFuseRegexp.optimal1Regex           LUCENE  avgt    5     0.172 ±    0.029  ms/op
   BenchmarkFuseRegexp.optimal1Regex           NATIVE  avgt    5     0.141 ±    0.027  ms/op
   BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op
   ```
   
   The last one shows a big difference, but it may be related to the issue with regex without wildcards that I discussed above.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154336975

   > > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
   > > > ```
   > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
   > > > BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   > > > BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   > > > BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   > > > 
   > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
   > > > BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   > > > BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   > > > BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   > > > ```
   > > > 
   > > > 
   > > >     
   > > >       
   > > >     
   > > > 
   > > >       
   > > >     
   > > > 
   > > >     
   > > >   
   > > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
   > > 
   > > 
   > > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
   > 
   > I doubt determinization alone can cause such a massive difference.
   
   Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154343989

   > > > > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
   > > > > > ```
   > > > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
   > > > > > BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   > > > > > BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   > > > > > BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   > > > > > 
   > > > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
   > > > > > BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   > > > > > BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   > > > > > BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   > > > > > ```
   > > > > > 
   > > > > > 
   > > > > >     
   > > > > >       
   > > > > >     
   > > > > > 
   > > > > >       
   > > > > >     
   > > > > > 
   > > > > >     
   > > > > >   
   > > > > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
   > > > > 
   > > > > 
   > > > > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
   > > > 
   > > > 
   > > > I doubt determinization alone can cause such a massive difference.
   > > 
   > > 
   > > Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)
   > 
   > I am still failing to understand what the concern is.
   > 
   > Even if its plain determinization that causes native to be significantly faster, is that not a good thing?
   
   @Jackie-Jiang was asking about the root cause of the difference. I suggested that the root cause is likely to be determinization, but it needs to be investigated. Where has concern been expressed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing

Posted by GitBox <gi...@apache.org>.

atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154338406

   > > > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
   > > > > ```
   > > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
   > > > > BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
   > > > > BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
   > > > > BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
   > > > > 
   > > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
   > > > > BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   > > > > BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
   > > > > BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
   > > > > ```
   > > > > 
   > > > > 
   > > > >     
   > > > >       
   > > > >     
   > > > > 
   > > > >       
   > > > >     
   > > > > 
   > > > >     
   > > > >   
   > > > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
   > > > 
   > > > 
   > > > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
   > > 
   > > I doubt determinization alone can cause such a massive difference.
   > 
   > Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)
   
   I am still failing to understand what the concern is. 
   
   Even if its plain determinization that causes native to be significantly faster, is that not a good thing? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org