You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/06/02 10:34:31 UTC
[GitHub] [pinot] gortiz opened a new pull request, #8818: text_match fusing
gortiz opened a new pull request, #8818:
URL: https://github.com/apache/pinot/pull/8818
When executing queries like:
```sql
select col1, col2 from Table where text_match(col1, '/r1/') or text_match(col1, '/r2/')
```
Pinot has to scan the referred column twice. This PR creates an optimization that tries to fuse boolean algebra and `text_match` predicates. Specifically, as indicated in the Javadoc:
- Queries `where text_match(col1, '/r1/') and text_match(col1, '/r2/')` will be translated to `where text_match(col1, '/(?=r1)(?=r2)/')`
- Queries `where text_match(col1, '/r1/') or text_match(col1, '/r2/')` will be translated to `where text_match(col1, '/(?:r1)|(?:r2)/')`
- Queries `where not text_match(col1, '/r1/')` will be translated to `where text_match(col1, '/(?!r1)/')`
There are some tests that apply the optimization to more advanced cases.
Regex can be quite complex and it isn't clear to me how expressive is the regex language we (and lucene) support. By doing some analysis I'm sure that this optimization will break some regex like the ones that use backreferences.
To know if the optimization can be applied or not, it would be necessary to analyze the regex, which would require to parse the regex into a AST. As far as I know Pinot doesn't have that, so this optimization is disabled by default and can be enabled by activating a new query option.
Future improvements may include to translate some `text_match` predicated that are not regex to regex, so we would be able to merge things like `where text_match(col1, '/r1/') or text_match(col1, '"literal1" and "literal2"')` into `where text_match(col1, '/(?:r1)|(?=(?:literal1)|(?:literal2))/')`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145782478
ORMs are not going to generate them, but a user may try to do something like `regexp_like(col1, '(.+)=\1')`, which will be positive in texts like `a=a` but false in `b=a`. I think this is very bizarre, but possible.
What I'm going to do is to try to detect the usage of backreferences in the regexp. To do so, I'm going to discard any `regexp_like` whose predicate matches with `.*(?:(?:\\[0-9]+)|(?:\\k\<\w+\>)).*`. The function may have false positives (for example the regexp`\\1` will be a false positive) and false negatives (where is more difficult to find examples, maybe playing with utf8?), but it may be good enough to activate the query option by default in a future release.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154392815
Just circling back here having taken a quick look at `BenchmarkFuseRegexp.decreasing9Fusing` - I got similar results to Gonzalo:
```
Benchmark (_fstType) Mode Cnt Score Error Units
BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 32.525 ± 2.733 ms/op
BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.487 ± 0.123 ms/op
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.845 ± 0.204 ms/op
```
Lucene spends virtually all of its time in _minimization_ (which I mistakenly referred to as _determinization_ earlier):
<img width="936" alt="Screenshot 2022-06-13 at 21 10 51" src="https://user-images.githubusercontent.com/16439049/173436813-a13a5830-ad7b-4a8c-8e5f-3cf8003640df.png">
Now if I enable minimization in the native implementation by reverting #8237, the scores change drastically:
```
Benchmark (_fstType) Mode Cnt Score Error Units
BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 32.399 ± 3.005 ms/op
BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 366.520 ± 40.574 ms/op
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.819 ± 0.172 ms/op
```
As do the profiles. Without minimization, there is no clear single bottleneck:
<img width="879" alt="Screenshot 2022-06-13 at 21 14 47" src="https://user-images.githubusercontent.com/16439049/173437464-980859ce-e7c5-4e8f-b43e-ba609ab39396.png">
With minimization, there is a single bottleneck, and it's in minimization :
<img width="931" alt="Screenshot 2022-06-13 at 21 23 10" src="https://user-images.githubusercontent.com/16439049/173438712-5f8efc36-1f8d-4ea5-8bb6-b135d21ff4fc.png">
As to whether this is cause for concern is beside the point; this is the root cause of the difference observed, rather than a quirk of the benchmark which needs to be analysed and removed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145792924
> ORMs are not going to generate them
On the contrary, this exact pattern has been observed generated by customer systems querying Pinot
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155165270
I was able to invest more time in this topic today and I have to say that my initial benchmark was completely useless. Apart from some typos in the queries (which may have some impact), there were two problems that completely fooled me:
- The implicit limit introduced by Pinot. As only the first 10 elements were returned, queries that actually found elements could finish faster.
- Something is wrong with Lucene and Native indexes. I guess I'm not configuring them correctly, but they simply return 0 results when the regexp is a little bit complex (like having groups).
Therefore I've changed all the tests to return a count and each iteration is actually verifying whether the expected value is returned or not, in which case the benchmark is stopped. As a result, all fusing tests fail whenever there is an index. I hope it is something related to my test (ie I didn't configure the index properly) and not some semantic discrepancy these indexes have in relation how each engine evaluates regexp, as that could mean that there are customers receiving false results.
These are the results I've got. Note that some combinations are not shown. For example, decreasing9Fusing and LUCENE or NATIVE. That means that these combinations did not return what it was expected (in all these particular cases it means that they return 0 rows). Note that apart from the fusing case, benchmarks like `optimal10`, which is `regexp_like(DOMAIN_NAMES, 'domain\d')` did also failed.
```
Benchmark (_fstType) Mode Cnt Score Error Units
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 1438.185 ± 73.963 ms/op
BenchmarkFuseRegexp.decreasing9Like LUCENE avgt 5 41.927 ± 0.984 ms/op
BenchmarkFuseRegexp.decreasing9Like NATIVE avgt 5 40.232 ± 5.597 ms/op
BenchmarkFuseRegexp.decreasing9Like null avgt 5 3957.902 ± 154.228 ms/op
BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 40.744 ± 2.059 ms/op
BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 61.443 ± 1.301 ms/op
BenchmarkFuseRegexp.decreasing9Regex null avgt 5 2116.747 ± 602.570 ms/op
BenchmarkFuseRegexp.increasing10Fusing null avgt 5 1320.869 ± 45.509 ms/op
BenchmarkFuseRegexp.increasing10Like LUCENE avgt 5 42.253 ± 0.695 ms/op
BenchmarkFuseRegexp.increasing10Like NATIVE avgt 5 39.049 ± 1.829 ms/op
BenchmarkFuseRegexp.increasing10Like null avgt 5 5071.207 ± 3134.887 ms/op
BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 44.084 ± 2.737 ms/op
BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 39.650 ± 1.567 ms/op
BenchmarkFuseRegexp.increasing10Regex null avgt 5 2318.479 ± 75.632 ms/op
BenchmarkFuseRegexp.optimal10 null avgt 5 185.757 ± 3.426 ms/op
BenchmarkFuseRegexp.optimal10NotFound LUCENE avgt 5 0.212 ± 0.136 ms/op
BenchmarkFuseRegexp.optimal10NotFound NATIVE avgt 5 0.201 ± 0.140 ms/op
BenchmarkFuseRegexp.optimal10NotFound null avgt 5 248.525 ± 60.819 ms/op
BenchmarkFuseRegexp.optimal1Like LUCENE avgt 5 16.431 ± 3.533 ms/op
BenchmarkFuseRegexp.optimal1Like NATIVE avgt 5 16.586 ± 0.828 ms/op
BenchmarkFuseRegexp.optimal1Like null avgt 5 468.731 ± 185.623 ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound LUCENE avgt 5 0.197 ± 0.062 ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound NATIVE avgt 5 0.170 ± 0.072 ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound null avgt 5 477.176 ± 129.445 ms/op
BenchmarkFuseRegexp.optimal1Regex LUCENE avgt 5 15.485 ± 0.517 ms/op
BenchmarkFuseRegexp.optimal1Regex NATIVE avgt 5 16.711 ± 0.857 ms/op
BenchmarkFuseRegexp.optimal1Regex null avgt 5 205.676 ± 6.921 ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound LUCENE avgt 5 0.212 ± 0.123 ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound NATIVE avgt 5 0.186 ± 0.029 ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound null avgt 5 239.459 ± 13.846 ms/op
BenchmarkFuseRegexp.selective2Fusing null avgt 5 541.530 ± 168.173 ms/op
BenchmarkFuseRegexp.selective2Like LUCENE avgt 5 19.961 ± 0.937 ms/op
BenchmarkFuseRegexp.selective2Like NATIVE avgt 5 20.387 ± 1.953 ms/op
BenchmarkFuseRegexp.selective2Like null avgt 5 916.620 ± 96.702 ms/op
BenchmarkFuseRegexp.selective2Regex LUCENE avgt 5 23.366 ± 0.795 ms/op
BenchmarkFuseRegexp.selective2Regex NATIVE avgt 5 20.037 ± 0.601 ms/op
BenchmarkFuseRegexp.selective2Regex null avgt 5 389.965 ± 7.284 ms/op
```
I've also changed `RegexpPatternConverterUtils` (and its relative test) to do not introduce useless `^.*` at the beginning of the expression nor `.*$` at the end. I've repeated the benchmark with that change and these are the results I've got. As you can see, most benchmarks didn't change that much, but the ones that use like and have no index are twice as fast, so they achieve the same numbers than their equivalent `optimalXRegex`, as expected.
```
Benchmark (_fstType) Mode Cnt Score Error Units
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 1365.403 ± 91.822 ms/op
BenchmarkFuseRegexp.decreasing9Like LUCENE avgt 5 40.553 ± 1.394 ms/op
BenchmarkFuseRegexp.decreasing9Like NATIVE avgt 5 39.057 ± 0.905 ms/op
BenchmarkFuseRegexp.decreasing9Like null avgt 5 2046.090 ± 40.428 ms/op
BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 42.584 ± 2.387 ms/op
BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 61.583 ± 1.893 ms/op
BenchmarkFuseRegexp.decreasing9Regex null avgt 5 2054.523 ± 75.452 ms/op
BenchmarkFuseRegexp.increasing10Fusing null avgt 5 1305.249 ± 65.576 ms/op
BenchmarkFuseRegexp.increasing10Like LUCENE avgt 5 43.328 ± 2.040 ms/op
BenchmarkFuseRegexp.increasing10Like NATIVE avgt 5 61.580 ± 1.149 ms/op
BenchmarkFuseRegexp.increasing10Like null avgt 5 2272.112 ± 99.393 ms/op
BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 54.319 ± 2.048 ms/op
BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 39.080 ± 1.817 ms/op
BenchmarkFuseRegexp.increasing10Regex null avgt 5 2267.062 ± 49.525 ms/op
BenchmarkFuseRegexp.optimal10 null avgt 5 187.012 ± 13.054 ms/op
BenchmarkFuseRegexp.optimal10NotFound LUCENE avgt 5 0.210 ± 0.103 ms/op
BenchmarkFuseRegexp.optimal10NotFound NATIVE avgt 5 0.201 ± 0.066 ms/op
BenchmarkFuseRegexp.optimal10NotFound null avgt 5 242.423 ± 17.612 ms/op
BenchmarkFuseRegexp.optimal1Like LUCENE avgt 5 16.395 ± 0.884 ms/op
BenchmarkFuseRegexp.optimal1Like NATIVE avgt 5 15.506 ± 0.591 ms/op
BenchmarkFuseRegexp.optimal1Like null avgt 5 209.389 ± 11.456 ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound LUCENE avgt 5 0.209 ± 0.129 ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound NATIVE avgt 5 0.172 ± 0.021 ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound null avgt 5 245.542 ± 35.323 ms/op
BenchmarkFuseRegexp.optimal1Regex LUCENE avgt 5 15.421 ± 0.804 ms/op
BenchmarkFuseRegexp.optimal1Regex NATIVE avgt 5 15.908 ± 0.491 ms/op
BenchmarkFuseRegexp.optimal1Regex null avgt 5 204.505 ± 13.134 ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound LUCENE avgt 5 0.210 ± 0.089 ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound NATIVE avgt 5 0.203 ± 0.107 ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound null avgt 5 235.262 ± 8.533 ms/op
BenchmarkFuseRegexp.selective2Fusing null avgt 5 517.545 ± 193.977 ms/op
BenchmarkFuseRegexp.selective2Like LUCENE avgt 5 23.622 ± 1.178 ms/op
BenchmarkFuseRegexp.selective2Like NATIVE avgt 5 20.547 ± 0.778 ms/op
BenchmarkFuseRegexp.selective2Like null avgt 5 399.638 ± 54.746 ms/op
BenchmarkFuseRegexp.selective2Regex LUCENE avgt 5 23.883 ± 0.802 ms/op
BenchmarkFuseRegexp.selective2Regex NATIVE avgt 5 20.113 ± 0.858 ms/op
BenchmarkFuseRegexp.selective2Regex null avgt 5 416.987 ± 28.789 ms/op
```
With this results, the optimization introduced in this PR doesn't seem to be amazing, but to be honest I'm quite worried about what is going on with regexp and Lucene/Native indexes. As said above, I hope that the benchmark is not correctly configuring the indexes, but even with that, it seems very dangerous that an index can be configured in such a way that it affects the semantics of the queries.
@Jackie-Jiang @atris I would really appreciate if any of you can take a look at the benchmark trying to find what is going on.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145708838
I don’t think it’s possible to express backtracking behaviour in a LIKE statement, this optimisation would target long unions of LIKE statements generated by ORMs, and should be most effective in reducing the impact on a cluster when there is no text or FST index in place. I expect this would also help FST based regular expression evaluation, so getting some numbers to justify extending to FST regex evaluators would be great.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154813143
> Lastly, why do you say that indices don't help for single regexes? Unless I am misreading the benchmark result, even for a single regex case, native index is pretty fast?
My benchmark may not be the best and it is only measuring a few regexp cases, but I wouldn't say that native gives us a significant advantage:
```
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op <-- here native is faster by 1.8x
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op <-- here native is slower by 1.4x, although the difference is lower than the error margin
BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10 LUCENE avgt 5 0.169 ± 0.066 ms/op
BenchmarkFuseRegexp.optimal10 NATIVE avgt 5 0.136 ± 0.011 ms/op <-- here native is faster by 1.1x, although the difference is lower than the error margin
BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op
```
So we have one case where native is 1.8x faster and some others were it is equivalent to the java regex engine, which is known by its lack of performance. I mean, it is not like the native index is atrocious, but I would expect more from an index than a situational 1.8x performance increase.
To be clear, as I already said, this is not a benchmark on the FST itself and it is very narrow, so I don't want to get conclusions about whether we should or we should not use these indexes. I think this benchmark proves that there are very specific cases where these indexes are not very useful and therefore it would be nice to invest time to design proper tests with different kind of regex to verify whether FST indexes are in general useful evaluating regex or not.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156032082
> @gortiz another thing to remember is that neither Lucene nor native FST index follow the java regex pattern language.
>
> If you look up the Lucene regex documentation, and look at the FST index and LIKE tests in Pinot, you will notice the difference.
>
> Unfortunately, since we do not actually parse the regex engine until deep inside the FST territory, we have no way of reporting unknown characters. That may be a factor leading to 0 results for your queries
Wow, that is shocking and I would see that as a huge problem to use Pinot in a big team. In cases like that I would expect to have a new operation that does the search using the Lucene expression language or whatever we want to use and fail if there is no index. What you are saying is that if I have a column that is not indexed and suddenly a colleague of mine decides to index that column, queries I had will start to return different (and incorrect) results.
Anyway, if that is what was decided, that is what we have. I'm going to close this PR, as it doesn't make sense to apply this optimization if it can be produce false results when someone adds an index.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on a diff in pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on code in PR #8818:
URL: https://github.com/apache/pinot/pull/8818#discussion_r888712252
##########
pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java:
##########
@@ -255,6 +255,9 @@ private BaseFilterOperator constructPhysicalOperator(FilterContext filter, int n
|| textIndexReader instanceof NativeMutableTextIndex) {
throw new UnsupportedOperationException("TEXT_MATCH is not supported on native text index");
}
+ if (textIndexReader == null) {
+ throw new UnsupportedOperationException("TEXT_MATCH is not supported when there is no text index");
Review Comment:
This moves the correctness check from execution to parsing time, which should be better, but makes `ExplainPlanQueriesTest` fail.
As indicated by its name, this tests does execute `explain plan` instead of the actual query. This is why the test didn't fail before. As this PR shouldn't be related to `text_match` but `regexp_like`, I'm going to remove this change for simplicity
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145814759
I don’t believe the native FST supports backreferences, and I don’t think it would be easy to add support either. Not sure about the lucene FST index. As mentioned above, LIKE statements (e.g. LIKE ‘%foo%’) can’t express all regular expression features, which should reduce the scope of the edge cases to be considered.
This definitely needs benchmarking for at least the following scenarios;
* Single like statement
* N like statements combined with OR
for each of:
* no FST index - both raw and dictionarized
* Lucene FST index
* Native FST index
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156125315
Yeah, I completely understand that we should maintain the Lucene language when Natife FST was created. My concern is with the fact that when Lucene index was added, we decided to accept a different language in `regexp_match`. IMHO that is a consistency problem that is not easy to explain to Pinot customers. Pinot doesn't include documentation about this operation and how it changes depending on whether there is an index or not and I feel sorry for whoever has to write that doc, because it seems quite difficult to describe.
IMHO a better solution would be to do not optimize `regexp_match` with indexes and to have another operation whose string argument was a Lucene expression.
It is too late to do that, so let's focus on useful things. This PR is closed, but I'm going to cherry pick the like to regex optimization to add it in another PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145160683
Wondering how much performance gain are we expecting here? `text_match` will use the Lucene text index to get the matching docs for the request. Essentially the comparison is between 2 simple requests vs 1 combined request. Some perf number would be helpful to justify the change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154279956
@Jackie-Jiang one thing is that the native index aggressively tries to prune branches, both while building the query automaton and during the traversal.
That is also what causes general suffix and prefix queries to be much faster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154272773
I will check out the branch and give some comments tomorrow
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz closed pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz closed pull request #8818: regexp_like fusing
URL: https://github.com/apache/pinot/pull/8818
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155183347
@gortiz are you creating a FST index or a text index?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152679541
Yeah, the REGEXP_LIKE translation from LIKE is not equivalent.
Also, remember that any FST traversal will benefit most from a selective expression rather than a full term match.
Lastly, why do you say that indices don't help for single regexes? Unless I am misreading the benchmark result, even for a single regex case, native index is pretty fast?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145648604
> I thought this was supposed to be targeting regexp_like
My bad. I looked for regexp in [the docs](https://docs.pinot.apache.org/configuration-reference/functions) and given that only found regexpExact, I thought it was text_match the operation we want to optimize.
Anyway, the idea is the same, so most of the code can be reused
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] codecov-commenter commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1144753999
# [Codecov](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#8818](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1a54b02) into [master](https://codecov.io/gh/apache/pinot/commit/b597226592e0b0cc26b423c95d9479cc39cef6ad?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b597226) will **decrease** coverage by `42.74%`.
> The diff coverage is `22.18%`.
> :exclamation: Current head 1a54b02 differs from pull request most recent head f6d6db3. Consider uploading reports for the commit f6d6db3 to get more accurate results
```diff
@@ Coverage Diff @@
## master #8818 +/- ##
=============================================
- Coverage 69.72% 26.97% -42.75%
+ Complexity 4621 1 -4620
=============================================
Files 1735 1725 -10
Lines 91320 91055 -265
Branches 13644 13616 -28
=============================================
- Hits 63670 24560 -39110
- Misses 23228 64157 +40929
+ Partials 4422 2338 -2084
```
| Flag | Coverage Δ | |
|---|---|---|
| integration1 | `26.97% <22.18%> (-0.01%)` | :arrow_down: |
| integration2 | `?` | |
| unittests1 | `?` | |
| unittests2 | `?` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [...e/pinot/common/utils/FileUploadDownloadClient.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vdXRpbHMvRmlsZVVwbG9hZERvd25sb2FkQ2xpZW50LmphdmE=) | `46.28% <ø> (-13.15%)` | :arrow_down: |
| [...ler/api/resources/TableConfigsRestletResource.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9hcGkvcmVzb3VyY2VzL1RhYmxlQ29uZmlnc1Jlc3RsZXRSZXNvdXJjZS5qYXZh) | `0.00% <0.00%> (-75.44%)` | :arrow_down: |
| [...ava/org/apache/pinot/core/plan/FilterPlanNode.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9wbGFuL0ZpbHRlclBsYW5Ob2RlLmphdmE=) | `48.09% <0.00%> (-34.08%)` | :arrow_down: |
| [...t/core/query/optimizer/filter/FilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL0ZpbHRlck9wdGltaXplci5qYXZh) | `0.00% <0.00%> (ø)` | |
| [.../optimizer/filter/FlattenAndOrFilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL0ZsYXR0ZW5BbmRPckZpbHRlck9wdGltaXplci5qYXZh) | `88.88% <ø> (ø)` | |
| [...ery/optimizer/filter/MergeEqInFilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL01lcmdlRXFJbkZpbHRlck9wdGltaXplci5qYXZh) | `72.83% <ø> (-19.76%)` | :arrow_down: |
| [...optimizer/filter/TimePredicateFilterOptimizer.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9xdWVyeS9vcHRpbWl6ZXIvZmlsdGVyL1RpbWVQcmVkaWNhdGVGaWx0ZXJPcHRpbWl6ZXIuamF2YQ==) | `40.66% <ø> (-41.34%)` | :arrow_down: |
| [...nt/creator/impl/inv/BitmapInvertedIndexWriter.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9pbnYvQml0bWFwSW52ZXJ0ZWRJbmRleFdyaXRlci5qYXZh) | `0.00% <0.00%> (-90.00%)` | :arrow_down: |
| [...gment/index/readers/BitmapInvertedIndexReader.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3JlYWRlcnMvQml0bWFwSW52ZXJ0ZWRJbmRleFJlYWRlci5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
| [...ache/pinot/segment/local/utils/IngestionUtils.java](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9Jbmdlc3Rpb25VdGlscy5qYXZh) | `0.00% <0.00%> (-28.24%)` | :arrow_down: |
| ... and [1261 more](https://codecov.io/gh/apache/pinot/pull/8818/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [b597226...f6d6db3](https://codecov.io/gh/apache/pinot/pull/8818?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145804872
I've changed the implementation and the description of the PR to be focused on `regexp_like` instead of `text_match`.
Anyway, I would like to have some benchmarks before this feature is merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #8818:
URL: https://github.com/apache/pinot/pull/8818#discussion_r894797921
##########
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkFuseRegexp.java:
##########
@@ -0,0 +1,343 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.perf;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.TimeUnit;
+import org.apache.commons.io.FileUtils;
+import org.apache.pinot.queries.BaseQueriesTest;
+import org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader;
+import org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl;
+import org.apache.pinot.segment.local.segment.index.loader.IndexLoadingConfig;
+import org.apache.pinot.segment.local.segment.readers.GenericRowRecordReader;
+import org.apache.pinot.segment.spi.IndexSegment;
+import org.apache.pinot.segment.spi.creator.SegmentGeneratorConfig;
+import org.apache.pinot.spi.config.table.FSTType;
+import org.apache.pinot.spi.config.table.FieldConfig;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.TableType;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.data.readers.RecordReader;
+import org.apache.pinot.spi.utils.builder.TableConfigBuilder;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.CompilerControl;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+import org.openjdk.jmh.runner.Runner;
+import org.openjdk.jmh.runner.options.OptionsBuilder;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@Fork(1)
+@Warmup(iterations = 5, time = 1)
+@Measurement(iterations = 5, time = 1)
+@State(Scope.Benchmark)
+public class BenchmarkFuseRegexp extends BaseQueriesTest {
+
+ private static final File INDEX_DIR = new File(FileUtils.getTempDirectory(), "BenchmarkFuseRegexp");
+ private static final String TABLE_NAME = "MyTable";
+ private static final String SEGMENT_NAME = "testSegment";
+ private static final String DOMAIN_NAMES_COL = "DOMAIN_NAMES";
+ private static final String INT_COL_NAME = "INT_COL";
+ private static final String NO_INDEX_STRING_COL_NAME = "NO_INDEX_COL";
+
+ @Param({"LUCENE", "NATIVE", "null"})
+ private String _fstType;
+ //@Param({"DOMAIN_NAMES LIKE '%domain<i>%'"})
+ String _predicate = "DOMAIN_NAMES LIKE '%domain<i>%'";
+ //@Param({"and", "or"})
+ String _conjunction = "or";
+// @Param("2500000")
+ int _numRows = 2500000;
+// @Param("1000")
+ int _intBaseValue = 1000;
+// @Param({"100"})
+ int _numBlocks = 100;
+
+ private IndexSegment _indexSegment;
+ private Schema _schema;
+ private TableConfig _tableConfig;
+
+ @Setup(Level.Trial)
+ public void setUp()
+ throws Exception {
+ FileUtils.deleteQuietly(INDEX_DIR);
+ FSTType fstType = _fstType.equals("null") ? null : FSTType.valueOf(_fstType);
+ buildSegment(fstType);
+ IndexLoadingConfig indexLoadingConfig = new IndexLoadingConfig();
+ if (fstType != null) {
+ Set<String> fstIndexCols = new HashSet<>();
+ fstIndexCols.add(DOMAIN_NAMES_COL);
+ indexLoadingConfig.setFSTIndexColumns(fstIndexCols);
+ indexLoadingConfig.setFSTIndexType(fstType);
+ }
+ _indexSegment = ImmutableSegmentLoader.load(new File(INDEX_DIR, SEGMENT_NAME), indexLoadingConfig);
+
+ checkCorrectness();
+ }
+
+ @TearDown(Level.Trial)
+ public void tearDown() {
+ _indexSegment.destroy();
+ FileUtils.deleteQuietly(INDEX_DIR);
+ EXECUTOR_SERVICE.shutdownNow();
+ }
+
+ private List<GenericRow> createTestData(int numRows) {
+ List<GenericRow> rows = new ArrayList<>(numRows);
+ String[] domainNames = new String[]{
+ "www.domain1.com", "www.domain1.co.ab", "www.domain1.co.bc",
+ "www.domain1.co.cd", "www.sd.domain1.com", "www.sd.domain1.co.ab", "www.sd.domain1.co.bc",
+ "www.sd.domain1.co.cd", "www.domain2.com", "www.domain2.co.ab", "www.domain2.co.bc", "www.domain2.co.cd",
+ "www.sd.domain2.com", "www.sd.domain2.co.ab", "www.sd.domain2.co.bc", "www.sd.domain2.co.cd"
+ };
+ String[] noIndexData = new String[]{"test1", "test2", "test3", "test4", "test5"};
+ for (int i = 0; i < numRows; i++) {
+ String domain = domainNames[i % domainNames.length];
+ GenericRow row = new GenericRow();
+ row.putField(INT_COL_NAME, _intBaseValue + i);
+ row.putField(NO_INDEX_STRING_COL_NAME, noIndexData[i % noIndexData.length]);
+ row.putField(DOMAIN_NAMES_COL, domain);
+ rows.add(row);
+ }
+ return rows;
+ }
+
+ private void buildSegment(FSTType fstType)
+ throws Exception {
+ List<GenericRow> rows = createTestData(_numRows);
+ List<FieldConfig> fieldConfigs = new ArrayList<>();
+ fieldConfigs.add(
+ new FieldConfig(DOMAIN_NAMES_COL, FieldConfig.EncodingType.DICTIONARY, FieldConfig.IndexType.FST, null, null));
+ _tableConfig = new TableConfigBuilder(TableType.OFFLINE).setTableName(TABLE_NAME)
+ .setInvertedIndexColumns(Collections.singletonList(DOMAIN_NAMES_COL)).setFieldConfigList(fieldConfigs).build();
+ _schema = new Schema.SchemaBuilder().setSchemaName(TABLE_NAME)
+ .addSingleValueDimension(DOMAIN_NAMES_COL, FieldSpec.DataType.STRING)
+ .addSingleValueDimension(NO_INDEX_STRING_COL_NAME, FieldSpec.DataType.STRING)
+ .addMetric(INT_COL_NAME, FieldSpec.DataType.INT).build();
+ SegmentGeneratorConfig config = new SegmentGeneratorConfig(_tableConfig, _schema);
+ config.setOutDir(INDEX_DIR.getPath());
+ config.setTableName(TABLE_NAME);
+ config.setSegmentName(SEGMENT_NAME);
+ if (fstType != null) {
+ config.setFSTIndexType(fstType);
+ }
+ SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl();
+ try (RecordReader recordReader = new GenericRowRecordReader(rows)) {
+ driver.init(config, recordReader);
+ driver.build();
+ }
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void optimal1Like(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain0%'";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void optimal1Regex(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain0')";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void optimal10(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain\\d')";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void increasing10Like(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE "
+ + "DOMAIN_NAMES LIKE '%domain0%' or "
+ + "DOMAIN_NAMES LIKE '%domain1%' or "
+ + "DOMAIN_NAMES LIKE '%domain2%' or "
+ + "DOMAIN_NAMES LIKE '%domain3%' or "
+ + "DOMAIN_NAMES LIKE '%domain4%' or "
+ + "DOMAIN_NAMES LIKE '%domain5%' or "
+ + "DOMAIN_NAMES LIKE '%domain6%' or "
+ + "DOMAIN_NAMES LIKE '%domain7%' or "
+ + "DOMAIN_NAMES LIKE '%domain8%' or "
+ + "DOMAIN_NAMES LIKE '%domain9%'";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void increasing10Regex(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE "
+ + "regexp_like(DOMAIN_NAMES, '^.domain0.*$') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain1.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain2.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain3.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain4.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain5.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain6.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain7.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain8.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain9.*')";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void increasing10Fusing(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE "
+ + "regexp_like(DOMAIN_NAMES, '"
+ + "(?:^.*domain0.*$)|"
+ + "(?:^.*domain1.*$)|"
+ + "(?:^.*domain2.*$)|"
+ + "(?:^.*domain3.*$)|"
+ + "(?:^.*domain4.*$)|"
+ + "(?:^.*domain5.*$)|"
+ + "(?:^.*domain6.*$)|"
+ + "(?:^.*domain7.*$)|"
+ + "(?:^.*domain8.*$)|"
+ + "(?:^.*domain9.*$)')";
+
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void decreasing9Like(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE "
+ + "DOMAIN_NAMES LIKE '%domain9%' or "
+ + "DOMAIN_NAMES LIKE '%domain8%' or "
+ + "DOMAIN_NAMES LIKE '%domain7%' or "
+ + "DOMAIN_NAMES LIKE '%domain6%' or "
+ + "DOMAIN_NAMES LIKE '%domain5%' or "
+ + "DOMAIN_NAMES LIKE '%domain4%' or "
+ + "DOMAIN_NAMES LIKE '%domain3%' or "
+ + "DOMAIN_NAMES LIKE '%domain2%' or "
+ + "DOMAIN_NAMES LIKE '%domain1%'";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void decreasing9Regex(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE "
+ + "regexp_like(DOMAIN_NAMES, '^.domain9.*$') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain8.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain7.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain6.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain5.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain4.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain3.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain2.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain2.*')";
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Benchmark
+ @CompilerControl(CompilerControl.Mode.DONT_INLINE)
+ public void decreasing9Fusing(Blackhole bh) {
+ String query = "SELECT INT_COL FROM MyTable WHERE "
+ + "regexp_like(DOMAIN_NAMES, '"
+ + "(?:^.*domain9.*$)|"
+ + "(?:^.*domain8.*$)|"
+ + "(?:^.*domain7.*$)|"
+ + "(?:^.*domain6.*$)|"
+ + "(?:^.*domain5.*$)|"
+ + "(?:^.*domain4.*$)|"
+ + "(?:^.*domain3.*$)|"
+ + "(?:^.*domain2.*$)|"
+ + "(?:^.*domain1.*$)')";
+
+ bh.consume(getBrokerResponse(query));
+ }
+
+ @Override
+ protected String getFilter() {
+ return null;
+ }
+
+ @Override
+ protected IndexSegment getIndexSegment() {
+ return _indexSegment;
+ }
+
+ @Override
+ protected List<IndexSegment> getIndexSegments() {
+ return Collections.singletonList(_indexSegment);
+ }
+
+ public static void main(String[] args)
+ throws Exception {
+ new Runner(new OptionsBuilder().include(BenchmarkFuseRegexp.class.getSimpleName())
+// .addProfiler(JavaFlightRecorderProfiler.class)
+ .build()).run();
+ }
+
+ private void checkCorrectness() {
+ String query1 = "SELECT INT_COL FROM MyTable WHERE "
+ + "regexp_like(DOMAIN_NAMES, '^.domain9.*$') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain8.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain7.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain6.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain5.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain4.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain3.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain2.*') or "
+ + "regexp_like(DOMAIN_NAMES, '^.domain2.*')";
+ String query2 = "SELECT INT_COL FROM MyTable WHERE "
+ + "regexp_like(DOMAIN_NAMES, '"
+ + "(?:^.*domain9.*$)|"
+ + "(?:^.*domain8.*$)|"
+ + "(?:^.*domain7.*$)|"
+ + "(?:^.*domain6.*$)|"
+ + "(?:^.*domain5.*$)|"
+ + "(?:^.*domain4.*$)|"
+ + "(?:^.*domain3.*$)|"
+ + "(?:^.*domain2.*$)|"
+ + "(?:^.*domain1.*$)')";
+
+ long total1 = getBrokerResponse(query1).getTotalDocs();
+ long total2 = getBrokerResponse(query2).getTotalDocs();
+
+ if (total1 != total2) {
Review Comment:
This is not the correct check. This returns the total documents of the table, instead of the matching docs. `getNumDocsScanned` will return the documents matched
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154277210
> Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
>
> ```
> SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
> BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
> BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
> BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
>
> SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
> BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
> BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
> BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
> ```
>
> We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156850809
@gortiz @atris Thanks for taking time looking into this issue. Even though we might not move forward with the query rewrite, we got several valuable insights from the benchmark.
With regard to the query syntax, +100 on we should keep the query behavior consistent with or without the FST index. Currently we try to convert java regex to Lucene search query using `RegexpPatternConverterUtils.regexpLikeToLuceneRegExp()`. The support is very limited though, and might only be able to cover the basic `LIKE` predicate.
We want to utilize the FST to accelerate the regex matching (`LIKE` and `REGEXP_LIKE`). There are 2 approaches:
- Add a new predicate type, e.g. `FST_LIKE` and take the Lucene dialect
- Make FST work on java regex
I personally prefer the second approach because it can improve the existing query performance, but it requires changes to the native FST, and we cannot support Lucene FST anymore. Or we can support both, and make the `FST_LIKE` only work with Lucene FST.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152641627
@gortiz Thanks for the detailed analysis. This is awesome, and very interesting!
> ## We cannot use this optimization with Lucene
> The cost is too high when the expression gets complex. It even with ` java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states.` when the expression have 10 branches.
>
Curious how native FST handles it, and why is it so much faster than the Lucene FST. cc @atris
> ## In general `COL like 'whatever'` expression is 2x-3x worse than `regexp_like(COL, 'similar expression')`
> Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when `BaseQueriesTest.getBrokerResponse` is called (which is great, because we can test the difference).
I think the reason is that it is not apple-to-apple comparison. `col LIKE 'abc'` is actually equivalent to `regexp_like(col, '^abc$')`
> ## Pinot regex execute faster wildcards than a single values
> It is quite clear in the optimal cases:
>
> ```
> -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
> BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op
> -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
> BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op
> ```
>
> At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset.
>
> ```
> select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1
> ```
>
> Responds quite faster than
>
> ```
> select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1
> ```
>
> The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x.
>
> I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it.
Let's re-do the benchmark. I don't believe there can be 2000x difference. Also, this is very counter-intuitive, so I'd suggest let's figure out if it is actually from the java regex engine, or because of our code.
> ### When there are several regexp in the actual expression, indexes are very useful.
> Examples:
>
> ```
> BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 0.593 ± 0.362 ms/op
> BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 0.422 ± 0.222 ms/op
> BenchmarkFuseRegexp.decreasing9Regex null avgt 5 1053.673 ± 379.729 ms/op
>
> BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 0.590 ± 0.208 ms/op
> BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 0.435 ± 0.159 ms/op
> BenchmarkFuseRegexp.increasing10Regex null avgt 5 1096.238 ± 289.928 ms/op
> ```
I feel the inefficiency is from `regexp_like(DOMAIN_NAMES, 'domain0')` (exact match) which takes much longer than expected. We should check what is causing the huge performance difference
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154271827
>
Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
```
SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
```
We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1156103702
@gortiz There has been a long comment trail on this PR so I am summarising my responses in this comment:
1. Performance -- Like any other index, FST indices also have their worst case. FST is an automaton, and in an extreme scenario where a high number of branches are traversed for a query, the performance of the index will approach the brute force approximation.
In this case, the query you tried had an abnormally large number of branches, to the extent that Lucene decided to not even try and process the query. The average user rarely fires such queries, and in the corner cases where they do, they should be ok with a brute force latency, as long as their 99th percentile queries keep performing significantly better, which is the case as is evident from the benchmarks that test the FST index.
2. Language Semantics -- Lucene has, for long, published its regex dialect clearly. Engines based on Lucene, such as ElasticSearch and Apache Solr, follow the same dialect (https://help.oncrawl.com/en/articles/1685982-lucene-regex-cheat-sheet)
It is arguable whether this is correct or not, but we have to keep historical perspective in mind, since many Pinot users migrate from one of these engines.
Please note that, from native FST index's perspective, it is pretty simple to support java regex language. The reason I did not do that was because the existing FST index was based on Lucene and thus supported Lucene's regex dialect. For back compatibility, native FST index supports the same dialect.
3. Parsing And Reporting Invalid Regexes: I agree that in an ideal scenario, we should be parsing regexes and reporting characters that we do not accept. However, given the complexity of an average regex, the time complexity of doing this will be non trivial.
4. Specific To This PR: I looked into the PR and while the effort is appreciative, I do not see significant performance improvement. Also, the API surface changes required for this change are high, which makes me a bit uneasy.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154278541
> > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
> >
> > ```
> > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
> > BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
> > BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
> > BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
> >
> > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
> > BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
> > BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
> > BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
> > ```
> >
> > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
>
> The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
I doubt determination alone can cause such a massive difference.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1155193112
@gortiz another thing to remember is that neither Lucene nor native FST index follow the java regex pattern language.
If you look up the Lucene regex documentation, and look at the FST index and LIKE tests in Pinot, you will notice the difference.
Unfortunately, since we do not actually parse the regex engine until deep inside the FST territory, we have no way of reporting unknown characters. That may be a factor leading to 0 results for your queries
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: text_match fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145276586
I thought this was supposed to be targeting regexp_like
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1145835768
I didn't know LIKE expressions were evaluated as REGEX_LIKE. Reading the code, it seems that the transformation is made in `RequestContextUtils`, which is evaluated after the filter optimization is done. This means that (in case I don't miss something else) the optimizations included here will not affect LIKE expressions.
I can modify the optimizer to add the ability to transform LIKE expressions into REGEX_LIKE the same way `RequestContextUtils` does right now. But I'm going to focus on the benchmark that tests the scenarios @richardstartin mentioned, but explicitly using REGEXP_LIKE for now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] gortiz commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152506922
My last commit includes a JMH benchmark. I found several interesting things. Please take a look, some of my conclusions are so shocking that it wouldn't surprise me if another pair of eyes discovered that I did something wrong.
TL;DR: I think we should apply this optimization only when executed in columns without indexes. This may make the implementation more difficult, but the performance gain may be very large.
Benchmarks are:
- optimal1Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'domain0'`
- optimal1Regex: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain0')`
- optimal10: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain\d')`
- increasing10Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain0%' or DOMAIN_NAMES LIKE '%domain1%' or ... or DOMAIN_NAMES LIKE '%domain9%'`
- increasing10Regex: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES LIKE, '^.domain0.*$') or ... or regexp_like(DOMAIN_NAMES LIKE, '^.domain9.*$')`
- increasing10Fusing: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')`
- decreasing9Like, decreasing9Regex and decreasing9Fusing: Are like `increasing10Like`, `increasing10Regex` and `increasing10Fusing` but going from 9 to 1
- Note: they have in fact 9 elements. This a mistake. Initially queries where created dynamically and I did it wrong. But it was a lucky mistake that shows that Lucene fails when there are 10 fused expression but it doesn't fail when there are 9.
Last results I´ve got were:
```
Benchmark (_fstType) Mode Cnt Score Error Units
BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
BenchmarkFuseRegexp.decreasing9Like LUCENE avgt 5 0.615 ± 0.056 ms/op
BenchmarkFuseRegexp.decreasing9Like NATIVE avgt 5 0.515 ± 0.111 ms/op
BenchmarkFuseRegexp.decreasing9Like null avgt 5 3596.864 ± 1335.035 ms/op
BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 0.593 ± 0.362 ms/op
BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 0.422 ± 0.222 ms/op
BenchmarkFuseRegexp.decreasing9Regex null avgt 5 1053.673 ± 379.729 ms/op
BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
BenchmarkFuseRegexp.increasing10Like LUCENE avgt 5 0.696 ± 0.121 ms/op
BenchmarkFuseRegexp.increasing10Like NATIVE avgt 5 0.540 ± 0.049 ms/op
BenchmarkFuseRegexp.increasing10Like null avgt 5 3543.339 ± 183.168 ms/op
BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 0.590 ± 0.208 ms/op
BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 0.435 ± 0.159 ms/op
BenchmarkFuseRegexp.increasing10Regex null avgt 5 1096.238 ± 289.928 ms/op
BenchmarkFuseRegexp.optimal10 LUCENE avgt 5 0.169 ± 0.066 ms/op
BenchmarkFuseRegexp.optimal10 NATIVE avgt 5 0.136 ± 0.011 ms/op
BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op
BenchmarkFuseRegexp.optimal1Like LUCENE avgt 5 0.179 ± 0.066 ms/op
BenchmarkFuseRegexp.optimal1Like NATIVE avgt 5 0.134 ± 0.029 ms/op
BenchmarkFuseRegexp.optimal1Like null avgt 5 470.387 ± 187.235 ms/op
BenchmarkFuseRegexp.optimal1Regex LUCENE avgt 5 0.172 ± 0.029 ms/op
BenchmarkFuseRegexp.optimal1Regex NATIVE avgt 5 0.141 ± 0.027 ms/op
BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op
```
After analyzing it, there are some points that are quite clear:
## We cannot use this optimization with Lucene
The cost is too high when the expression gets complex. It even with `
java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states.` when the expression have 10 branches.
## In general `COL like 'whatever'` expression is 2x-3x worse than `regexp_like(COL, 'similar expression')`
Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when `BaseQueriesTest.getBrokerResponse` is called (which is great, because we can test the difference).
I've also tried the controller UI interface and this issue is repeated there. Using the github_events dataset,
```
select * from githubEvents where regexp_like(actor_url, '^.*victor\d.*$') limit 1
```
is quite faster than
```
select * from githubEvents where actor_url LIKE '%victor7%' limit 1
```
## Pinot regex execute faster wildcards than a single values
It is quite clear in the optimal cases:
```
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op
```
At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset.
```
select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1
```
Responds quite faster than
```
select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1
```
The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x.
I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it.
## Native or Lucene indexes don't actually increase the performance of this regexp
This benchmark wasn't focused on that and more studies should be done, but it seems clear that:
### When there are several regexp in the actual expression, indexes are very useful.
Examples:
```
BenchmarkFuseRegexp.decreasing9Regex LUCENE avgt 5 0.593 ± 0.362 ms/op
BenchmarkFuseRegexp.decreasing9Regex NATIVE avgt 5 0.422 ± 0.222 ms/op
BenchmarkFuseRegexp.decreasing9Regex null avgt 5 1053.673 ± 379.729 ms/op
BenchmarkFuseRegexp.increasing10Regex LUCENE avgt 5 0.590 ± 0.208 ms/op
BenchmarkFuseRegexp.increasing10Regex NATIVE avgt 5 0.435 ± 0.159 ms/op
BenchmarkFuseRegexp.increasing10Regex null avgt 5 1096.238 ± 289.928 ms/op
```
- When there is only one regexp in the actual expressions, indexes are not very useful.
For example:
```
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10 LUCENE avgt 5 0.169 ± 0.066 ms/op
BenchmarkFuseRegexp.optimal10 NATIVE avgt 5 0.136 ± 0.011 ms/op
BenchmarkFuseRegexp.optimal10 null avgt 5 0.149 ± 0.082 ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
BenchmarkFuseRegexp.optimal1Regex LUCENE avgt 5 0.172 ± 0.029 ms/op
BenchmarkFuseRegexp.optimal1Regex NATIVE avgt 5 0.141 ± 0.027 ms/op
BenchmarkFuseRegexp.optimal1Regex null avgt 5 243.160 ± 59.722 ms/op
```
The last one shows a big difference, but it may be related to the issue with regex without wildcards that I discussed above.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154336975
> > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
> > > ```
> > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
> > > BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
> > > BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
> > > BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
> > >
> > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
> > > BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
> > > BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
> > > BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
> > > ```
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
> >
> >
> > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
>
> I doubt determinization alone can cause such a massive difference.
Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] richardstartin commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154343989
> > > > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
> > > > > ```
> > > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
> > > > > BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
> > > > > BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
> > > > > BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
> > > > >
> > > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
> > > > > BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
> > > > > BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
> > > > > BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
> > > > > ```
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
> > > >
> > > >
> > > > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
> > >
> > >
> > > I doubt determinization alone can cause such a massive difference.
> >
> >
> > Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)
>
> I am still failing to understand what the concern is.
>
> Even if its plain determinization that causes native to be significantly faster, is that not a good thing?
@Jackie-Jiang was asking about the root cause of the difference. I suggested that the root cause is likely to be determinization, but it needs to be investigated. Where has concern been expressed?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] atris commented on pull request #8818: regexp_like fusing
Posted by GitBox <gi...@apache.org>.
atris commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1154338406
> > > > Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:
> > > > ```
> > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
> > > > BenchmarkFuseRegexp.decreasing9Fusing LUCENE avgt 5 24.198 ± 13.439 ms/op
> > > > BenchmarkFuseRegexp.decreasing9Fusing NATIVE avgt 5 0.263 ± 0.019 ms/op
> > > > BenchmarkFuseRegexp.decreasing9Fusing null avgt 5 0.476 ± 0.040 ms/op
> > > >
> > > > SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
> > > > BenchmarkFuseRegexp.increasing10Fusing LUCENE avgt 5 ERROR
> > > > BenchmarkFuseRegexp.increasing10Fusing NATIVE avgt 5 0.326 ± 0.138 ms/op
> > > > BenchmarkFuseRegexp.increasing10Fusing null avgt 5 0.226 ± 0.062 ms/op
> > > > ```
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.
> > >
> > >
> > > The difference will be due to determinization of the automaton, which the native implementation doesn’t do.
> >
> > I doubt determinization alone can cause such a massive difference.
>
> Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)
I am still failing to understand what the concern is.
Even if its plain determinization that causes native to be significantly faster, is that not a good thing?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org