You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "jackluo923 (via GitHub)" <gi...@apache.org> on 2023/11/19 02:49:01 UTC
[PR] Configurable Lucene analyzer [pinot]
jackluo923 opened a new pull request, #12027:
URL: https://github.com/apache/pinot/pull/12027
Currently, Pinot hard-coded the Lucene analyzer (`standardAnalyzer`) to tokenize strings for indexing and search. In various scenarios, it is extremely useful to customize the analyzer. There are at least two other users who have requested this feature in https://github.com/apache/pinot/issues/9154.
This PR introduces the capability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis. Specifically, this PR allows user to specify the FQCN (fully qualified class name) of the Lucene analyzer to use in the text index:
```
fieldConfigList: [
{
"name": "columnName",
"indexType": "TEXT",
"indexTypes": [
"TEXT"
],
"properties": {
"luceneAnalyzerFQCN": "org.apache.lucene.analysis.core.KeywordAnalyzer"
},
}
]
```
**Default Behavior**
If user did not specify the `luceneAnalyzerFQCN` property, the behavior is exactly the same as before which is to use the StandardAnalyzer with couple configuration properties.
**User Specified Behavior**
When user specifies the `luceneAnalyzerFQCN`, the default constructor of the specified Lucene analyzer class is invoked via reflection to create a Lucene analyzer. If user-specified analyzer class does not exist, the ReflectionOperationException is caught and a runtime exception with a more meaningful exception message is thrown.
**Testing**
This configurable Lucene analyzer feature is currently used in production to index and search large amount of text data on multi-variable text columns using `KeywordAnalyzer`. All existing unit tests with the default behavior are passing.
tags: `release-notes`, `enhancement`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
Re: [PR] Configurable Lucene analyzer [pinot]
Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12027:
URL: https://github.com/apache/pinot/pull/12027#discussion_r1414737279
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -112,6 +113,12 @@ private static List<String> parseEntryAsString(@Nullable Map<String, String> col
.map(String::trim).collect(Collectors.toList());
}
+ public static Analyzer getAnalyzerFromFQCN(@Nonnull String luceneAnalyzerFQCN) throws
Review Comment:
```suggestion
public static Analyzer getAnalyzerFromClassName(String luceneAnalyzerClass) throws
```
##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java:
##########
@@ -172,6 +187,11 @@ public AbstractBuilder withLuceneMaxBufferSizeMB(int maxBufferSizeMB) {
_luceneMaxBufferSizeMB = maxBufferSizeMB;
return this;
}
+
+ public AbstractBuilder withLuceneAnalyzerFQCN(String luceneAnalyzerFQCN) {
Review Comment:
```suggestion
public AbstractBuilder withLuceneAnalyzerClass(String luceneAnalyzerClass) {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
Re: [PR] Configurable Lucene analyzer [pinot]
Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #12027:
URL: https://github.com/apache/pinot/pull/12027#issuecomment-1830949434
## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
Attention: `14 lines` in your changes are missing coverage. Please review.
> Comparison is base [(`f7f8260`)](https://app.codecov.io/gh/apache/pinot/commit/f7f82608e098b26f2d241627d8a0a811cd8f5fe8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.63% compared to head [(`a60d47f`)](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.66%.
> Report is 21 commits behind head on master.
| [Files](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
|---|---|---|
| [...ment/creator/impl/text/LuceneTextIndexCreator.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC90ZXh0L0x1Y2VuZVRleHRJbmRleENyZWF0b3IuamF2YQ==) | 50.00% | [3 Missing and 2 partials :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
| [...pache/pinot/segment/spi/index/TextIndexConfig.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL2luZGV4L1RleHRJbmRleENvbmZpZy5qYXZh) | 60.00% | [3 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
| [...cal/segment/index/text/TextIndexConfigBuilder.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3RleHQvVGV4dEluZGV4Q29uZmlnQnVpbGRlci5qYXZh) | 0.00% | [1 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
| [...ment/index/readers/text/LuceneTextIndexReader.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3JlYWRlcnMvdGV4dC9MdWNlbmVUZXh0SW5kZXhSZWFkZXIuamF2YQ==) | 83.33% | [0 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
| [...egment/local/segment/index/text/TextIndexType.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3RleHQvVGV4dEluZGV4VHlwZS5qYXZh) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
| [...ot/segment/local/segment/store/TextIndexUtils.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L3N0b3JlL1RleHRJbmRleFV0aWxzLmphdmE=) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
<details><summary>Additional details and impacted files</summary>
```diff
@@ Coverage Diff @@
## master #12027 +/- ##
============================================
+ Coverage 61.63% 61.66% +0.03%
- Complexity 207 1152 +945
============================================
Files 2385 2385
Lines 129379 129397 +18
Branches 20033 20038 +5
============================================
+ Hits 79742 79793 +51
+ Misses 43826 43775 -51
- Partials 5811 5829 +18
```
| [Flag](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
|---|---|---|
| [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (ø)` | |
| [integration](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (ø)` | |
| [integration1](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (ø)` | |
| [integration2](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (ø)` | |
| [java-11](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.59% <57.57%> (+0.01%)` | :arrow_up: |
| [java-21](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.52% <57.57%> (+0.01%)` | :arrow_up: |
| [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.65% <57.57%> (+0.04%)` | :arrow_up: |
| [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.48% <57.57%> (-0.01%)` | :arrow_down: |
| [temurin](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.66% <57.57%> (+0.03%)` | :arrow_up: |
| [unittests](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.66% <57.57%> (+0.03%)` | :arrow_up: |
| [unittests1](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `46.97% <18.18%> (+<0.01%)` | :arrow_up: |
| [unittests2](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.59% <39.39%> (+0.02%)` | :arrow_up: |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
</details>
[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
Re: [PR] Configurable Lucene analyzer [pinot]
Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang merged PR #12027:
URL: https://github.com/apache/pinot/pull/12027
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
Re: [PR] Configurable Lucene analyzer [pinot]
Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12027:
URL: https://github.com/apache/pinot/pull/12027#discussion_r1408514071
##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java:
##########
@@ -57,7 +59,9 @@ public TextIndexConfig(
@JsonProperty("stopWordsInclude") List<String> stopWordsInclude,
@JsonProperty("stopWordsExclude") List<String> stopWordsExclude,
@JsonProperty("luceneUseCompoundFile") Boolean luceneUseCompoundFile,
- @JsonProperty("luceneMaxBufferSizeMB") Integer luceneMaxBufferSizeMB) {
+ @JsonProperty("luceneMaxBufferSizeMB") Integer luceneMaxBufferSizeMB,
+ @JsonProperty("luceneAnalyzerFQCN") String luceneAnalyzerFQCN
Review Comment:
Suggest renaming it to `luceneAnalyzerClass` to match the other similar configs
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java:
##########
@@ -62,9 +60,12 @@ public class RealtimeLuceneTextIndex implements MutableTextIndex {
* @param segmentName realtime segment name
* @param stopWordsInclude the words to include in addition to the default stop word list
* @param stopWordsExclude stop words to exclude from default stop words
+ * @param maxBufferSizeMB maximum size of the Lucene index buffer
+ * @param luceneAnalyzerFQCN fully qualified class name of the Lucene analyzer used for tokenization
*/
public RealtimeLuceneTextIndex(String column, File segmentIndexDir, String segmentName,
- List<String> stopWordsInclude, List<String> stopWordsExclude, boolean useCompoundFile, int maxBufferSizeMB) {
+ List<String> stopWordsInclude, List<String> stopWordsExclude, boolean useCompoundFile, int maxBufferSizeMB,
+ String luceneAnalyzerFQCN) {
Review Comment:
Shall we directly pass in the `TextIndexConfig` here instead of keeping adding extra parameters?
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java:
##########
@@ -120,7 +122,8 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
Callable<MutableRoaringBitmap> searchCallable = () -> {
IndexSearcher indexSearcher = null;
try {
- Query query = new QueryParser(_column, _analyzer).parse(searchQuery);
+ Query query =
+ new QueryParser(_column, _indexCreator.getIndexWriter().getConfig().getAnalyzer()).parse(searchQuery);
Review Comment:
(minor) Shall we cache the `_analyzer` after creating the `IndexWriter`?
##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java:
##########
@@ -51,6 +51,9 @@ public class FieldConfig extends BaseJsonConfig {
public static final String TEXT_INDEX_STOP_WORD_EXCLUDE_KEY = "stopWordExclude";
public static final String TEXT_INDEX_LUCENE_USE_COMPOUND_FILE = "luceneUseCompoundFile";
public static final String TEXT_INDEX_LUCENE_MAX_BUFFER_SIZE_MB = "luceneMaxBufferSizeMB";
+ public static final String TEXT_INDEX_LUCENE_ANALYZER_FQCN = "luceneAnalyzerFQCN";
+ public static final String TEXT_INDEX_DEFAULT_LUCENE_ANALYZER_FQCN
+ = "org.apache.lucene.analysis.standard.StandardAnalyzer";
Review Comment:
```suggestion
public static final String TEXT_INDEX_LUCENE_ANALYZER_CLASS = "luceneAnalyzerClass";
public static final String DEFAULT_TEXT_INDEX_LUCENE_ANALYZER_CLASS
= "org.apache.lucene.analysis.standard.StandardAnalyzer";
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
Re: [PR] Configurable Lucene analyzer [pinot]
Posted by "jackluo923 (via GitHub)" <gi...@apache.org>.
jackluo923 commented on PR #12027:
URL: https://github.com/apache/pinot/pull/12027#issuecomment-1839947856
> LGTM. Please update the description to reflect the config key change.
>
> Could you please help add this to the pinot doc after it is merged?
Description is updated and I will add the necessary modifications to the Pinot doc repo after this pull request is merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org