You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "jackluo923 (via GitHub)" <gi...@apache.org> on 2023/11/19 02:49:01 UTC

[PR] Configurable Lucene analyzer [pinot]

jackluo923 opened a new pull request, #12027:
URL: https://github.com/apache/pinot/pull/12027

   Currently, Pinot hard-coded the Lucene analyzer (`standardAnalyzer`) to tokenize strings for indexing and search. In various scenarios, it is extremely useful to customize the analyzer. There are at least two other users who have requested this feature in https://github.com/apache/pinot/issues/9154. 
   
   This PR introduces the capability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis. Specifically, this PR allows user to specify the FQCN (fully qualified class name) of the Lucene analyzer to use in the text index:
   ```
   fieldConfigList: [
      {
           "name": "columnName",
           "indexType": "TEXT",
           "indexTypes": [
             "TEXT"
           ],
           "properties": {
             "luceneAnalyzerFQCN": "org.apache.lucene.analysis.core.KeywordAnalyzer"
           },
         }
     ]
     ```    
   
   **Default Behavior**
   If user did not specify the `luceneAnalyzerFQCN` property, the behavior is exactly the same as before which is to use the StandardAnalyzer with couple configuration properties.
   
   **User Specified Behavior**
   When user specifies the `luceneAnalyzerFQCN`, the default constructor of the specified Lucene analyzer class is invoked via reflection to create a Lucene analyzer. If user-specified analyzer class does not exist, the ReflectionOperationException is caught and a runtime exception with a more meaningful exception message is thrown.
   
   **Testing**
   This configurable Lucene analyzer feature is currently used in production to index and search large amount of text data on multi-variable text columns using `KeywordAnalyzer`. All existing unit tests with the default behavior are passing.
   
   tags: `release-notes`, `enhancement`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Configurable Lucene analyzer [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12027:
URL: https://github.com/apache/pinot/pull/12027#discussion_r1414737279


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -112,6 +113,12 @@ private static List<String> parseEntryAsString(@Nullable Map<String, String> col
         .map(String::trim).collect(Collectors.toList());
   }
 
+  public static Analyzer getAnalyzerFromFQCN(@Nonnull String luceneAnalyzerFQCN) throws

Review Comment:
   ```suggestion
     public static Analyzer getAnalyzerFromClassName(String luceneAnalyzerClass) throws
   ```



##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java:
##########
@@ -172,6 +187,11 @@ public AbstractBuilder withLuceneMaxBufferSizeMB(int maxBufferSizeMB) {
       _luceneMaxBufferSizeMB = maxBufferSizeMB;
       return this;
     }
+
+    public AbstractBuilder withLuceneAnalyzerFQCN(String luceneAnalyzerFQCN) {

Review Comment:
   ```suggestion
       public AbstractBuilder withLuceneAnalyzerClass(String luceneAnalyzerClass) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Configurable Lucene analyzer [pinot]

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #12027:
URL: https://github.com/apache/pinot/pull/12027#issuecomment-1830949434

   ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   Attention: `14 lines` in your changes are missing coverage. Please review.
   > Comparison is base [(`f7f8260`)](https://app.codecov.io/gh/apache/pinot/commit/f7f82608e098b26f2d241627d8a0a811cd8f5fe8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.63% compared to head [(`a60d47f`)](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.66%.
   > Report is 21 commits behind head on master.
   
   | [Files](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [...ment/creator/impl/text/LuceneTextIndexCreator.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC90ZXh0L0x1Y2VuZVRleHRJbmRleENyZWF0b3IuamF2YQ==) | 50.00% | [3 Missing and 2 partials :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...pache/pinot/segment/spi/index/TextIndexConfig.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL2luZGV4L1RleHRJbmRleENvbmZpZy5qYXZh) | 60.00% | [3 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...cal/segment/index/text/TextIndexConfigBuilder.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3RleHQvVGV4dEluZGV4Q29uZmlnQnVpbGRlci5qYXZh) | 0.00% | [1 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...ment/index/readers/text/LuceneTextIndexReader.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3JlYWRlcnMvdGV4dC9MdWNlbmVUZXh0SW5kZXhSZWFkZXIuamF2YQ==) | 83.33% | [0 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...egment/local/segment/index/text/TextIndexType.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3RleHQvVGV4dEluZGV4VHlwZS5qYXZh) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [...ot/segment/local/segment/store/TextIndexUtils.java](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L3N0b3JlL1RleHRJbmRleFV0aWxzLmphdmE=) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   <details><summary>Additional details and impacted files</summary>
   
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #12027      +/-   ##
   ============================================
   + Coverage     61.63%   61.66%   +0.03%     
   - Complexity      207     1152     +945     
   ============================================
     Files          2385     2385              
     Lines        129379   129397      +18     
     Branches      20033    20038       +5     
   ============================================
   + Hits          79742    79793      +51     
   + Misses        43826    43775      -51     
   - Partials       5811     5829      +18     
   ```
   
   | [Flag](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (ø)` | |
   | [integration](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (ø)` | |
   | [integration1](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (ø)` | |
   | [integration2](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (ø)` | |
   | [java-11](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.59% <57.57%> (+0.01%)` | :arrow_up: |
   | [java-21](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.52% <57.57%> (+0.01%)` | :arrow_up: |
   | [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.65% <57.57%> (+0.04%)` | :arrow_up: |
   | [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.48% <57.57%> (-0.01%)` | :arrow_down: |
   | [temurin](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.66% <57.57%> (+0.03%)` | :arrow_up: |
   | [unittests](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `61.66% <57.57%> (+0.03%)` | :arrow_up: |
   | [unittests1](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `46.97% <18.18%> (+<0.01%)` | :arrow_up: |
   | [unittests2](https://app.codecov.io/gh/apache/pinot/pull/12027/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.59% <39.39%> (+0.02%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   
   </details>
   
   [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/pinot/pull/12027?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).   
   :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Configurable Lucene analyzer [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang merged PR #12027:
URL: https://github.com/apache/pinot/pull/12027


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Configurable Lucene analyzer [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12027:
URL: https://github.com/apache/pinot/pull/12027#discussion_r1408514071


##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java:
##########
@@ -57,7 +59,9 @@ public TextIndexConfig(
       @JsonProperty("stopWordsInclude") List<String> stopWordsInclude,
       @JsonProperty("stopWordsExclude") List<String> stopWordsExclude,
       @JsonProperty("luceneUseCompoundFile") Boolean luceneUseCompoundFile,
-      @JsonProperty("luceneMaxBufferSizeMB") Integer luceneMaxBufferSizeMB) {
+      @JsonProperty("luceneMaxBufferSizeMB") Integer luceneMaxBufferSizeMB,
+      @JsonProperty("luceneAnalyzerFQCN") String luceneAnalyzerFQCN

Review Comment:
   Suggest renaming it to `luceneAnalyzerClass` to match the other similar configs



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java:
##########
@@ -62,9 +60,12 @@ public class RealtimeLuceneTextIndex implements MutableTextIndex {
    * @param segmentName realtime segment name
    * @param stopWordsInclude the words to include in addition to the default stop word list
    * @param stopWordsExclude stop words to exclude from default stop words
+   * @param maxBufferSizeMB maximum size of the Lucene index buffer
+   * @param luceneAnalyzerFQCN fully qualified class name of the Lucene analyzer used for tokenization
    */
   public RealtimeLuceneTextIndex(String column, File segmentIndexDir, String segmentName,
-      List<String> stopWordsInclude, List<String> stopWordsExclude, boolean useCompoundFile, int maxBufferSizeMB) {
+      List<String> stopWordsInclude, List<String> stopWordsExclude, boolean useCompoundFile, int maxBufferSizeMB,
+      String luceneAnalyzerFQCN) {

Review Comment:
   Shall we directly pass in the `TextIndexConfig` here instead of keeping adding extra parameters?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java:
##########
@@ -120,7 +122,8 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
     Callable<MutableRoaringBitmap> searchCallable = () -> {
       IndexSearcher indexSearcher = null;
       try {
-        Query query = new QueryParser(_column, _analyzer).parse(searchQuery);
+        Query query =
+            new QueryParser(_column, _indexCreator.getIndexWriter().getConfig().getAnalyzer()).parse(searchQuery);

Review Comment:
   (minor) Shall we cache the `_analyzer` after creating the `IndexWriter`?



##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java:
##########
@@ -51,6 +51,9 @@ public class FieldConfig extends BaseJsonConfig {
   public static final String TEXT_INDEX_STOP_WORD_EXCLUDE_KEY = "stopWordExclude";
   public static final String TEXT_INDEX_LUCENE_USE_COMPOUND_FILE = "luceneUseCompoundFile";
   public static final String TEXT_INDEX_LUCENE_MAX_BUFFER_SIZE_MB = "luceneMaxBufferSizeMB";
+  public static final String TEXT_INDEX_LUCENE_ANALYZER_FQCN = "luceneAnalyzerFQCN";
+  public static final String TEXT_INDEX_DEFAULT_LUCENE_ANALYZER_FQCN
+          = "org.apache.lucene.analysis.standard.StandardAnalyzer";

Review Comment:
   ```suggestion
     public static final String TEXT_INDEX_LUCENE_ANALYZER_CLASS = "luceneAnalyzerClass";
     public static final String DEFAULT_TEXT_INDEX_LUCENE_ANALYZER_CLASS
             = "org.apache.lucene.analysis.standard.StandardAnalyzer";
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Configurable Lucene analyzer [pinot]

Posted by "jackluo923 (via GitHub)" <gi...@apache.org>.
jackluo923 commented on PR #12027:
URL: https://github.com/apache/pinot/pull/12027#issuecomment-1839947856

   > LGTM. Please update the description to reflect the config key change.
   > 
   > Could you please help add this to the pinot doc after it is merged?
   
   Description is updated and I will add the necessary modifications to the Pinot doc repo after this pull request is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org