You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/11/02 00:49:48 UTC

[GitHub] [pinot] jasperjiaguo opened a new pull request, #9708: Customize stopword for Lucene Index

jasperjiaguo opened a new pull request, #9708:
URL: https://github.com/apache/pinot/pull/9708

   Instructions:
   1. The PR has to be tagged with at least one of the following labels (*):
      1. `feature`
      2. `bugfix`
      3. `performance`
      4. `ui`
      5. `backward-incompat`
      6. `release-notes` (**)
   2. Remove these instructions before publishing the PR.
    
   (*) Other labels to consider:
   - `testing`
   - `dependencies`
   - `docker`
   - `kubernetes`
   - `observability`
   - `security`
   - `code-style`
   - `extension-point`
   - `refactor`
   - `cleanup`
   
   (**) Use `release-notes` label for scenarios like:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011288776


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +71,50 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,
+      String stopWordKey) {
+    if (columnProperties == null) {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011097757


##########
pinot-core/src/test/resources/data/text_search_data/skills.txt:
##########
@@ -21,4 +21,7 @@ C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernet
 Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
 Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems
 GET /administrator/ HTTP/1.1 200 4263 - Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0 - NullPointerException
-Foo worked in a lot of places and learned a lot of things
\ No newline at end of file
+Foo worked in a lot of places and learned a lot of things
+IT support, python, hardware debugging
+IT staff, workspace coordinator

Review Comment:
   Can we verify that without this fix, query that uses 'IT support OR IT Manager' will return the count same as 'support OR Manager's
   
   I think the test data needs further change. 



##########
pinot-core/src/test/resources/data/text_search_data/skills.txt:
##########
@@ -21,4 +21,7 @@ C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernet
 Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
 Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems
 GET /administrator/ HTTP/1.1 200 4263 - Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0 - NullPointerException
-Foo worked in a lot of places and learned a lot of things
\ No newline at end of file
+Foo worked in a lot of places and learned a lot of things
+IT support, python, hardware debugging
+IT staff, workspace coordinator

Review Comment:
   Can we verify that without this fix, query that uses 'IT support OR IT Manager' will return the count same as 'support OR Manager'
   
   I think the test data needs further change. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
amrishlal commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012229530


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java:
##########
@@ -319,12 +319,16 @@ public long getLatestIngestionTimestamp() {
 
       // Text index
       MutableTextIndex textIndex;
+      List<String> stopWordsInclude = null;
+      List<String> stopWordsExclude = null;

Review Comment:
   Minor: it seems like this is the only reason why a null check is needed in `getStandardAnalyzerWithCustomizedStopWords` for `stopWordsInclude` and `stopWordsExclude`.  Otherwise, all the other places that create stop world list are returning non null `List<String>` and marked as `NonNull`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011238308


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +71,50 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,

Review Comment:
   This is private helper ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jackjlli commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jackjlli commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012074635


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java:
##########
@@ -55,10 +59,15 @@ public class LuceneTextIndexCreator implements TextIndexCreator {
 
   private int _nextDocId = 0;
 
-  public static final CharArraySet ENGLISH_STOP_WORDS_SET = new CharArraySet(Arrays
-      .asList("a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no",
-          "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they", "this",
-          "to", "was", "will", "with", "those"), true);
+  public static HashSet<String> getDefaultEnglishStopWordsSet() {
+    return new HashSet<>(

Review Comment:
   nit: `ImmutableSet.of("a", "an", ...)` as the default set won't be changed?
   also make it `final`.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java:
##########
@@ -82,16 +91,20 @@ public class LuceneTextIndexCreator implements TextIndexCreator {
    *               no need to commit the index from the realtime side. So when the realtime segment
    *               is destroyed (which is after the realtime segment has been committed and converted
    *               to offline), we close this lucene index writer to release resources but don't commit.
-   *               This is the reason to have commit flag part of the constructor.
+   * @param stopWordsInclude the words to include in addition to the default stop word list

Review Comment:
   What if `stopWordsInclude` and `stopWordsExclude` have the same word, will the code throw any exception on that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011238148


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +71,50 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {

Review Comment:
   These 2 can be private ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #9708:
URL: https://github.com/apache/pinot/pull/9708#issuecomment-1306241575

   Thanks for adding the feature. Can you please update the PR description with an example on how to configure the stop words? Also, can you please update the pinot documentation about this new config?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012141151


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java:
##########
@@ -55,10 +59,15 @@ public class LuceneTextIndexCreator implements TextIndexCreator {
 
   private int _nextDocId = 0;
 
-  public static final CharArraySet ENGLISH_STOP_WORDS_SET = new CharArraySet(Arrays
-      .asList("a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no",
-          "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they", "this",
-          "to", "was", "will", "with", "those"), true);
+  public static HashSet<String> getDefaultEnglishStopWordsSet() {
+    return new HashSet<>(

Review Comment:
   It will be changed during stop word customization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011288361


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +71,50 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {

Review Comment:
   They are used for reader so cannot be private



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011230919


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,26 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed

Review Comment:
   Nvm.. I think I got it. 
   
   I think what you mean is `"exclude the stop words from stop word list"` and thus they should be indexed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jackjlli commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jackjlli commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012074635


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java:
##########
@@ -55,10 +59,15 @@ public class LuceneTextIndexCreator implements TextIndexCreator {
 
   private int _nextDocId = 0;
 
-  public static final CharArraySet ENGLISH_STOP_WORDS_SET = new CharArraySet(Arrays
-      .asList("a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no",
-          "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they", "this",
-          "to", "was", "will", "with", "those"), true);
+  public static HashSet<String> getDefaultEnglishStopWordsSet() {
+    return new HashSet<>(

Review Comment:
   nit: `ImmutableSet.of("a", "an", ...)` as the default set won't be changed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011096425


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,14 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, 'IT')";

Review Comment:
   Can we also try phrase queries to mimic what we saw in the customer prod use case ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012159054


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +73,49 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nonnull
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nonnull
+  private static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,
+      String stopWordKey) {
+    if (columnProperties == null) {
+      return Collections.EMPTY_LIST;
+    }
+    String includeWords = columnProperties.getOrDefault(stopWordKey, "");
+    return Arrays.stream(includeWords.split(FieldConfig.TEXT_INDEX_STOP_WORD_SEPERATOR))
+        .map(String::trim).collect(Collectors.toList());
+  }
+
+  public static StandardAnalyzer getStandardAnalyzerWithCustomizedStopWords(List<String> stopWordsInclude,
+      List<String> stopWordsExclude) {
+    HashSet<String> stopWordSet = LuceneTextIndexCreator.getDefaultEnglishStopWordsSet();
+    if (stopWordsInclude != null) {
+      stopWordSet.addAll(stopWordsInclude);
+    }
+    if (stopWordsExclude != null) {
+      stopWordsExclude.forEach(stopWordSet::remove);

Review Comment:
   Oh this was actually suggested by intelliJ. I think the reason was https://stackoverflow.com/questions/28671903/the-hashsett-removeall-method-is-surprisingly-slow
   ![Screen Shot 2022-11-02 at 11 24 28 AM](https://user-images.githubusercontent.com/10736840/199573127-04eff941-851a-4801-a052-8a31f757a9e8.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011288983


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,26 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed

Review Comment:
   Updated the comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011183270


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,14 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, 'IT')";

Review Comment:
   Done



##########
pinot-core/src/test/resources/data/text_search_data/skills.txt:
##########
@@ -21,4 +21,7 @@ C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernet
 Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
 Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems
 GET /administrator/ HTTP/1.1 200 4263 - Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0 - NullPointerException
-Foo worked in a lot of places and learned a lot of things
\ No newline at end of file
+Foo worked in a lot of places and learned a lot of things
+IT support, python, hardware debugging
+IT staff, workspace coordinator

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011233715


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,26 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '\"IT support\" or \"IT manager\"')";
+    testInterSegmentAggregationQueryHelper(query, 8);
+
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '\"IT\"')";
+    testInterSegmentAggregationQueryHelper(query, 16);
+
+    // query without stop-words
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '\"support\" or \"manager\"')";
+    testInterSegmentAggregationQueryHelper(query, 12);

Review Comment:
   Can we add a test where for another text index column, we don't set the excludeStopWords IT in the FieldConfig and thus the result of search `'\"IT support\" or \"IT manager\"'` is same as `'\"support\" or \"manager\"'`  which is 12



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011288572


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,26 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '\"IT support\" or \"IT manager\"')";
+    testInterSegmentAggregationQueryHelper(query, 8);
+
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '\"IT\"')";
+    testInterSegmentAggregationQueryHelper(query, 16);
+
+    // query without stop-words
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '\"support\" or \"manager\"')";
+    testInterSegmentAggregationQueryHelper(query, 12);

Review Comment:
   Good point. Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012236489


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java:
##########
@@ -319,12 +319,16 @@ public long getLatestIngestionTimestamp() {
 
       // Text index
       MutableTextIndex textIndex;
+      List<String> stopWordsInclude = null;
+      List<String> stopWordsExclude = null;

Review Comment:
   Oh the null checking is for easier testing of `LuceneTextIndexCreator` as well 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #9708:
URL: https://github.com/apache/pinot/pull/9708#issuecomment-1299478437

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/9708?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#9708](https://codecov.io/gh/apache/pinot/pull/9708?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0cb9609) into [master](https://codecov.io/gh/apache/pinot/commit/e4ee05355ca4c177fdc10b35245c37b6a94f8089?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e4ee053) will **increase** coverage by `4.48%`.
   > The diff coverage is `93.47%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #9708      +/-   ##
   ============================================
   + Coverage     62.93%   67.41%   +4.48%     
   + Complexity     5178     5135      -43     
   ============================================
     Files          1938     1447     -491     
     Lines        103986    75695   -28291     
     Branches      15766    12058    -3708     
   ============================================
   - Hits          65440    51033   -14407     
   + Misses        33677    20987   -12690     
   + Partials       4869     3675    -1194     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `67.41% <93.47%> (-0.02%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/9708?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...org/apache/pinot/spi/config/table/FieldConfig.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvY29uZmlnL3RhYmxlL0ZpZWxkQ29uZmlnLmphdmE=) | `96.42% <ø> (ø)` | |
   | [...local/indexsegment/mutable/MutableSegmentImpl.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9pbmRleHNlZ21lbnQvbXV0YWJsZS9NdXRhYmxlU2VnbWVudEltcGwuamF2YQ==) | `57.83% <40.00%> (-0.05%)` | :arrow_down: |
   | [...me/impl/invertedindex/RealtimeLuceneTextIndex.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2ludmVydGVkaW5kZXgvUmVhbHRpbWVMdWNlbmVUZXh0SW5kZXguamF2YQ==) | `66.10% <100.00%> (ø)` | |
   | [...ment/creator/impl/DefaultIndexCreatorProvider.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9EZWZhdWx0SW5kZXhDcmVhdG9yUHJvdmlkZXIuamF2YQ==) | `67.44% <100.00%> (ø)` | |
   | [...ment/creator/impl/SegmentColumnarIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9TZWdtZW50Q29sdW1uYXJJbmRleENyZWF0b3IuamF2YQ==) | `79.88% <100.00%> (+0.27%)` | :arrow_up: |
   | [...ment/creator/impl/text/LuceneTextIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC90ZXh0L0x1Y2VuZVRleHRJbmRleENyZWF0b3IuamF2YQ==) | `79.59% <100.00%> (+0.86%)` | :arrow_up: |
   | [...t/index/loader/invertedindex/TextIndexHandler.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L2xvYWRlci9pbnZlcnRlZGluZGV4L1RleHRJbmRleEhhbmRsZXIuamF2YQ==) | `86.51% <100.00%> (+0.47%)` | :arrow_up: |
   | [...ment/index/readers/text/LuceneTextIndexReader.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2luZGV4L3JlYWRlcnMvdGV4dC9MdWNlbmVUZXh0SW5kZXhSZWFkZXIuamF2YQ==) | `79.68% <100.00%> (+0.65%)` | :arrow_up: |
   | [...ot/segment/local/segment/store/TextIndexUtils.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L3N0b3JlL1RleHRJbmRleFV0aWxzLmphdmE=) | `87.87% <100.00%> (+11.40%)` | :arrow_up: |
   | [...inot/segment/spi/creator/IndexCreationContext.java](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL2NyZWF0b3IvSW5kZXhDcmVhdGlvbkNvbnRleHQuamF2YQ==) | `100.00% <100.00%> (ø)` | |
   | ... and [747 more](https://codecov.io/gh/apache/pinot/pull/9708/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012145231


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java:
##########
@@ -82,16 +91,20 @@ public class LuceneTextIndexCreator implements TextIndexCreator {
    *               no need to commit the index from the realtime side. So when the realtime segment
    *               is destroyed (which is after the realtime segment has been committed and converted
    *               to offline), we close this lucene index writer to release resources but don't commit.
-   *               This is the reason to have commit flag part of the constructor.
+   * @param stopWordsInclude the words to include in addition to the default stop word list

Review Comment:
   The current behavior is to prioritize exclude these duplicate words. I'm planning to put default `ENGLISH_STOP_WORDS_SET`, config keys, and this behavior in the user doc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012141151


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java:
##########
@@ -55,10 +59,15 @@ public class LuceneTextIndexCreator implements TextIndexCreator {
 
   private int _nextDocId = 0;
 
-  public static final CharArraySet ENGLISH_STOP_WORDS_SET = new CharArraySet(Arrays
-      .asList("a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no",
-          "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they", "this",
-          "to", "was", "will", "with", "those"), true);
+  public static HashSet<String> getDefaultEnglishStopWordsSet() {
+    return new HashSet<>(

Review Comment:
   It will be changed during stop word customization. The default set (static final) will be initialized once by this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jackjlli commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jackjlli commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012230918


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +73,49 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nonnull
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nonnull
+  private static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,
+      String stopWordKey) {
+    if (columnProperties == null) {
+      return Collections.EMPTY_LIST;
+    }
+    String includeWords = columnProperties.getOrDefault(stopWordKey, "");
+    return Arrays.stream(includeWords.split(FieldConfig.TEXT_INDEX_STOP_WORD_SEPERATOR))
+        .map(String::trim).collect(Collectors.toList());
+  }
+
+  public static StandardAnalyzer getStandardAnalyzerWithCustomizedStopWords(List<String> stopWordsInclude,
+      List<String> stopWordsExclude) {
+    HashSet<String> stopWordSet = LuceneTextIndexCreator.getDefaultEnglishStopWordsSet();
+    if (stopWordsInclude != null) {
+      stopWordSet.addAll(stopWordsInclude);
+    }
+    if (stopWordsExclude != null) {
+      stopWordsExclude.forEach(stopWordSet::remove);

Review Comment:
   I see. I'm fine with that. Thanks for the pointer!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011288095


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +71,50 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,

Review Comment:
   DOne



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jackjlli commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jackjlli commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012146866


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +73,49 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nonnull
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nonnull
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nonnull
+  private static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,
+      String stopWordKey) {
+    if (columnProperties == null) {
+      return Collections.EMPTY_LIST;
+    }
+    String includeWords = columnProperties.getOrDefault(stopWordKey, "");
+    return Arrays.stream(includeWords.split(FieldConfig.TEXT_INDEX_STOP_WORD_SEPERATOR))
+        .map(String::trim).collect(Collectors.toList());
+  }
+
+  public static StandardAnalyzer getStandardAnalyzerWithCustomizedStopWords(List<String> stopWordsInclude,
+      List<String> stopWordsExclude) {
+    HashSet<String> stopWordSet = LuceneTextIndexCreator.getDefaultEnglishStopWordsSet();
+    if (stopWordsInclude != null) {
+      stopWordSet.addAll(stopWordsInclude);
+    }
+    if (stopWordsExclude != null) {
+      stopWordsExclude.forEach(stopWordSet::remove);

Review Comment:
   nit: how about using `stopWordSet.removeAll(stopWordsExclude);`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jasperjiaguo commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
jasperjiaguo commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012147254


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java:
##########
@@ -57,8 +58,11 @@ public class RealtimeLuceneTextIndex implements MutableTextIndex {
    * @param column column name
    * @param segmentIndexDir realtime segment consumer dir
    * @param segmentName realtime segment name
+   * @param stopWordsInclude stop words to include from default stop words

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on PR #9708:
URL: https://github.com/apache/pinot/pull/9708#issuecomment-1301479060

   The tests pass locally and this failed even after retriggering couple of times. 
   
   Merging this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011227863


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java:
##########
@@ -64,4 +71,50 @@ public static boolean isFstTypeNative(@Nullable Map<String, String> textIndexPro
   public static FSTType getFSTTypeOfIndex(File indexDir, String column) {
     return SegmentDirectoryPaths.findTextIndexIndexFile(indexDir, column) != null ? FSTType.LUCENE : FSTType.NATIVE;
   }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsInclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(String colName,
+      Map<String, Map<String, String>> columnProperties) {
+    return extractStopWordsExclude(columnProperties.getOrDefault(colName, null));
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsInclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_INCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> extractStopWordsExclude(Map<String, String> columnProperty) {
+    return parseEntryAsString(columnProperty, FieldConfig.TEXT_INDEX_STOP_WORD_EXCLUDE_KEY);
+  }
+
+  @Nullable
+  public static List<String> parseEntryAsString(@Nullable Map<String, String> columnProperties,
+      String stopWordKey) {
+    if (columnProperties == null) {

Review Comment:
   (nit) Returning empty may be better to avoid NPE. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011096425


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,14 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed
+    query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, 'IT')";

Review Comment:
   Can we try phrase queries to mimic what we saw in the customer prod use case ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1011229923


##########
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java:
##########
@@ -1864,6 +1878,26 @@ public void testInterSegment() {
     query = "SELECT count(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL_DICT, 'a and or in the are')";
     testInterSegmentAggregationQueryHelper(query, 0);
 
+    // query with excluded stop-words. they should be indexed

Review Comment:
   This seems a bit confusing. If this test is for excluding the stop words, then the comment should say `"they should not be indexed"` I guess ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia merged pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia merged PR #9708:
URL: https://github.com/apache/pinot/pull/9708


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] somandal commented on a diff in pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
somandal commented on code in PR #9708:
URL: https://github.com/apache/pinot/pull/9708#discussion_r1012069284


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java:
##########
@@ -57,8 +58,11 @@ public class RealtimeLuceneTextIndex implements MutableTextIndex {
    * @param column column name
    * @param segmentIndexDir realtime segment consumer dir
    * @param segmentName realtime segment name
+   * @param stopWordsInclude stop words to include from default stop words

Review Comment:
   nit: the comment can be reworded to be a bit more clear: "stop words to include in addition to default stop words"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] siddharthteotia commented on pull request #9708: Customize stopword for Lucene Index

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on PR #9708:
URL: https://github.com/apache/pinot/pull/9708#issuecomment-1301174096

   @jasperjiaguo can you also rebase once if not already. Just in case we see some spurious failures again. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org