Posted to commits@pinot.apache.org by "deemoliu (via GitHub)" <gi...@apache.org> on 2024/02/09 21:54:18 UTC

[PR] Add Prefix, Suffix and Ngram UDFs [pinot]

deemoliu opened a new pull request, #12392:
URL: https://github.com/apache/pinot/pull/12392

   `feature`: Adding ngram, prefix, postfix UDFs
   
   Context:
   
   We are onboarding a use case and trying to increase query throughput. In our tests, QPS could not be improved further with the existing REGEXP_LIKE or TEXT_MATCH queries. The queries are as follows:
   `select col1, col2 from table where REGEXP_LIKE(col3, '^data*')`
   `select col1, col2 from table where REGEXP_LIKE(col3, 'data$')`
   `select col1, col2 from table where REGEXP_LIKE(col3, '*data*')`
   `select col1, col2 from table where TEXT_MATCH(col3, '/data*/')`
   ...
   
   The plan is to generate derived columns that persist the prefixes, postfixes, and ngrams of a field, use inverted indexes on those columns to filter results quickly, and add text match indexes to validate the filtered rows and avoid false positives.
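
The filter-then-validate plan above can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, an in-memory map stands in for an inverted index on the derived ngram column, and `String.contains` stands in for the validation step. It is not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: n-gram candidate filtering followed by exact validation.
final class NgramFilterSketch {
  // Distinct substrings of exactly n characters, in first-seen order.
  static Set<String> ngrams(String value, int n) {
    Set<String> grams = new LinkedHashSet<>();
    for (int i = 0; n > 0 && i + n <= value.length(); i++) {
      grams.add(value.substring(i, i + n));
    }
    return grams;
  }

  // Toy inverted index: n-gram -> ids of the rows whose value contains it.
  static Map<String, Set<Integer>> buildIndex(List<String> rows, int n) {
    Map<String, Set<Integer>> index = new HashMap<>();
    for (int docId = 0; docId < rows.size(); docId++) {
      for (String gram : ngrams(rows.get(docId), n)) {
        index.computeIfAbsent(gram, k -> new TreeSet<>()).add(docId);
      }
    }
    return index;
  }

  // Step 1: intersect the posting lists of the query's n-grams (cheap, may over-match).
  // Step 2: check each candidate against the real predicate to drop false positives.
  static List<Integer> search(List<String> rows, Map<String, Set<Integer>> index, String query, int n) {
    Set<Integer> candidates = null;
    for (String gram : ngrams(query, n)) {
      Set<Integer> postings = index.getOrDefault(gram, Collections.emptySet());
      if (candidates == null) {
        candidates = new TreeSet<>(postings);
      } else {
        candidates.retainAll(postings);
      }
    }
    List<Integer> matches = new ArrayList<>();
    if (candidates == null) {
      return matches; // query shorter than n: not handled by this sketch
    }
    for (int docId : candidates) {
      if (rows.get(docId).contains(query)) {
        matches.add(docId);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList("database", "metadata", "dat-ata");
    Map<String, Set<Integer>> index = buildIndex(rows, 3);
    // "dat-ata" contains both "dat" and "ata", so it survives step 1 and is removed in step 2.
    System.out.println(search(rows, index, "data", 3)); // [0, 1]
  }
}
```

The third row shows why the validation step matters: it shares every 3-gram of the query without containing the query itself.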
   
   This patch adds functions that generate the prefixes, postfixes, and ngrams of a field.
   They can be used in the following transformation config:
   ```
          {
             "columnName": "col_prefix",
             "transformFunction": "prefix(col, 3, null)"
           },
          {
             "columnName": "col_prefix",
             "transformFunction": "suffix(col, 3, null)"
           },
          {
             "columnName": "col_prefix",
             "transformFunction": "ngram(col, 3)"
           },
          {
             "columnName": "col_prefix",
             "transformFunction": "ngram(col, 1, 3)"
           },
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1486770689


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'

Review Comment:
   The prefixes can be ['a', 'ab', 'abc'], or ['^a', '^ab', '^abc'] when a regexChar is given.
   The suffixes can be ['abc', 'bc', 'c'], or ['abc$', 'bc$', 'c$'] when a regexChar is given.
   The regexChar is a convenience that lets users match the generated values against regex-style expressions.
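
A minimal sketch of the behaviour described in this comment, assuming a null marker means "no marker". The class name, method names, and the `marker` parameter name are hypothetical. The suffixes come out shortest first, which follows the loop in the quoted diff rather than the order listed in the comment.

```java
import java.util.Arrays;

// Hypothetical sketch: prefixes and suffixes with an optional regex-style marker.
final class AnchoredAffixSketch {
  // Prefixes of length 1..maxLength, each optionally preceded by a marker such as "^".
  static String[] prefixes(String input, int maxLength, String marker) {
    int count = Math.max(0, Math.min(maxLength, input.length()));
    String[] result = new String[count];
    for (int len = 1; len <= count; len++) {
      String prefix = input.substring(0, len);
      result[len - 1] = marker == null ? prefix : marker + prefix;
    }
    return result;
  }

  // Suffixes of length 1..maxLength, each optionally followed by a marker such as "$".
  static String[] suffixes(String input, int maxLength, String marker) {
    int count = Math.max(0, Math.min(maxLength, input.length()));
    String[] result = new String[count];
    for (int len = 1; len <= count; len++) {
      String suffix = input.substring(input.length() - len);
      result[len - 1] = marker == null ? suffix : suffix + marker;
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(prefixes("abc", 3, null))); // [a, ab, abc]
    System.out.println(Arrays.toString(prefixes("abc", 3, "^")));  // [^a, ^ab, ^abc]
    System.out.println(Arrays.toString(suffixes("abc", 3, "$")));  // [c$, bc$, abc$]
  }
}
```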





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1566565835


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +586,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    ObjectList<String> prefixList = new ObjectArrayList<>();

Review Comment:
   `ObjectArrayList` is not really buying us anything here.
   Given we know the number of prefixes upfront, we can directly allocate the array



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -18,6 +18,10 @@
  */
 package org.apache.pinot.common.function.scalar;
 
+import it.unimi.dsi.fastutil.objects.ObjectArrayList;

Review Comment:
   No need to use fastutil here. We can use the default Java collections.



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) {
+    if (prefix == null) {

Review Comment:
   This is not addressed ^^
   Take a look at `ScalarFunction.class`. You need to annotate it as `nullableParameters`





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1486692865


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= length && prefixLength <= input.length(); prefixLength++) {
+      if (regexChar != null) {
+        prefixSet.add(regexChar + input.substring(0, prefixLength));
+      } else {
+        prefixSet.add(input.substring(0, prefixLength));
+      }
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param length the max length of the suffix strings for the string.
+   * @param regexChar the character for regex matching to be added to suffix strings generated. e.g. '$'
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffix(String input, int length, String regexChar) {
+    ObjectSet<String> suffixSet = new ObjectLinkedOpenHashSet<>();
+    for (int suffixLength = 1; suffixLength <= length && suffixLength <= input.length(); suffixLength++) {
+      if (regexChar != null) {
+        suffixSet.add(input.substring(input.length() - suffixLength) + regexChar);
+      } else {
+        suffixSet.add(input.substring(input.length() - suffixLength));
+      }
+    }
+    return suffixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param length the max length of the ngram for the string.
+   * @return generate an array of ngram of the string that length are exactly matching the specified length.
+   */
+  @ScalarFunction
+  public static String[] ngram(String input, int length) {

Review Comment:
   Suggest renaming this to `ngrams` to be consistent with CH: https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions#ngrams



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {

Review Comment:
   The name of the function may lead users to think that this is equivalent to `input.substring(0, arg)`. But instead this is returning all prefixes <= given length.
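
To make the naming concern concrete, here is a small sketch that contrasts the two readings. Both method names are hypothetical and chosen only for this illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: one prefix versus all prefixes up to a length.
final class PrefixNamingSketch {
  // What a reader might assume "prefix(input, n)" means: a single string.
  static String firstChars(String input, int n) {
    return input.substring(0, Math.max(0, Math.min(n, input.length())));
  }

  // What the function under review returns: every prefix of length 1..n.
  static String[] prefixesUpTo(String input, int n) {
    int count = Math.max(0, Math.min(n, input.length()));
    String[] result = new String[count];
    for (int len = 1; len <= count; len++) {
      result[len - 1] = input.substring(0, len);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(firstChars("pinot", 3));                     // pin
    System.out.println(Arrays.toString(prefixesUpTo("pinot", 3)));  // [p, pi, pin]
  }
}
```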



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'

Review Comment:
   What's the role of this?





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1572757028


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +584,111 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"prefix"})

Review Comment:
   Do you want to alias it to `prefix`? I don't think this is really `prefix`



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +584,111 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"prefix"})
+  public static String[] prefixesWithPrefix(String input, int maxlength, @Nullable String prefix) {
+    if (prefix == null) {
+      return prefixes(input, maxlength);
+    }
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = prefix + input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] suffixArr = new String[arrLength];
+    for (int suffixIdx = 1; suffixIdx <= arrLength; suffixIdx++) {
+      suffixArr[suffixIdx - 1] = input.substring(input.length() - suffixIdx);
+    }
+    return suffixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @param suffix the suffix string to be appended for suffix strings generated. e.g. '$' for regex matching.
+   * @return generate an array of suffix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"suffix"})
+  public static String[] suffixesWithSuffix(String input, int maxlength, @Nullable String suffix) {
+    if (suffix == null) {
+      return suffixes(input, maxlength);
+    }
+    int arrLength = Math.min(maxlength, input.length());
+    String[] suffixArr = new String[arrLength];
+    for (int suffixIdx = 1; suffixIdx <= arrLength; suffixIdx++) {
+      suffixArr[suffixIdx - 1] = input.substring(input.length() - suffixIdx) + suffix;
+    }
+    return suffixArr;
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param length the max length of the ngram for the string.
+   * @return generate an array of unique ngram of the string that length are exactly matching the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniqueNgrams(String input, int length) {
+    if (length == 0 || length > input.length()) {
+      return new String[0];
+    }
+    ObjectSet<String> ngramSet = new ObjectLinkedOpenHashSet<>();
+    for (int i = 0; i < input.length() - length + 1; i++) {
+      ngramSet.add(input.substring(i, i + length));
+    }
+    return ngramSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param minGram the min length of the ngram for the string.
+   * @param maxGram the max length of the ngram for the string.
+   * @return generate an array of ngram of the string that length are within the specified range [minGram, maxGram].
+   */
+  @ScalarFunction
+  public static String[] uniqueNgrams(String input, int minGram, int maxGram) {
+    ObjectSet<String> ngramSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   Same here



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +584,111 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"prefix"})
+  public static String[] prefixesWithPrefix(String input, int maxlength, @Nullable String prefix) {
+    if (prefix == null) {
+      return prefixes(input, maxlength);
+    }
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = prefix + input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] suffixArr = new String[arrLength];
+    for (int suffixIdx = 1; suffixIdx <= arrLength; suffixIdx++) {
+      suffixArr[suffixIdx - 1] = input.substring(input.length() - suffixIdx);
+    }
+    return suffixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @param suffix the suffix string to be appended for suffix strings generated. e.g. '$' for regex matching.
+   * @return generate an array of suffix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"suffix"})

Review Comment:
   Same here





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1486773708


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {

Review Comment:
   I think the idea here is valid. I can add:
   prefix(String input, int length, String regexChar) for exact-length prefixes
   prefix(String input, int minLength, int maxLength, String regexChar) for prefixes within a min/max length range
   
   This won't be equivalent to `input.substring(0, arg)`, but it will have the same style as `ngrams`.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1501214571


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixMatchers(String input, int maxlength, String regexChar) {
+    if (regexChar == null) {
+      return prefixes(input, maxlength);
+    }
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(regexChar + input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixes(String input, int maxlength) {
+    ObjectSet<String> suffixSet = new ObjectLinkedOpenHashSet<>();
+    for (int suffixLength = 1; suffixLength <= maxlength && suffixLength <= input.length(); suffixLength++) {
+      suffixSet.add(input.substring(input.length() - suffixLength));
+    }
+    return suffixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @param regexChar the character for regex matching to be added to suffix strings generated. e.g. '$'
+   * @return generate an array of suffix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixMatchers(String input, int maxlength, String regexChar) {
+    if (regexChar == null) {
+      return suffixes(input, maxlength);
+    }
+    ObjectSet<String> suffixSet = new ObjectLinkedOpenHashSet<>();
+    for (int suffixLength = 1; suffixLength <= maxlength && suffixLength <= input.length(); suffixLength++) {
+      suffixSet.add(input.substring(input.length() - suffixLength) + regexChar);
+    }
+    return suffixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param length the max length of the ngram for the string.
+   * @return generate an array of ngram of the string that length are exactly matching the specified length.
+   */
+  @ScalarFunction
+  public static String[] ngrams(String input, int length) {

Review Comment:
   similar comment as above: can we rename this to `ngramsUnique`?





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1501214269


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixMatchers(String input, int maxlength, String regexChar) {

Review Comment:
   Sorry, I didn't understand what `regexChar` is for. We are simply prepending the character to the strings, right?
   
   Can we then name this method: `uniquePrefixesWithPrefix(String input, int maxLength, String prefix)` or similar?
   
   My point being that this function itself is not tied to the fact that we are going to use this for regex matching purposes.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1515232533


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction

Review Comment:
   You want to add alias `unique_prefixes`, same for other functions



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction

Review Comment:
   Do we need `unique` though? The prefixes will always be unique because they all have different lengths.



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) {
+    if (prefix == null) {

Review Comment:
   To accept `null`, set `nullableParameters = true` on the `@ScalarFunction` annotation. Please also annotate the parameter as `@Nullable`.



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   We don't need a set since all prefixes have different lengths. Same for the other functions.
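
To illustrate the point, here is a minimal sketch, assuming a hypothetical standalone class `PrefixSketch` that is not part of the PR. Each prefix has a different length, so no two can be equal even when the input repeats a character, and a plain array is enough.

```java
import java.util.Arrays;

public class PrefixSketch {
  // Returns the prefixes of input with lengths 1..min(maxLength, input.length()).
  // Every element has a different length, so duplicates cannot occur.
  public static String[] prefixes(String input, int maxLength) {
    int n = Math.max(0, Math.min(maxLength, input.length()));
    String[] result = new String[n];
    for (int len = 1; len <= n; len++) {
      result[len - 1] = input.substring(0, len);
    }
    return result;
  }

  public static void main(String[] args) {
    // Repeated characters still yield distinct prefixes.
    System.out.println(Arrays.toString(prefixes("aaaa", 3)));
  }
}
```

The same reasoning applies to suffixes, which also differ in length.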





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #12392:
URL: https://github.com/apache/pinot/pull/12392#issuecomment-1936688829

   ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/12392?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   Attention: `25 lines` in your changes are missing coverage. Please review.
   > Comparison is base [(`d501478`)](https://app.codecov.io/gh/apache/pinot/commit/d5014786dbd364e65a0fbd9596c4e59830de1bf9?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.67% compared to head [(`a8b6202`)](https://app.codecov.io/gh/apache/pinot/pull/12392?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 0.00%.
   > Report is 33 commits behind head on master.
   
   | [Files](https://app.codecov.io/gh/apache/pinot/pull/12392?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [.../pinot/common/function/scalar/StringFunctions.java](https://app.codecov.io/gh/apache/pinot/pull/12392?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9jb21tb24vZnVuY3Rpb24vc2NhbGFyL1N0cmluZ0Z1bmN0aW9ucy5qYXZh) | 0.00% | [25 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12392?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   <details><summary>Additional details and impacted files</summary>
   
   
   ```diff
   @@              Coverage Diff              @@
   ##             master   #12392       +/-   ##
   =============================================
   - Coverage     61.67%    0.00%   -61.68%     
   =============================================
     Files          2422     2352       -70     
     Lines        132148   129103     -3045     
     Branches      20385    19995      -390     
   =============================================
   - Hits          81502        0    -81502     
   - Misses        44660   129103    +84443     
   + Partials       5986        0     -5986     
   ```
   
   | [Flag](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [integration](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-0.01%)` | :arrow_down: |
   | [integration1](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [integration2](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (ø)` | |
   | [java-11](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [java-21](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.57%)` | :arrow_down: |
   | [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.54%)` | :arrow_down: |
   | [temurin](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-61.68%)` | :arrow_down: |
   | [unittests](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [unittests1](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [unittests2](https://app.codecov.io/gh/apache/pinot/pull/12392/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   
   </details>
   
   [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/pinot/pull/12392?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).   
   




Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1567926498


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -18,6 +18,10 @@
  */
 package org.apache.pinot.common.function.scalar;
 
+import it.unimi.dsi.fastutil.objects.ObjectArrayList;

Review Comment:
   updated





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1486834300


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {

Review Comment:
   If we are returning multiple prefixes/suffixes, then let's rename this to `prefixes/suffixes`. `prefix/suffix` should be reserved in case we need to add `input.substring(0, arg)` equivalent later.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang merged PR #12392:
URL: https://github.com/apache/pinot/pull/12392




Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1576631085


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +584,111 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"prefix"})
+  public static String[] prefixesWithPrefix(String input, int maxlength, @Nullable String prefix) {
+    if (prefix == null) {
+      return prefixes(input, maxlength);
+    }
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = prefix + input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] suffixArr = new String[arrLength];
+    for (int suffixIdx = 1; suffixIdx <= arrLength; suffixIdx++) {
+      suffixArr[suffixIdx - 1] = input.substring(input.length() - suffixIdx);
+    }
+    return suffixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @param suffix the suffix string to be appended for suffix strings generated. e.g. '$' for regex matching.
+   * @return generate an array of suffix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"suffix"})

Review Comment:
   updated, thanks @Jackie-Jiang 





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1487298764


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {

Review Comment:
   sure 
   ```
   prefixes(String input, int minLength, int maxLength, String regexChar) for minMax length prefix
   prefix(String input, int length, String regexChar) for exact length prefix 
   ```





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1487311632


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'

Review Comment:
   Hmm, the idea here is to support use cases that need to union the prefixes, suffixes, and ngrams into one column, so that the index size is reduced and the index fits into memory more easily.
   I feel that generating prefix matchers and suffix matchers is a relatively common use case, so I added the regex character as an optional parameter.
   
   we can do the following 
   ```
   prefixes(String input, int minLength, int maxLength)
   prefixMatchers(String input, int minLength, int maxLength, String regexChar)
   ```
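
A minimal sketch of the single-column idea described above, assuming a hypothetical helper `unionTokens` and the anchor characters `^` and `$` (the class name, method name, and parameters are illustrative, not the PR's API). Anchoring keeps a prefix token such as `^da` distinct from the plain ngram `da` when all tokens share one column.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class UnionTokensSketch {
  // Hypothetical helper: merges anchored prefixes, anchored suffixes, and plain
  // ngrams of one value into a single token array for one multi-value column.
  public static String[] unionTokens(String input, int maxLength, int gram) {
    Set<String> tokens = new LinkedHashSet<>();
    int n = Math.max(0, Math.min(maxLength, input.length()));
    // Prefixes carry a leading '^' so they cannot collide with ngrams.
    for (int len = 1; len <= n; len++) {
      tokens.add("^" + input.substring(0, len));
    }
    // Suffixes carry a trailing '$' for the same reason.
    for (int len = 1; len <= n; len++) {
      tokens.add(input.substring(input.length() - len) + "$");
    }
    // Plain ngrams of exactly `gram` characters.
    if (gram > 0) {
      for (int i = 0; i + gram <= input.length(); i++) {
        tokens.add(input.substring(i, i + gram));
      }
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(unionTokens("data", 2, 2)));
  }
}
```

With this layout, a lookup for `^da` would match only values that start with `da`, while a lookup for `da` would match values that contain it anywhere.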
   
   
   





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on PR #12392:
URL: https://github.com/apache/pinot/pull/12392#issuecomment-1981564946

   > Are there equivalent/similar functions in other commonly used DBs (e.g. PostgreSQL)? We should try to match the behavior
   
   I think PostgreSQL provides a trigram (3-gram) module called `pg_trgm`, where `pg` stands for PostgreSQL.
   It includes functions such as:
   `show_trgm (text)`
   
   Do you think I should rename the function similarly?




Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1487319902


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   This is a great point, but I think a LinkedHashSet preserves insertion order, so the items keep the order in which they were added.
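
A small sketch of the ordering claim, using the JDK's `java.util.LinkedHashSet` as a stand-in for the fastutil `ObjectLinkedOpenHashSet` that appears in the diff. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class InsertionOrderSketch {
  // Adds items to a LinkedHashSet and returns them in iteration order.
  // Repeated items are ignored; the first occurrence fixes the position.
  public static String[] dedupKeepingOrder(String... items) {
    Set<String> seen = new LinkedHashSet<>();
    for (String item : items) {
      seen.add(item);
    }
    return seen.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(dedupKeepingOrder("dat", "d", "da", "d")));
  }
}
```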





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1501215617


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   `uniquePrefixes` sounds good. let me rename it.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1567925821


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +586,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    ObjectList<String> prefixList = new ObjectArrayList<>();

Review Comment:
   Replaced the `ObjectArrayList` with a fixed-size array. Thanks.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1576618558


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -581,6 +584,111 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"prefix"})
+  public static String[] prefixesWithPrefix(String input, int maxlength, @Nullable String prefix) {
+    if (prefix == null) {
+      return prefixes(input, maxlength);
+    }
+    int arrLength = Math.min(maxlength, input.length());
+    String[] prefixArr = new String[arrLength];
+    for (int prefixIdx = 1; prefixIdx <= arrLength; prefixIdx++) {
+      prefixArr[prefixIdx - 1] = prefix + input.substring(0, prefixIdx);
+    }
+    return prefixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixes(String input, int maxlength) {
+    int arrLength = Math.min(maxlength, input.length());
+    String[] suffixArr = new String[arrLength];
+    for (int suffixIdx = 1; suffixIdx <= arrLength; suffixIdx++) {
+      suffixArr[suffixIdx - 1] = input.substring(input.length() - suffixIdx);
+    }
+    return suffixArr;
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @param suffix the suffix string to be appended for suffix strings generated. e.g. '$' for regex matching.
+   * @return generate an array of suffix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction(nullableParameters = true, names = {"suffix"})
+  public static String[] suffixesWithSuffix(String input, int maxlength, @Nullable String suffix) {
+    if (suffix == null) {
+      return suffixes(input, maxlength);
+    }
+    int arrLength = Math.min(maxlength, input.length());
+    String[] suffixArr = new String[arrLength];
+    for (int suffixIdx = 1; suffixIdx <= arrLength; suffixIdx++) {
+      suffixArr[suffixIdx - 1] = input.substring(input.length() - suffixIdx) + suffix;
+    }
+    return suffixArr;
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param length the max length of the ngram for the string.
+   * @return generate an array of unique ngram of the string that length are exactly matching the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniqueNgrams(String input, int length) {
+    if (length == 0 || length > input.length()) {
+      return new String[0];
+    }
+    ObjectSet<String> ngramSet = new ObjectLinkedOpenHashSet<>();
+    for (int i = 0; i < input.length() - length + 1; i++) {
+      ngramSet.add(input.substring(i, i + length));
+    }
+    return ngramSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param minGram the min length of the ngram for the string.
+   * @param maxGram the max length of the ngram for the string.
+   * @return generate an array of ngram of the string that length are within the specified range [minGram, maxGram].
+   */
+  @ScalarFunction
+  public static String[] uniqueNgrams(String input, int minGram, int maxGram) {
+    ObjectSet<String> ngramSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   Hi @Jackie-Jiang, ngrams are not guaranteed to be unique, right? The Set is used here to remove duplicates.
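
Unlike prefixes, ngrams of a fixed length can repeat within one string. A minimal sketch, assuming a hypothetical class `NgramSketch` and the JDK's `LinkedHashSet` in place of the fastutil set used in the diff:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class NgramSketch {
  // All ngrams of exactly n characters, in order of appearance, duplicates kept.
  public static List<String> ngrams(String input, int n) {
    List<String> result = new ArrayList<>();
    if (n <= 0 || n > input.length()) {
      return result;
    }
    for (int i = 0; i + n <= input.length(); i++) {
      result.add(input.substring(i, i + n));
    }
    return result;
  }

  // The same ngrams with duplicates removed, first occurrence order preserved.
  public static String[] uniqueNgrams(String input, int n) {
    return new LinkedHashSet<>(ngrams(input, n)).toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(ngrams("banana", 2));
    System.out.println(String.join(",", uniqueNgrams("banana", 2)));
  }
}
```

For `banana` with `n = 2`, the raw list contains `an` and `na` twice each, which is why a set (or another form of deduplication) is needed for ngrams but not for prefixes or suffixes.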





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1488656798


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {

Review Comment:
   Actually, `prefix` and `suffix` for exact-length matches won't be needed, since the existing substr() function can be used directly.
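
As a rough illustration of the exact-length case, the sketch below uses plain `String.substring` rather than Pinot's own substr() function, whose exact signature is not shown in this thread. The class and method names are hypothetical, and clamping to the input length is an assumption.

```java
// Hypothetical helper class; not part of Pinot.
final class ExactAffixSketch {

  // The first `length` characters, or the whole string if it is shorter.
  // Assumes length >= 0.
  static String exactPrefix(String input, int length) {
    return input.substring(0, Math.min(length, input.length()));
  }

  // The last `length` characters, or the whole string if it is shorter.
  // Assumes length >= 0.
  static String exactSuffix(String input, int length) {
    return input.substring(input.length() - Math.min(length, input.length()));
  }

  public static void main(String[] args) {
    System.out.println(exactPrefix("database", 3));
    System.out.println(exactSuffix("database", 3));
  }
}
```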





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1501241895


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixMatchers(String input, int maxlength, String regexChar) {

Review Comment:
   Sounds good. Regex matching is only one use of this function; there are other cases, such as key-value pair matching.



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefixMatchers(String input, int maxlength, String regexChar) {
+    if (regexChar == null) {
+      return prefixes(input, maxlength);
+    }
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(regexChar + input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixes(String input, int maxlength) {
+    ObjectSet<String> suffixSet = new ObjectLinkedOpenHashSet<>();
+    for (int suffixLength = 1; suffixLength <= maxlength && suffixLength <= input.length(); suffixLength++) {
+      suffixSet.add(input.substring(input.length() - suffixLength));
+    }
+    return suffixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param maxlength the max length of the suffix strings for the string.
+   * @param regexChar the character for regex matching to be added to suffix strings generated. e.g. '$'
+   * @return generate an array of suffix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffixMatchers(String input, int maxlength, String regexChar) {
+    if (regexChar == null) {
+      return suffixes(input, maxlength);
+    }
+    ObjectSet<String> suffixSet = new ObjectLinkedOpenHashSet<>();
+    for (int suffixLength = 1; suffixLength <= maxlength && suffixLength <= input.length(); suffixLength++) {
+      suffixSet.add(input.substring(input.length() - suffixLength) + regexChar);
+    }
+    return suffixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param length the exact length of each ngram generated from the string.
+   * @return an array of the ngrams of the string whose length exactly matches the specified length.
+   */
+  @ScalarFunction
+  public static String[] ngrams(String input, int length) {

Review Comment:
   Sounds good.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1566496110


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   updated, thanks





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1567925396


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) {
+    if (prefix == null) {

Review Comment:
   Thanks @Jackie-Jiang for the pointer. Updated.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1486774035


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= length && prefixLength <= input.length(); prefixLength++) {
+      if (regexChar != null) {
+        prefixSet.add(regexChar + input.substring(0, prefixLength));
+      } else {
+        prefixSet.add(input.substring(0, prefixLength));
+      }
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for suffix strings generations.
+   * @param length the max length of the suffix strings for the string.
+   * @param regexChar the character for regex matching to be added to suffix strings generated. e.g. '$'
+   * @return generate an array of suffix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] suffix(String input, int length, String regexChar) {
+    ObjectSet<String> suffixSet = new ObjectLinkedOpenHashSet<>();
+    for (int suffixLength = 1; suffixLength <= length && suffixLength <= input.length(); suffixLength++) {
+      if (regexChar != null) {
+        suffixSet.add(input.substring(input.length() - suffixLength) + regexChar);
+      } else {
+        suffixSet.add(input.substring(input.length() - suffixLength));
+      }
+    }
+    return suffixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for ngram generations.
+   * @param length the exact length of each ngram generated from the string.
+   * @return an array of the ngrams of the string whose length exactly matches the specified length.
+   */
+  @ScalarFunction
+  public static String[] ngram(String input, int length) {

Review Comment:
   Thanks for the review; this sounds good to me. I will rename it.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1487202421


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   Missed this in the first pass. Can we return a list instead since that seems more fitting for `prefixes`?
   
   For the set version we could create `uniquePrefixes, uniqueSuffixes` or something. Moreover we should specify how the set will be ordered since otherwise the replicas of consuming segments will diverge.
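
A minimal sketch of the ordering concern, assuming an insertion-ordered set is acceptable: `java.util.LinkedHashSet` stands in here for fastutil's `ObjectLinkedOpenHashSet`, and the class name is hypothetical. Because elements are returned in the order they were added, every replica that processes the same input produces the same array.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class; not part of Pinot.
final class OrderedPrefixSketch {

  // Prefixes of length 1..maxLength, returned shortest first.
  static String[] uniquePrefixes(String input, int maxLength) {
    // LinkedHashSet iterates in insertion order, so the output order
    // depends only on the input, not on hash values.
    Set<String> prefixes = new LinkedHashSet<>();
    for (int len = 1; len <= maxLength && len <= input.length(); len++) {
      prefixes.add(input.substring(0, len));
    }
    return prefixes.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(String.join(",", uniquePrefixes("data", 3)));
  }
}
```

A plain `java.util.HashSet` would not give this guarantee, since its contract leaves iteration order unspecified.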





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1487311632


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'

Review Comment:
   Hmm, the idea here is to support use cases that need to union the prefix, suffix, and ngram columns into one column, so that the index size is reduced and fits into memory more easily.
   I feel that generating prefix matchers and suffix matchers is a relatively common use case, so I added the regex character as an optional parameter.
   
   We can also do the following:
   ```
   prefix(String input, int length)
   prefixMatcher(String input, int length, String regexChar)
   
   prefixes(String input, int minLength, int maxLength)
   prefixMatchers(String input, int minLength, int maxLength, String regexChar)
   ```
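
To illustrate why a marker character helps when prefixes, suffixes, and ngrams share one column, here is a hedged sketch. The class and method names are hypothetical, and the choice of '^' and '$' simply follows the examples in the Javadoc above. Without the markers, the prefix "da" and the ngram "da" would be indistinguishable entries.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class; not part of Pinot.
final class CombinedMatcherSketch {

  // Prefix matchers, suffix matchers, and ngrams of one value in a single array.
  static String[] combinedTerms(String input, int maxLength) {
    Set<String> terms = new LinkedHashSet<>();
    if (maxLength <= 0) {
      return new String[0];
    }
    int limit = Math.min(maxLength, input.length());
    for (int len = 1; len <= limit; len++) {
      terms.add("^" + input.substring(0, len)); // prefix matcher
    }
    for (int len = 1; len <= limit; len++) {
      terms.add(input.substring(input.length() - len) + "$"); // suffix matcher
    }
    for (int i = 0; i + maxLength <= input.length(); i++) {
      terms.add(input.substring(i, i + maxLength)); // ngram of exactly maxLength
    }
    return terms.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(String.join(",", combinedTerms("data", 2)));
  }
}
```

Under this scheme, a filter for values starting with "da" would look up "^da", and one for values ending with "ta" would look up "ta$".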
   
   
   





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1487298764


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {

Review Comment:
   Sure. Actually, I will have prefix() cover the exact-length case:
   ```
   prefixes(String input, int minLength, int maxLength, String regexChar) for minMax length prefix
   prefix(String input, int length, String regexChar) for exact length prefix 
   ```
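
One way the two proposed signatures could relate is sketched below. This is an assumption about the intended behavior, not the PR's implementation: the exact-length variant delegates to the range variant, a null `regexChar` means no marker, and an input shorter than the requested length yields an empty array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Pinot.
final class PrefixRangeSketch {

  // Prefixes whose length lies in [minLength, maxLength], shortest first.
  static String[] prefixes(String input, int minLength, int maxLength, String regexChar) {
    List<String> result = new ArrayList<>();
    String marker = regexChar == null ? "" : regexChar;
    for (int len = Math.max(1, minLength); len <= maxLength && len <= input.length(); len++) {
      result.add(marker + input.substring(0, len));
    }
    return result.toArray(new String[0]);
  }

  // The single prefix of exactly `length` characters, if the input is long enough.
  static String[] prefix(String input, int length, String regexChar) {
    return prefixes(input, length, length, regexChar);
  }

  public static void main(String[] args) {
    System.out.println(String.join(",", prefixes("data", 2, 3, null)));
    System.out.println(String.join(",", prefix("data", 3, "^")));
  }
}
```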





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1486839293


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'

Review Comment:
   In my opinion, it doesn't make sense to keep this in the OSS functions registry since this is too use-case specific. Can't the caller adjust their queries as required?





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1501211234


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,81 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param length the max length of the prefix strings for the string.
+   * @param regexChar the character for regex matching to be added to prefix strings generated. e.g. '^'
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] prefix(String input, int length, String regexChar) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();

Review Comment:
   I see. But it will only return the unique prefixes. Can we rename the method to `uniquePrefixes` or similar to be explicit about this?
   
   `prefixes` suggests that we will enumerate all prefixes.





Re: [PR] Add Prefix, Suffix and Ngram UDFs [pinot]

Posted by "deemoliu (via GitHub)" <gi...@apache.org>.
deemoliu commented on code in PR #12392:
URL: https://github.com/apache/pinot/pull/12392#discussion_r1566393961


##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction

Review Comment:
   Sounds good. I think the reason for `unique_prefixes` was to reserve `prefixes` for other purposes or implementations. If there is no objection, let me use `prefixes()` then.



##########
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java:
##########
@@ -570,6 +572,107 @@ public static String[] split(String input, String delimiter, int limit) {
     return StringUtils.splitByWholeSeparator(input, delimiter, limit);
   }
 
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @return generate an array of prefix strings of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixes(String input, int maxlength) {
+    ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
+    for (int prefixLength = 1; prefixLength <= maxlength && prefixLength <= input.length(); prefixLength++) {
+      prefixSet.add(input.substring(0, prefixLength));
+    }
+    return prefixSet.toArray(new String[0]);
+  }
+
+  /**
+   * @param input an input string for prefix strings generations.
+   * @param maxlength the max length of the prefix strings for the string.
+   * @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
+   * @return generate an array of prefix matchers of the string that are shorter than the specified length.
+   */
+  @ScalarFunction
+  public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) {
+    if (prefix == null) {

Review Comment:
   updated, thanks


