You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "sabi0 (via GitHub)" <gi...@apache.org> on 2023/12/24 22:30:51 UTC

[PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

sabi0 opened a new pull request, #12973:
URL: https://github.com/apache/lucene/pull/12973

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss merged PR #12973:
URL: https://github.com/apache/lucene/pull/12973


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1445345012


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Thank you for investigating. I think javac and javadoc should be consistent here - if they're not, it's worth firing a message to openjdk...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "sabi0 (via GitHub)" <gi...@apache.org>.

sabi0 commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444504974


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Thank you for the link. I did not know about the `\uu...` either.
   
   Unfortunately, javadoc seems to swallow all of those 'u's anyway:
   ```
   <div class="block">Returns a String where the escape char has been removed, or kept only once if there was a
    double escape.
   
    <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.</div>
   ```
   
   The `{@code ...}` markup works the same:
   ```
   <code>\u0041</code>   => A
   <code>\uu0041</code>  => A
   <code>\\u0041</code>  => \\u0041
   
   {@code \u0041}   => A
   {@code \uu0041}  => A
   {@code \\u0041}  => \\u0041
   ```
   
   JDK Javadoc uses Unicode escape for the backslash itself: `{@code \u005Cu0800}`:
   https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/DataInput.java#L116



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "sabi0 (via GitHub)" <gi...@apache.org>.

sabi0 commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444477534


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -40,105 +40,109 @@ public class EscapeQuerySyntaxImpl implements EscapeQuerySyntax {
     "AND", "OR", "NOT", "TO", "WITHIN", "SENTENCE", "PARAGRAPH", "INORDER"
   };
 
-  private static final CharSequence escapeChar(CharSequence str, Locale locale) {
-    if (str == null || str.length() == 0) return str;
+  private static CharSequence escapeChar(CharSequence str, Locale locale) {
+    if (str == null || str.isEmpty()) return str;
 
     CharSequence buffer = str;
 
-    // regular escapable Char for terms
-    for (int i = 0; i < escapableTermChars.length; i++) {
-      buffer = replaceIgnoreCase(buffer, escapableTermChars[i].toLowerCase(locale), "\\", locale);
+    // regular escapable char for terms
+    for (String escapableTermChar : escapableTermChars) {
+      buffer = escapeIgnoringCase(buffer, escapableTermChar.toLowerCase(locale), "\\", locale);
     }
 
-    // First Character of a term as more escaping chars
-    for (int i = 0; i < escapableTermExtraFirstChars.length; i++) {
-      if (buffer.charAt(0) == escapableTermExtraFirstChars[i].charAt(0)) {
-        buffer = "\\" + buffer.charAt(0) + buffer.subSequence(1, buffer.length());
+    // first char of a term as more escaping chars
+    for (String escapableTermExtraFirstChar : escapableTermExtraFirstChars) {
+      if (buffer.charAt(0) == escapableTermExtraFirstChar.charAt(0)) {
+        buffer = "\\" + buffer;
         break;
       }
     }
 
     return buffer;
   }
 
-  private final CharSequence escapeQuoted(CharSequence str, Locale locale) {
-    if (str == null || str.length() == 0) return str;
+  private static CharSequence escapeQuoted(CharSequence str, Locale locale) {
+    if (str == null || str.isEmpty()) return str;
 
     CharSequence buffer = str;
 
-    for (int i = 0; i < escapableQuotedChars.length; i++) {
-      buffer = replaceIgnoreCase(buffer, escapableTermChars[i].toLowerCase(locale), "\\", locale);
+    for (String escapableQuotedChar : escapableQuotedChars) {
+      buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
     }
     return buffer;
   }
 
-  private static final CharSequence escapeTerm(CharSequence term, Locale locale) {
-    if (term == null) return term;
+  private static CharSequence escapeTerm(CharSequence term, Locale locale) {
+    if (term == null || term.isEmpty()) return term;
 
-    // Escape single Chars
+    // escape single chars
     term = escapeChar(term, locale);
     term = escapeWhiteChar(term, locale);
 
-    // Escape Parser Words
-    for (int i = 0; i < escapableWordTokens.length; i++) {
-      if (escapableWordTokens[i].equalsIgnoreCase(term.toString())) return "\\" + term;
+    // escape parser words
+    for (String escapableWordToken : escapableWordTokens) {
+      if (escapableWordToken.equalsIgnoreCase(term.toString())) return "\\" + term;
     }
     return term;
   }
 
   /**
-   * replace with ignore case
+   * Prepend every case-insensitive occurrence of the {@code sequence1} in the {@code string} with
+   * the {@code escapeChar}. When the {@code sequence1} is empty, every character in the {@code
+   * string} is escaped.
    *
-   * @param string string to get replaced
+   * @param string string to apply escaping to
    * @param sequence1 the old character sequence in lowercase
-   * @param escapeChar the new character to prefix sequence1 in return string.
-   * @return the new String
+   * @param escapeChar the escape character to prefix sequence1 in the returned string
+   * @return CharSequence with every occurrence of {@code sequence1} prepended with {@code
+   *     escapeChar}
    */
-  private static CharSequence replaceIgnoreCase(
+  private static CharSequence escapeIgnoringCase(
       CharSequence string, CharSequence sequence1, CharSequence escapeChar, Locale locale) {
     if (escapeChar == null || sequence1 == null || string == null) throw new NullPointerException();
 
-    // empty string case
     int count = string.length();
     int sequence1Length = sequence1.length();
+
+    // empty search string - escape every character
     if (sequence1Length == 0) {
-      StringBuilder result = new StringBuilder((count + 1) * escapeChar.length());
-      result.append(escapeChar);
+      StringBuilder result = new StringBuilder(count * (1 + escapeChar.length()));
       for (int i = 0; i < count; i++) {
-        result.append(string.charAt(i));

Review Comment:
   The `escapeIgnoringCase` method is `private`. It is called in three places, all looking like this:
   ```
       for (String escapableQuotedChar : escapableQuotedChars) {
         buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
       }
   ```
   
   I.e. the input for the search string `sequence1` parameter is always controlled and is never an empty string:
   ```
     private static final String[] escapableTermChars = {
       "\"", "<", ">", "=", "!", "(", ")", "^", "[", "{", ":", "]", "}", "~", "/"
     };
   
     private static final String[] escapableQuotedChars = {"\""};
   
     private static final String[] escapableWhiteChars = {" ", "\t", "\n", "\r", "\f", "\b", "\u3000"};
   ```
   (unless some weird locale drops one of those characters when converting to lower case)
   
   I wonder if this whole "empty search string" block should be replaced with an `IllegalArgumentException`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444446785


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Can you check if the newer ```{@code xxx}``` markup works correctly here? Sorry, I'm busy at the moment. Also, I would not worry about intellij - if javadoc produces valid output, intellij has a bug (feel free to report it!). Escape sequences may be translated very early by javac lexer - I'm sure there is a way to escape them though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444450852


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   I think this gives you a hint on how to escape it in the source -
   https://stackoverflow.com/questions/21522770/unicode-escape-syntax-in-java
   
   Interesting, I didn't know about it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "sabi0 (via GitHub)" <gi...@apache.org>.

sabi0 commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444504974


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Thank you for the link. I did not know about the `\uu...` either.
   
   Unfortunately, javadoc seems to swallow all of those 'u's anyway:
   ```
   <div class="block">Returns a String where the escape char has been removed, or kept only once if there was a
    double escape.
   
    <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.</div>
   ```
   
   The `{@code ...}` markup works the same:
   ```
   <code>\u0041</code>   => A
   <code>\uu0041</code>  => A
   <code>\\u0041</code>  => \\u0041
   
   {@code \u0041}   => A
   {@code \uu0041}  => A
   {@code \\u0041}  => \\u0041
   ```
   
   JDK Javadoc uses Unicode escape for the slash itself: `{@code \u005Cu0800}`:
   https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/DataInput.java#L116



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444446785


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Can you check if the newer {@code xxx} markup works correctly here? Sorry, I'm busy at the moment. Also, I would not worry about intellij - if javadoc produces valid output, intellij has a bug (feel free to report it!). Escape sequences may be translated very early by javac lexer - I'm sure there is a way to escape them though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "sabi0 (via GitHub)" <gi...@apache.org>.

sabi0 commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1444441897


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Probably `QueryParser.jj` never rendered this part correctly:
   https://github.com/sabi0/lucene/blob/343992fcbb4b31249f07354014723f18d0508d8a/src/java/org/apache/lucene/queryParser/QueryParser.jj#L1071
   
   And then it got "solidified" with LUCENE-1567:
   https://github.com/sabi0/lucene/commit/343992fcbb4b31249f07354014723f18d0508d8a#diff-bc1f2f880b43ac551a81b97846a7e7e9119f13de581f6564a35794fa0c78ab36R212
   
   There is another occurrence of this text in the `QueryParserBase` which tries to "fix" the problem:
   ```
      * <p>Supports escaped unicode characters, e. g. translates <code>\\u0041</code> to <code>A</code>
   ```
   
   But, for instance, IntelliJ still renders that Javadoc wrongly.
   Unicode escape-sequence has higher precedence than character escaping?
   
   What do you think of fixing it like this:
   ```
      * <p>Supports escaped Unicode characters, e.g. translates \<code>u0041</code> to <code>A</code>.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1445347859


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -40,105 +40,109 @@ public class EscapeQuerySyntaxImpl implements EscapeQuerySyntax {
     "AND", "OR", "NOT", "TO", "WITHIN", "SENTENCE", "PARAGRAPH", "INORDER"
   };
 
-  private static final CharSequence escapeChar(CharSequence str, Locale locale) {
-    if (str == null || str.length() == 0) return str;
+  private static CharSequence escapeChar(CharSequence str, Locale locale) {
+    if (str == null || str.isEmpty()) return str;
 
     CharSequence buffer = str;
 
-    // regular escapable Char for terms
-    for (int i = 0; i < escapableTermChars.length; i++) {
-      buffer = replaceIgnoreCase(buffer, escapableTermChars[i].toLowerCase(locale), "\\", locale);
+    // regular escapable char for terms
+    for (String escapableTermChar : escapableTermChars) {
+      buffer = escapeIgnoringCase(buffer, escapableTermChar.toLowerCase(locale), "\\", locale);
     }
 
-    // First Character of a term as more escaping chars
-    for (int i = 0; i < escapableTermExtraFirstChars.length; i++) {
-      if (buffer.charAt(0) == escapableTermExtraFirstChars[i].charAt(0)) {
-        buffer = "\\" + buffer.charAt(0) + buffer.subSequence(1, buffer.length());
+    // first char of a term as more escaping chars
+    for (String escapableTermExtraFirstChar : escapableTermExtraFirstChars) {
+      if (buffer.charAt(0) == escapableTermExtraFirstChar.charAt(0)) {
+        buffer = "\\" + buffer;
         break;
       }
     }
 
     return buffer;
   }
 
-  private final CharSequence escapeQuoted(CharSequence str, Locale locale) {
-    if (str == null || str.length() == 0) return str;
+  private static CharSequence escapeQuoted(CharSequence str, Locale locale) {
+    if (str == null || str.isEmpty()) return str;
 
     CharSequence buffer = str;
 
-    for (int i = 0; i < escapableQuotedChars.length; i++) {
-      buffer = replaceIgnoreCase(buffer, escapableTermChars[i].toLowerCase(locale), "\\", locale);
+    for (String escapableQuotedChar : escapableQuotedChars) {
+      buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
     }
     return buffer;
   }
 
-  private static final CharSequence escapeTerm(CharSequence term, Locale locale) {
-    if (term == null) return term;
+  private static CharSequence escapeTerm(CharSequence term, Locale locale) {
+    if (term == null || term.isEmpty()) return term;
 
-    // Escape single Chars
+    // escape single chars
     term = escapeChar(term, locale);
     term = escapeWhiteChar(term, locale);
 
-    // Escape Parser Words
-    for (int i = 0; i < escapableWordTokens.length; i++) {
-      if (escapableWordTokens[i].equalsIgnoreCase(term.toString())) return "\\" + term;
+    // escape parser words
+    for (String escapableWordToken : escapableWordTokens) {
+      if (escapableWordToken.equalsIgnoreCase(term.toString())) return "\\" + term;
     }
     return term;
   }
 
   /**
-   * replace with ignore case
+   * Prepend every case-insensitive occurrence of the {@code sequence1} in the {@code string} with
+   * the {@code escapeChar}. When the {@code sequence1} is empty, every character in the {@code
+   * string} is escaped.
    *
-   * @param string string to get replaced
+   * @param string string to apply escaping to
    * @param sequence1 the old character sequence in lowercase
-   * @param escapeChar the new character to prefix sequence1 in return string.
-   * @return the new String
+   * @param escapeChar the escape character to prefix sequence1 in the returned string
+   * @return CharSequence with every occurrence of {@code sequence1} prepended with {@code
+   *     escapeChar}
    */
-  private static CharSequence replaceIgnoreCase(
+  private static CharSequence escapeIgnoringCase(
       CharSequence string, CharSequence sequence1, CharSequence escapeChar, Locale locale) {
     if (escapeChar == null || sequence1 == null || string == null) throw new NullPointerException();
 
-    // empty string case
     int count = string.length();
     int sequence1Length = sequence1.length();
+
+    // empty search string - escape every character
     if (sequence1Length == 0) {
-      StringBuilder result = new StringBuilder((count + 1) * escapeChar.length());
-      result.append(escapeChar);
+      StringBuilder result = new StringBuilder(count * (1 + escapeChar.length()));
       for (int i = 0; i < count; i++) {
-        result.append(string.charAt(i));

Review Comment:
   I am not that familiar with this code but I think it'd be good to keep the cosmetic cleanups separate from functional changes - if you don't mind, I'll push this change first, then you can come up with a more focused cleanup?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] Cleanup and fix EscapeQuerySyntaxImpl [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on code in PR #12973:
URL: https://github.com/apache/lucene/pull/12973#discussion_r1437830890


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -184,7 +186,7 @@ public CharSequence escape(CharSequence text, Locale locale, Type type) {
    * Returns a String where the escape char has been removed, or kept only once if there was a
    * double escape.
    *
-   * <p>Supports escaped unicode characters, e. g. translates <code>A</code> to <code>A</code>.
+   * <p>Supports escaped Unicode characters, e.g. translates <code>A</code> to <code>A</code>.

Review Comment:
   Seems like the comment is trying to say unicode escape sequences are replaced into their characters? Right now it says A->A, which doesn't make sense to me.



##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:
##########
@@ -40,105 +40,109 @@ public class EscapeQuerySyntaxImpl implements EscapeQuerySyntax {
     "AND", "OR", "NOT", "TO", "WITHIN", "SENTENCE", "PARAGRAPH", "INORDER"
   };
 
-  private static final CharSequence escapeChar(CharSequence str, Locale locale) {
-    if (str == null || str.length() == 0) return str;
+  private static CharSequence escapeChar(CharSequence str, Locale locale) {
+    if (str == null || str.isEmpty()) return str;
 
     CharSequence buffer = str;
 
-    // regular escapable Char for terms
-    for (int i = 0; i < escapableTermChars.length; i++) {
-      buffer = replaceIgnoreCase(buffer, escapableTermChars[i].toLowerCase(locale), "\\", locale);
+    // regular escapable char for terms
+    for (String escapableTermChar : escapableTermChars) {
+      buffer = escapeIgnoringCase(buffer, escapableTermChar.toLowerCase(locale), "\\", locale);
     }
 
-    // First Character of a term as more escaping chars
-    for (int i = 0; i < escapableTermExtraFirstChars.length; i++) {
-      if (buffer.charAt(0) == escapableTermExtraFirstChars[i].charAt(0)) {
-        buffer = "\\" + buffer.charAt(0) + buffer.subSequence(1, buffer.length());
+    // first char of a term as more escaping chars
+    for (String escapableTermExtraFirstChar : escapableTermExtraFirstChars) {
+      if (buffer.charAt(0) == escapableTermExtraFirstChar.charAt(0)) {
+        buffer = "\\" + buffer;
         break;
       }
     }
 
     return buffer;
   }
 
-  private final CharSequence escapeQuoted(CharSequence str, Locale locale) {
-    if (str == null || str.length() == 0) return str;
+  private static CharSequence escapeQuoted(CharSequence str, Locale locale) {
+    if (str == null || str.isEmpty()) return str;
 
     CharSequence buffer = str;
 
-    for (int i = 0; i < escapableQuotedChars.length; i++) {
-      buffer = replaceIgnoreCase(buffer, escapableTermChars[i].toLowerCase(locale), "\\", locale);
+    for (String escapableQuotedChar : escapableQuotedChars) {
+      buffer = escapeIgnoringCase(buffer, escapableQuotedChar.toLowerCase(locale), "\\", locale);
     }
     return buffer;
   }
 
-  private static final CharSequence escapeTerm(CharSequence term, Locale locale) {
-    if (term == null) return term;
+  private static CharSequence escapeTerm(CharSequence term, Locale locale) {
+    if (term == null || term.isEmpty()) return term;
 
-    // Escape single Chars
+    // escape single chars
     term = escapeChar(term, locale);
     term = escapeWhiteChar(term, locale);
 
-    // Escape Parser Words
-    for (int i = 0; i < escapableWordTokens.length; i++) {
-      if (escapableWordTokens[i].equalsIgnoreCase(term.toString())) return "\\" + term;
+    // escape parser words
+    for (String escapableWordToken : escapableWordTokens) {
+      if (escapableWordToken.equalsIgnoreCase(term.toString())) return "\\" + term;
     }
     return term;
   }
 
   /**
-   * replace with ignore case
+   * Prepend every case-insensitive occurrence of the {@code sequence1} in the {@code string} with
+   * the {@code escapeChar}. When the {@code sequence1} is empty, every character in the {@code
+   * string} is escaped.
    *
-   * @param string string to get replaced
+   * @param string string to apply escaping to
    * @param sequence1 the old character sequence in lowercase
-   * @param escapeChar the new character to prefix sequence1 in return string.
-   * @return the new String
+   * @param escapeChar the escape character to prefix sequence1 in the returned string
+   * @return CharSequence with every occurrence of {@code sequence1} prepended with {@code
+   *     escapeChar}
    */
-  private static CharSequence replaceIgnoreCase(
+  private static CharSequence escapeIgnoringCase(
       CharSequence string, CharSequence sequence1, CharSequence escapeChar, Locale locale) {
     if (escapeChar == null || sequence1 == null || string == null) throw new NullPointerException();
 
-    // empty string case
     int count = string.length();
     int sequence1Length = sequence1.length();
+
+    // empty search string - escape every character
     if (sequence1Length == 0) {
-      StringBuilder result = new StringBuilder((count + 1) * escapeChar.length());
-      result.append(escapeChar);
+      StringBuilder result = new StringBuilder(count * (1 + escapeChar.length()));
       for (int i = 0; i < count; i++) {
-        result.append(string.charAt(i));

Review Comment:
   Would it be possible to add a test for this, since you've found what looks like a bug? Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org