You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/06/16 17:00:08 UTC

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1541: RegExp - add case insensitive matching option

jpountz commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r440900284



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -489,6 +497,19 @@ public RegExp(String s) throws IllegalArgumentException {
     this(s, ALL);
   }
   
+  /**
+   * Constructs new <code>RegExp</code> from a string. Same as
+   * <code>RegExp(s, ALL)</code>.
+   * 
+   * @param s regexp string
+   * @param caseSensitive case sensitive matching
+   * @exception IllegalArgumentException if an error occurred while parsing the
+   *              regular expression
+   */
+  public RegExp(String s, boolean caseSensitive) throws IllegalArgumentException {
+    this(s, ALL, caseSensitive);
+  }  

Review comment:
       same here, maybe it's fine to require passing syntax flags when you want to configure case sensitivity?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -499,10 +520,30 @@ public RegExp(String s) throws IllegalArgumentException {
    *              regular expression
    */
   public RegExp(String s, int syntax_flags) throws IllegalArgumentException {
+    this(s, syntax_flags, true);
+  }
+  /**
+   * Constructs new <code>RegExp</code> from a string.
+   * 
+   * @param s regexp string
+   * @param syntax_flags boolean 'or' of optional syntax constructs to be
+   *          enabled
+   * @param caseSensitive case sensitive matching
+   * @exception IllegalArgumentException if an error occurred while parsing the
+   *              regular expression
+   */
+  public RegExp(String s, int syntax_flags, boolean caseSensitive) throws IllegalArgumentException {    
     originalString = s;
-    flags = syntax_flags;
+    // Trim any bits unrelated to syntax flags
+    syntax_flags  = syntax_flags & 0xff;
+    if (caseSensitive) {
+      flags = syntax_flags;
+    } else {      
+      // Add in the case-insensitive setting
+      flags = syntax_flags  | UNICODE_CASE_INSENSITIVE;

Review comment:
       ```suggestion
         flags = syntax_flags | UNICODE_CASE_INSENSITIVE;
   ```

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -743,6 +792,30 @@ private Automaton toAutomatonInternal(Map<String,Automaton> automata,
     }
     return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int maxDeterminizedStates) {
+    Automaton case1 = Automata.makeChar(codepoint);
+    int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+    Automaton result;
+    if (altCase != codepoint) {
+      result = Operations.union(case1, Automata.makeChar(altCase));
+      result = MinimizationOperations.minimize(result, maxDeterminizedStates);          
+    } else {
+      result = case1;                      
+    }          
+    return result;
+  }
+  
+  private Automaton toCaseInsensitiveString(int maxDeterminizedStates) {
+    List<Automaton> list = new ArrayList<>();
+    s.codePoints().forEach(
+        p -> {
+          list.add(toCaseInsensitiveChar(p, maxDeterminizedStates));
+        }
+    );

Review comment:
       I'd rather like a regular for loop, this is a bit abusing lambdas to my taste. :)

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -743,6 +792,30 @@ private Automaton toAutomatonInternal(Map<String,Automaton> automata,
     }
     return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int maxDeterminizedStates) {
+    Automaton case1 = Automata.makeChar(codepoint);
+    int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+    Automaton result;
+    if (altCase != codepoint) {
+      result = Operations.union(case1, Automata.makeChar(altCase));
+      result = MinimizationOperations.minimize(result, maxDeterminizedStates);          
+    } else {
+      result = case1;                      
+    }          
+    return result;
+  }

Review comment:
       I think that this is incorrect as there is no 1:1 mapping between lowercase and uppercase letters, for instance `ς` and `σ` both have `Σ` as their uppercase variant. And if someone uses `Σ` in their regexes, `ς` wouldn't match as `toLowerCase(Σ)` returns `σ`.
   
   Should we make this only about ASCII for now, like Java's Pattern class? https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/regex/Pattern.html#CASE_INSENSITIVE We could add support for full Unicode later but this doesn't look like a low hanging fruit to me?

##########
File path: lucene/core/src/java/org/apache/lucene/search/RegexpQuery.java
##########
@@ -68,6 +68,19 @@ public RegexpQuery(Term term) {
     this(term, RegExp.ALL);
   }
   
+  /**
+   * Constructs a query for terms matching <code>term</code>.
+   * <p>
+   * By default, all regular expression features are enabled.
+   * </p>
+   * 
+   * @param term regular expression.
+   * @param caseSensitive set to false for case insensitive matching 
+   */
+  public RegexpQuery(Term term, boolean caseSensitive) {
+    this(term, RegExp.ALL, caseSensitive);
+  }  

Review comment:
       In order not to have a combinatorial explosion of the number of ctors, I think we could consider dropping this one: I think it's fine to require users to provide flags if they also want to configure case sensitivity?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org