Posted to dev@opennlp.apache.org by "mawiesne (via GitHub)" <gi...@apache.org> on 2023/02/26 14:48:18 UTC

[GitHub] [opennlp] mawiesne commented on a diff in pull request #506: OPENNLP-141 Tokenizers alphanumeric optimization only recognizes a-z as alpha chars

mawiesne commented on code in PR #506:
URL: https://github.com/apache/opennlp/pull/506#discussion_r1118098845


##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -25,24 +25,45 @@
 
 public class Factory {
 
-  public static final String DEFAULT_ALPHANUMERIC = "^[A-Za-z0-9]+$";
+  public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");
+
+  private static final Pattern PORTOGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
+  private static final Pattern FRENCH = Pattern.compile("^[a-zA-Z0-9àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ]+$");
+
+  // For reference: https://www.sttmedia.com/characterfrequency-dutch
+  private static final Pattern DUTCH = Pattern.compile("^[A-Za-z0-9äöüëèéïijÄÖÜËÉÈÏIJ]+$");
+  private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");
 
   /**
-   * Gets the alphanumeric pattern for the language. Please save the value
-   * locally because this call is expensive.
+   * Gets the alphanumeric pattern for a language.
    *
-   * @param languageCode The language code. If {@code null}, or unknown,
-   *                     the default pattern will be returned.
-   * @return The alphanumeric pattern for the language or the default pattern.
+   * @param languageCode The ISO_639-1 code. If {@code null}, or unknown, the
+   *                     {@link #DEFAULT_ALPHANUMERIC} pattern will be returned.
+   * @return The alphanumeric {@link Pattern} for the language, or the default pattern.
    */
   public Pattern getAlphanumeric(String languageCode) {
+    // For reference: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
     if ("pt".equals(languageCode) || "por".equals(languageCode)) {
-      return Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
+      return PORTOGUESE;

Review Comment:
   Expert speaking. :)
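
For readers following the diff above, here is a minimal stand-alone sketch of the change it makes: language-specific alphanumeric patterns are compiled once as constants instead of on every call. The class name `AlphanumericSketch` is hypothetical, and the German branch with the codes "de" and "deu" is an assumption, since the hunk shown here ends before that part of `getAlphanumeric`. Non-ASCII letters are written as Unicode escapes only to keep the example independent of source-file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; NOT the actual opennlp.tools.tokenize.lang.Factory class.
final class AlphanumericSketch {

  // Compiled once when the class is loaded and then shared.
  // java.util.regex.Pattern instances are immutable and safe for concurrent use.
  static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

  // Same character set as the GERMAN constant in the diff:
  // a-umlaut, o-umlaut, u-umlaut (lower and upper case) and sharp s.
  static final Pattern GERMAN =
      Pattern.compile("^[A-Za-z0-9\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]+$");

  // Assumption: "de" / "deu" select the German pattern; the diff hunk only shows "pt" / "por".
  static Pattern getAlphanumeric(String languageCode) {
    if ("de".equals(languageCode) || "deu".equals(languageCode)) {
      return GERMAN;
    }
    // null or unknown codes fall back to the default pattern.
    return DEFAULT_ALPHANUMERIC;
  }

  public static void main(String[] args) {
    String token = "Stra\u00dfe"; // "Strasse" spelled with sharp s

    // The default pattern only knows A-Z, a-z and 0-9, so it rejects the token.
    System.out.println(getAlphanumeric(null).matcher(token).matches());

    // The German pattern accepts it.
    System.out.println(getAlphanumeric("de").matcher(token).matches());

    // Repeated calls return the same precompiled instance.
    System.out.println(getAlphanumeric("de") == getAlphanumeric("deu"));
  }
}
```

Because each call now returns a shared constant, callers no longer need to cache the result themselves, which is presumably why the "this call is expensive" note was dropped from the Javadoc.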


