Posted to commits@opennlp.apache.org by ki...@apache.org on 2022/01/07 09:16:24 UTC

[opennlp] branch master updated: OPENNLP-1350: Improve normaliser MAIL_REGEX (#399)

This is an automated email from the ASF dual-hosted git repository.

kinow pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/opennlp.git


The following commit(s) were added to refs/heads/master by this push:
     new cb0f3e6  OPENNLP-1350: Improve normaliser MAIL_REGEX (#399)
cb0f3e6 is described below

commit cb0f3e6c92dbce0f5307a3fbedf6f87fab0d2307
Author: Jon Marius Venstad <jo...@users.noreply.github.com>
AuthorDate: Fri Jan 7 10:16:15 2022 +0100

    OPENNLP-1350: Improve normaliser MAIL_REGEX (#399)
    
    The `MAIL_REGEX` in `UrlCharSequenceNormalizer` causes `replaceAll(...)` to become extremely costly when given an input string that contains a long sequence of characters from the first character set in the regex but ultimately fails to match the whole regex. This pull request fixes that, and also another detail:
    
    Allow + in the local part, and disallow _ in the domain part. There are other characters that are allowed in the local part as well, but these are less common (https://en.wikipedia.org/wiki/Email_address).
    
    The speedup for unfortunate input is achieved by adding a negative lookbehind with a single character from the first character set.
    Currently, the replaceAll(" ") on a string of ~100K characters from the set [-_.0-9A-Za-z] runs in ~1 minute on modern hardware; adding a negative lookbehind with one of the characters from that set reduces this to a few milliseconds, and is functionally equivalent. (Consider the current pattern and a match from position i to k. If the character at i-1 is in the character set, there would also be a match from i-1 to k, which would already be replaced.)
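
The equivalence argument and the speedup described above can be checked with a small standalone program. The following is a minimal sketch and not part of the commit: the class name `MailRegexSketch`, its helper methods, and the 10,000-character input size are assumptions chosen for illustration. Only the two patterns are copied from the diff.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it does not exist in OpenNLP.
final class MailRegexSketch {

  // Pattern before the commit (the removed line of the diff).
  static final Pattern OLD_MAIL_REGEX =
      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

  // Pattern after the commit (the added line of the diff).
  static final Pattern NEW_MAIL_REGEX =
      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // Replaces every match of the given pattern with a single space.
  static String replaceMail(Pattern pattern, String text) {
    return pattern.matcher(text).replaceAll(" ");
  }

  // Builds a string of n copies of c: a long run from the first
  // character set that never reaches an '@', so no match is possible.
  static String run(char c, int n) {
    char[] chars = new char[n];
    Arrays.fill(chars, c);
    return new String(chars);
  }

  // Times one replaceAll call and returns the elapsed milliseconds.
  static long timeMillis(Pattern pattern, String text) {
    long start = System.nanoTime();
    replaceMail(pattern, text);
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    // An address without '+' or '_', where both patterns should agree.
    String sample = "asdf asd.fdfa@hasdk23.com.br 2nnfdf";
    System.out.println("same output: "
        + replaceMail(OLD_MAIL_REGEX, sample).equals(replaceMail(NEW_MAIL_REGEX, sample)));

    // Assumed input size; smaller than the ~100K in the commit message
    // so that the old pattern finishes quickly.
    String unfortunate = run('a', 10_000);
    System.out.println("old pattern ms: " + timeMillis(OLD_MAIL_REGEX, unfortunate));
    System.out.println("new pattern ms: " + timeMillis(NEW_MAIL_REGEX, unfortunate));
  }
}
```

The timings depend on the machine and JVM, so no figures are claimed here. Without the lookbehind, the matcher retries the greedy character class from every start position in the run, so the cost grows roughly quadratically with the run length; with the lookbehind, every start position after the first is rejected immediately.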
---
 .../java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java | 2 +-
 .../opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java  | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java b/opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
index 847f86d..188e389 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
@@ -26,7 +26,7 @@ public class UrlCharSequenceNormalizer implements CharSequenceNormalizer {
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");
 
   private static final UrlCharSequenceNormalizer INSTANCE = new UrlCharSequenceNormalizer();
 
diff --git a/opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java b/opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
index 72eb83a..d5ac1a9 100644
--- a/opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
+++ b/opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
@@ -43,5 +43,9 @@ public class UrlCharSequenceNormalizerTest {
     Assert.assertEquals(
         "asdf   2nnfdf  ", normalizer.normalize("asdf asd.fdfa@hasdk23.com.br" +
             " 2nnfdf asd.fdfa@hasdk23.com.br"));
+    Assert.assertEquals(
+        "asdf   2nnfdf", normalizer.normalize("asdf asd+fdfa@hasdk23.com.br 2nnfdf"));
+    Assert.assertEquals(
+        "asdf  _br 2nnfdf", normalizer.normalize("asdf asd.fdfa@hasdk23.com_br 2nnfdf"));
   }
 }