You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@opennlp.apache.org by GitBox <gi...@apache.org> on 2022/01/06 07:53:01 UTC
[GitHub] [opennlp] jonmv commented on a change in pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

jonmv commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r779358503



##########
File path: opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

Review comment:
       Yes, the addition of `+` is just an improvement. Also, `_` is not valid in the domain part—this is just an error in the current regex. 
   
   The _negative_ lookbehind ensures a regex is _not_ a match if the lookbehind matches right before the pattern.
   Consider the string `foo aaa@b`. With the current regex, all of `aaa@b`, `aa@b`, and `a@b` would match. Because it's used with a `replaceAll`, however, only the first (longest) match will actually be used, i.e., `aaa@b`, and we'll retain `foo  `.
   With the negative lookbehind, `aaa@b` will still match, as the previous character (` `) does _not_ match the lookbehind (so the match is _not_ ignored (double negation, sorry)). However, `aa@b` is no longer a match, as the previous character in this case is `a`, which _does_ match the lookbehind, and the match is ignored. Likewise, `a@b` no longer matches. 
   For strings that _are_ email addresses, we get the same behaviour: the longest match is replaced with ` `. 
   
   Now consider the string `foo aaa`. Neither of the regexes match anything here. The current regex, however, evaluates the whole of `foo` as a possible local-part match until it reaches ` `, which is not allowed. It then tries with `oo`, and `o`. Likewise, it tries `aaa` until it reaches the end of input, then `aa`, and `a`. 
   The new regex, on the other hand, tries `foo` as a local-part because the start of input does _not_ match the lookbehind, terminates on ` ` (as the current regex), but then it immediately fails on `oo` because `f` _does_ match the lookbehind, so `oo` and `o` immediately fail. Likewise, only `aaa` is examined until the end; `aa` and `a` fail early. 
   This does not make much of a difference with these short strings, but you can see the complexity is quadratic (current) vs linear (proposed) with the length of strings like `aaa`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org