You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@opennlp.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/01/06 07:54:00 UTC

[jira] [Commented] (OPENNLP-1350) MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for certain input, and is also a bit imprecise

    [ https://issues.apache.org/jira/browse/OPENNLP-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469752#comment-17469752 ] 

ASF GitHub Bot commented on OPENNLP-1350:
-----------------------------------------

jonmv commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r779358503



##########
File path: opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

Review comment:
       Yes, the addition of `+` is just an improvement. Also, `_` is not valid in the domain part—this is just an error in the current regex. 
   
   The _negative_ lookbehind ensures a regex is _not_ a match if the lookbehind matches right before the pattern.
   Consider the string `foo aaa@b`. With the current regex, all of `aaa@b`, `aa@b`, and `a@b` would match. Because it's used with a `replaceAll`, however, only the first (longest) match will actually be used, i.e., `aaa@b`, and we'll retain `foo  `.
   With the negative lookbehind, `aaa@b` will still match, as the previous character (` `) does _not_ match the lookbehind (so the match is _not_ ignored (double negation, sorry)). However, `aa@b` is no longer a match, as the previous character in this case is `a`, which _does_ match the lookbehind, and the match is ignored. Likewise, `a@b` no longer matches. 
   For strings that _are_ email addresses, we get the same behaviour: the longest match is replaced with ` `. 
   
   Now consider the string `foo aaa`. Neither of the regexes match anything here. The current regex, however, evaluates the whole of `foo` as a possible local-part match until it reaches ` `, which is not allowed. It then tries with `oo`, and `o`. Likewise, it tries `aaa` until it reaches the end of input, then `aa`, and `a`. 
   The new regex, on the other hand, tries `foo` as a local-part because the start of input does _not_ match the lookbehind, terminates on ` ` (as the current regex), but then it immediately fails on `oo` because `f` _does_ match the lookbehind, so `oo` and `o` immediately fail. Likewise, only `aaa` is examined until the end; `aa` and `a` fail early. 
   This does not make much of a difference with these short strings, but you can see the complexity is quadratic (current) vs linear (proposed) with the length of strings like `aaa`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for certain input, and is also a bit imprecise
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1350
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1350
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.9.3
>            Reporter: Jon Marius Venstad
>            Priority: Minor
>
> The regex used to strip email addresses from input, in UrlCharSequenceNormalizer, has quadratic complexity when used with {{{}String.replaceAll{}}}, and when input is a long sequence of characters from the first character set, i.e., {{{}[-_.0-9A-Za-z]{}}}, which fails to match the whole regex; then, the regex is evaluated again for each suffix of this sequence, with linear cost each time. 
> This problem is promptly solved by adding a negative lookbehind with a single character from that same set, to the first part of the regex. 
>  
> Additionally, the character {{_}} is allowed in the domain part of the mail address, where it is in fact illegal. Likewise, the character {{+}} is disallowed in the local part (the first first), where it _is{_} legal, and even quite common. The set of legal characters in the first part is actually quite bonkers, per the RFC, but such usage is probably less common. See [https://en.wikipedia.org/wiki/Email_address] for details. 
>  
> The suggested fix is to change the {{MAIL_REGEX}} declaration to
> {code:java}
> private static final Pattern MAIL_REGEX =
>       Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+"); {code}
> For a sequence of ~100k characters, the run time is ~1minute "on my machine". With this change, it reduces to a few milliseconds. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)