You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@opennlp.apache.org by "Bruno P. Kinoshita (Jira)" <ji...@apache.org> on 2022/01/07 09:18:00 UTC

[jira] [Resolved] (OPENNLP-1350) MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for certain input, and is also a bit imprecise

     [ https://issues.apache.org/jira/browse/OPENNLP-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno P. Kinoshita resolved OPENNLP-1350.
-----------------------------------------
    Resolution: Fixed

> MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for certain input, and is also a bit imprecise
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1350
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1350
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.9.3
>            Reporter: Jon Marius Venstad
>            Assignee: Bruno P. Kinoshita
>            Priority: Minor
>             Fix For: 1.9.4
>
>
> The regex used to strip email addresses from input, in UrlCharSequenceNormalizer, has quadratic complexity when used with {{{}String.replaceAll{}}}, and when input is a long sequence of characters from the first character set, i.e., {{{}[-_.0-9A-Za-z]{}}}, which fails to match the whole regex; then, the regex is evaluated again for each suffix of this sequence, with linear cost each time. 
> This problem is promptly solved by adding a negative lookbehind with a single character from that same set, to the first part of the regex. 
>  
> Additionally, the character {{_}} is allowed in the domain part of the mail address, where it is in fact illegal. Likewise, the character {{+}} is disallowed in the local part (the first first), where it _is{_} legal, and even quite common. The set of legal characters in the first part is actually quite bonkers, per the RFC, but such usage is probably less common. See [https://en.wikipedia.org/wiki/Email_address] for details. 
>  
> The suggested fix is to change the {{MAIL_REGEX}} declaration to
> {code:java}
> private static final Pattern MAIL_REGEX =
>       Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+"); {code}
> For a sequence of ~100k characters, the run time is ~1minute "on my machine". With this change, it reduces to a few milliseconds. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)