You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@opennlp.apache.org by GitBox <gi...@apache.org> on 2021/12/16 21:04:01 UTC

[GitHub] [opennlp] jonmv opened a new pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

jonmv opened a new pull request #399:
URL: https://github.com/apache/opennlp/pull/399


   Addresses OPENNLP-1350
   
   The `MAIL_REGEX` in `UrlCharSSequenceNormalizer` causes `replaceAll(...)` to become extremely costly when given an input string with a long sequence of characters from the first character set in the regex, but which ultimately fails to match the whole regex. This pull request fixes that, and also another detail:
   
   Allow `+` in the local part, and disallow `_` in the domain part. There are other characters that are allowed in the local part as well, but these are less common (https://en.wikipedia.org/wiki/Email_address).
   
   The speedup for unfortunate input is achieved by adding a negative lookbehind with a single characters from the first character set. 
   Currently, the replaceAll(" ") on a string of ~100K characters from the set `[-_.0-9A-Za-z]` runs in ~1minute on modern hardware; adding a negative lookbehind with one of the characters from that set reduces this to a few milliseconds, and is functionally equivalent. (Consider the current pattern and a match from position `i` to `k`. If the character at `i-1` is in the character set, there would also be a match from `i-1` to `k`, which would already be replaced.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jonmv commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jonmv commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007223297






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow closed pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow closed pull request #399:
URL: https://github.com/apache/opennlp/pull/399


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007225768






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jzonthemtn commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jzonthemtn commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007410839


   Thanks @jonmv and @kinow!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jzonthemtn commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jzonthemtn commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1002630262


   Hi @jonmv, thanks for the contribution! Are there any test changes needed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow merged pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow merged pull request #399:
URL: https://github.com/apache/opennlp/pull/399


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jonmv commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jonmv commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007223297


   Sorry about that, a little edit error in the expected outcome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jzonthemtn edited a comment on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jzonthemtn edited a comment on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1002630262


   Hi @jonmv, thanks for the contribution! Are there any test changes needed or is it only a performance improvement?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jonmv commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jonmv commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1006386562


   > I think we should add a unit test for these two improvements, just to make sure that this is not reverted accidentally later. WDYT?
   
   Yes, you're right. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007227208


   Squashed commits into a single commit, then re-worded single commit to have the prefix OPENNLP-1350. Finally, rebased, then push-force'd :+1: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007234138


   Closed/opened but Travis CI bot didn't come here to comment. The build is in progress though: https://app.travis-ci.com/github/apache/opennlp/builds/244439208
   
   JIRA is up to date, will come back later to see if Travis is done, then merge it if no issues found.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jzonthemtn commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jzonthemtn commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007410839


   Thanks @jonmv and @kinow!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jonmv commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jonmv commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1006353294


   Hi, the unit tests do not need to be updated. They were valid before, and still are. This is primarily a performance optimisation. Additionally, the PR allows `+` in the local-part of emails, and disallows `_` in the domain part. This is just an improvement. The actual email local-part rules are terrifying, and probably not worth expressing in a regex—the current is a compromise which should capture common email addresses out there. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1006378664


   > Hi, the unit tests do not need to be updated. They were valid before, and still are.
   
   :+1: 
   
   >This is primarily a performance optimisation. Additionally, the PR allows `+` in the local-part of emails, and disallows `_` in the domain part. This is just an improvement.
   
   I think this is a good improvement, I use `+` a lot for $work and also for myself. I think we should add a unit test for these two improvements, just to make sure that this is not reverted accidentally later. WDYT?
   
   > The actual email local-part rules are terrifying, and probably not worth expressing in a regex—the current is a compromise which should capture common email addresses out there.
   
   :+1: and I think the lookbehind and performance should be fine, no need for any benchmark IMO. I'll go through your comment above (thanks!) just to educate myself :neckbeard: 
   
   Thank you!
   -Bruno


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on a change in pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r776977532



##########
File path: opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

Review comment:
       I'd need more time to confirm what the lookbehind is doing. But from a first look, I like that the new regular expression understands emails like "someone+test@gmail.com".
   
   The previous regex would capture "test@gmail.com", missing the "someone+" part I think. Which can lead to really strange results, depending how that value is used in models/systems.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow closed pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow closed pull request #399:
URL: https://github.com/apache/opennlp/pull/399


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jonmv commented on a change in pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jonmv commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r779358503



##########
File path: opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

Review comment:
       Yes, the addition of `+` is just an improvement. Also, `_` is not valid in the domain part—this is just an error in the current regex. 
   
   The _negative_ lookbehind ensures a regex is _not_ a match if the lookbehind matches right before the pattern.
   Consider the string `foo aaa@b`. With the current regex, all of `aaa@b`, `aa@b`, and `a@b` would match. Because it's used with a `replaceAll`, however, only the first (longest) match will actually be used, i.e., `aaa@b`, and we'll retain `foo  `.
   With the negative lookbehind, `aaa@b` will still match, as the previous character (` `) does _not_ match the lookbehind (so the match is _not_ ignored (double negation, sorry)). However, `aa@b` is no longer a match, as the previous character in this case is `a`, which _does_ match the lookbehind, and the match is ignored. Likewise, `a@b` no longer matches. 
   For strings that _are_ email addresses, we get the same behaviour: the longest match is replaced with ` `. 
   
   Now consider the string `foo aaa`. Neither of the regexes match anything here. The current regex, however, evaluates the whole of `foo` as a possible local-part match until it reaches ` `, which is not allowed. It then tries with `oo`, and `o`. Likewise, it tries `aaa` until it reaches the end of input, then `aa`, and `a`. 
   The new regex, on the other hand, tries `foo` as a local-part because the start of input does _not_ match the lookbehind, terminates on ` ` (as the current regex), but then it immediately fails on `oo` because `f` _does_ match the lookbehind, so `oo` and `o` immediately fail. Likewise, only `aaa` is examined until the end; `aa` and `a` fail early. 
   This does not make much of a difference with these short strings, but you can see the complexity is quadratic (current) vs linear (proposed) with the length of strings like `aaa`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] jonmv commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
jonmv commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007233550


   Great, thanks :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on pull request #399:
URL: https://github.com/apache/opennlp/pull/399#issuecomment-1007225768


   Got it, let me review the JIRA issue fields to see if it needs updating, and squash the commits. Then if CI passes I think we are ready to merge it. Thanks @jonmv !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow commented on a change in pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r779922393



##########
File path: opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

Review comment:
       Perfect explanation, all made sense to me @jonmv , thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [opennlp] kinow merged pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

Posted by GitBox <gi...@apache.org>.
kinow merged pull request #399:
URL: https://github.com/apache/opennlp/pull/399


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org