Posted to dev@opennlp.apache.org by GitBox <gi...@apache.org> on 2020/06/02 14:49:02 UTC

[GitHub] [opennlp] jzonthemtn commented on a change in pull request #355: OPENNLP-1266 -- Limit regexes in UrlCharSequenceNormalizer

jzonthemtn commented on a change in pull request #355:
URL: https://github.com/apache/opennlp/pull/355#discussion_r433935594



##########
File path: opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
##########
@@ -44,4 +44,15 @@ public void normalizeEmail() throws Exception {
         "asdf   2nnfdf  ", normalizer.normalize("asdf asd.fdfa@hasdk23.com.br" +
             " 2nnfdf asd.fdfa@hasdk23.com.br"));
   }
+

Review comment:
       I think this is a good change, but I do worry about limiting the length of the URL in the regex. What if we added an argument to the `UrlCharSequenceNormalizer` constructor to make the limit an option for the user? That way the user can choose the trade-off between speed and potentially missing URLs, and there won't be any risk of changing the expected behavior of OpenNLP language detector applications out in the wild. Thoughts?
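
       To make the suggestion concrete, here is a rough sketch of what an opt-in limit could look like. This is not the actual `UrlCharSequenceNormalizer` code: the class name `ConfigurableUrlNormalizer`, the `maxUrlLength` parameter, and the character class in the regex are all assumptions made for illustration. The no-argument constructor keeps unbounded matching, so existing callers would see no change.

```java
import java.util.regex.Pattern;

// Hypothetical sketch only; not the real OpenNLP UrlCharSequenceNormalizer.
class ConfigurableUrlNormalizer {

  // Illustrative character class for the part of a URL after the scheme (an assumption).
  private static final String URL_CHARS = "[-_.?&~;+=/#0-9A-Za-z]";

  private final Pattern urlPattern;

  // Default: no length limit, so current behavior is preserved.
  ConfigurableUrlNormalizer() {
    this(0);
  }

  // maxUrlLength: upper bound on the number of characters matched after "http(s)://".
  // A value of 0 or less means "no limit".
  ConfigurableUrlNormalizer(int maxUrlLength) {
    String quantifier = maxUrlLength > 0 ? "{1," + maxUrlLength + "}" : "+";
    this.urlPattern = Pattern.compile("https?://" + URL_CHARS + quantifier);
  }

  // Replaces every matched URL with a single space.
  CharSequence normalize(CharSequence text) {
    return urlPattern.matcher(text).replaceAll(" ");
  }
}
```

       With a limit set, a URL longer than the bound is only partly replaced and its tail stays in the text, which is the "potentially missing URLs" side of the trade-off. With the default constructor the whole URL is removed as before.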




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org