
Posted to issues@opennlp.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/06/01 00:38:00 UTC

[jira] [Commented] (OPENNLP-1265) Improve speed of lang detect

    [ https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853507#comment-16853507 ] 

Tim Allison commented on OPENNLP-1265:
--------------------------------------

Side issue...it looks like the URL normalizer uses unbounded regexes.  This was a problem on TIKA-2777 with a file that had a long, long string of DNA (atcgcgat).

If you turn off all of the normalizers except the url normalizer and get rid of the spaces in the input string, the time goes to:

...it has been 20 minutes...I'll update this when/if it finishes this year.

If you bound the regex quantifiers to 100, the time is acceptable, but still discomforting:
{noformat}
  private static final Pattern URL_REGEX =
      Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,100}");
  private static final Pattern MAIL_REGEX =
      Pattern.compile("[-_.0-9A-Za-z]{1,100}@[-_0-9A-Za-z]{1,100}[-_.0-9A-Za-z]{1,100}");
{noformat}
25167 : lat=50
25537 : lat=50
25116 : lat=50


Bounding the regexes doesn't help on the regular string, of course, but guard rails are good:
5938 : por=50
6331 : por=50
5989 : por=50
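
For anyone who wants to poke at the space-free case in isolation, here is a minimal sketch rather than the actual OpenNLP normalizer. The two patterns are the bounded ones quoted above; the class name BoundedUrlNormalizer, the normalize and repeat helpers, the replace-with-a-space behaviour, and the 200,000-character input are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class BoundedUrlNormalizer {

  // Bounded patterns, copied from the comment above.
  private static final Pattern URL_REGEX =
      Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,100}");
  private static final Pattern MAIL_REGEX =
      Pattern.compile("[-_.0-9A-Za-z]{1,100}@[-_0-9A-Za-z]{1,100}[-_.0-9A-Za-z]{1,100}");

  // Hypothetical helper: replace every URL match, then every mail match,
  // with a single space. This is an assumption about what a normalizer does.
  public static String normalize(CharSequence text) {
    String withoutUrls = URL_REGEX.matcher(text).replaceAll(" ");
    return MAIL_REGEX.matcher(withoutUrls).replaceAll(" ");
  }

  // Hypothetical helper: build a long input with no spaces in it.
  public static String repeat(String unit, int times) {
    StringBuilder sb = new StringBuilder(unit.length() * times);
    for (int i = 0; i < times; i++) {
      sb.append(unit);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // 200,000 characters of "DNA" with no spaces, '@' or "http://".
    String dna = repeat("atcgcgat", 25_000);

    long start = System.nanoTime();
    String out = normalize(dna);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    // Nothing in the input can match, so the text should come back unchanged.
    System.out.println(elapsedMs + " ms, unchanged=" + out.equals(dna));
  }
}
```

With the {1,100} bound, a failed mail match at any start position can only scan about 100 characters before giving up, so the work grows roughly linearly with input length. Swapping {1,100} back to + lets each attempt run to the end of the string, which is the likely source of the quadratic blow-up on space-free input.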

Happy to open a separate ticket.  Let me know how I can help...


> Improve speed of lang detect
> ----------------------------
>
>                 Key: OPENNLP-1265
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1265
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Over on TIKA-2790, we found that opennlp's language detector is far, far slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.


