Posted to issues@opennlp.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/05/31 23:05:00 UTC

[jira] [Comment Edited] (OPENNLP-1265) Improve speed of lang detect

    [ https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853462#comment-16853462 ] 

Tim Allison edited comment on OPENNLP-1265 at 5/31/19 11:04 PM:
----------------------------------------------------------------

Baseline:
Input string: 10000x "estava em uma marcenaria na Rua Bruno "
model: langdetect-183.bin
runs: 4 (the first run is a warmup; its result is not shown)

Results (millis, lang)
13366 : por=50
13608 : por=50
14035 : por=50
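
For anyone who wants to reproduce this kind of measurement, the setup above (phrase repeated 10000 times, four runs, first run discarded as warmup) could be sketched roughly as below. This is an illustrative harness, not the code used to produce the numbers in this comment: the class name, `buildInput`, `timeRuns`, and the placeholder detector are all hypothetical, and the real OpenNLP language detector call would need to be plugged in where the placeholder is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical timing harness; none of these names come from OpenNLP.
public class LangDetectTimingSketch {

    // Builds the benchmark input by repeating the phrase `times` times.
    public static String buildInput(String phrase, int times) {
        StringBuilder sb = new StringBuilder(phrase.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(phrase);
        }
        return sb.toString();
    }

    // Calls the detector `runs` times and returns the elapsed milliseconds
    // of every run except the first, which is treated as warmup.
    public static List<Long> timeRuns(Function<String, ?> detector, String input, int runs) {
        List<Long> millis = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            detector.apply(input);
            long elapsed = (System.nanoTime() - start) / 1_000_000L;
            if (i > 0) {
                millis.add(elapsed);
            }
        }
        return millis;
    }

    public static void main(String[] args) {
        String input = buildInput("estava em uma marcenaria na Rua Bruno ", 10000);
        // Placeholder only: replace with a call into the actual language detector.
        Function<String, Integer> placeholderDetector = String::length;
        for (long ms : timeRuns(placeholderDetector, input, 4)) {
            System.out.println(ms + " ms");
        }
    }
}
```

A single-shot loop like this is only a coarse measurement; a harness such as JMH would give more reliable numbers.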

If we switch to working with string-based ngrams instead of StringList, there's a roughly 2.2x improvement (from about 13.7 s to about 6.1 s per run):
6087 : por=50
6202 : por=50
6146 : por=50

see: https://github.com/tballison/opennlp/blob/OPENNLP-1265/opennlp-tools/src/main/java/opennlp/tools/ngram/NGramModelSimplified.java
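
As a rough illustration of what "string-based ngrams" can look like, here is a minimal sketch that counts character n-grams in a map keyed by plain String. It is not the code in the linked NGramModelSimplified class; the class and method names below are hypothetical, and the sketch assumes the saving comes from avoiding a wrapper object (and its hashing) for each n-gram.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not taken from the linked branch.
public class StringNGramSketch {

    // Counts all character n-grams with lengths from minLen to maxLen
    // (inclusive), using the n-gram String itself as the map key.
    public static Map<String, Integer> countNGrams(CharSequence text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toString();
        for (int len = minLen; len <= maxLen; len++) {
            for (int start = 0; start + len <= s.length(); start++) {
                // merge() inserts 1 for a new key or adds 1 to the existing count.
                counts.merge(s.substring(start, start + len), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countNGrams("estava em uma marcenaria na Rua Bruno ", 1, 3);
        System.out.println(counts.size() + " distinct n-grams");
    }
}
```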


was (Author: tallison@mitre.org):
Baseline:
Input string: 10000x "estava em uma marcenaria na Rua Bruno "
model: langdetect-183.bin
runs: 4 (don't show results for first warmup run)

Results (millis)
13366 : {por=50}
13608 : {por=50}
14035 : {por=50}

If we switch to working with string based ngrams instead of StringList, there's a 2x improvement:
6087 : {por=50}
6202 : {por=50}
6146 : {por=50}

see: https://github.com/tballison/opennlp/blob/OPENNLP-1265/opennlp-tools/src/main/java/opennlp/tools/ngram/NGramModelSimplified.java

> Improve speed of lang detect
> ----------------------------
>
>                 Key: OPENNLP-1265
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1265
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Over on TIKA-2790, we found that opennlp's language detector is far, far slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.


