Posted to issues@opennlp.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/06/01 00:48:00 UTC

[jira] [Comment Edited] (OPENNLP-1265) Improve speed of lang detect

    [ https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853463#comment-16853463 ] 

Tim Allison edited comment on OPENNLP-1265 at 6/1/19 12:47 AM:
---------------------------------------------------------------

How much are the normalizers slowing things down?  We need normalization, but let's see if one of them is slowing things down more than others.

The baseline is the run with the simple string-based ngrams above.

Let's try turning off each default normalizer, one by one:

Turn off emoji:
5442 : por=50
5605 : por=50
5528 : por=50

Turn off url (alone; emoji turned back on):
4317 : por=50
4219 : por=50
4257 : por=50

Turn off twitter:
5746 : por=50
5737 : por=50
5803 : por=50

Turn off number:
6204 : por=50
6208 : por=50
5974 : por=50

Turn off shrink char:
5371 : por=50
5619 : por=50
5352 : por=50

Now, for kicks, let's turn off all the normalizers:
2494 : por=50
2573 : por=50
2485 : por=50
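
To make the runs easier to compare, here is a small sketch that averages the three readings for each configuration. The class and method names (RunAverages, mean, averages) are made up for illustration; the numbers are copied from the runs listed above, and no unit is assumed for them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: averages the readings reported in this comment.
final class RunAverages {

    /** Arithmetic mean of the readings reported for one configuration. */
    static double mean(int... readings) {
        long sum = 0;
        for (int r : readings) {
            sum += r;
        }
        return (double) sum / readings.length;
    }

    /** Mean reading per configuration, keyed by what was turned off. */
    static Map<String, Double> averages() {
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("emoji off", mean(5442, 5605, 5528));
        m.put("url off", mean(4317, 4219, 4257));
        m.put("twitter off", mean(5746, 5737, 5803));
        m.put("number off", mean(6204, 6208, 5974));
        m.put("shrink char off", mean(5371, 5619, 5352));
        m.put("all off", mean(2494, 2573, 2485));
        return m;
    }

    public static void main(String[] args) {
        // Print one line per configuration, rounded to the nearest whole number.
        averages().forEach((name, avg) -> System.out.printf("%-16s %.0f%n", name, avg));
    }
}
```

With only three readings per configuration, the averages are a rough guide rather than a statistically solid comparison.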

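The URL normalizer seems to be the one with the largest effect: turning it off alone gives the lowest figures of any single normalizer (roughly 4,200 to 4,300, against roughly 5,350 to 6,200 for the others).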
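
For anyone who wants to repeat this kind of comparison, the procedure above (drop one normalizer at a time and time the rest) can be sketched roughly as follows. This is not the harness used for the numbers in this comment. The regex-based normalizers are hypothetical stand-ins, not OpenNLP's implementations, and the names NormalizerAblation, standIns, normalize and leaveOneOut are invented for the example. The timing is naive (no JIT warm-up), so treat the output as indicative only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical leave-one-out timing sketch; not OpenNLP code.
final class NormalizerAblation {

    /** Stand-in normalizers. These regexes are assumptions, NOT OpenNLP's patterns. */
    static Map<String, UnaryOperator<String>> standIns() {
        Pattern url = Pattern.compile("https?://\\S+");
        Pattern number = Pattern.compile("\\d+");
        Pattern shrink = Pattern.compile("(.)\\1{2,}");
        Map<String, UnaryOperator<String>> m = new LinkedHashMap<>();
        m.put("url", s -> url.matcher(s).replaceAll(" "));
        m.put("number", s -> number.matcher(s).replaceAll(" "));
        // Collapse runs of three or more identical characters down to two.
        m.put("shrink", s -> shrink.matcher(s).replaceAll("$1$1"));
        return m;
    }

    /** Applies every normalizer except the one named by skip (null skips none). */
    static String normalize(Map<String, UnaryOperator<String>> all, String skip, String text) {
        String out = text;
        for (Map.Entry<String, UnaryOperator<String>> e : all.entrySet()) {
            if (!e.getKey().equals(skip)) {
                out = e.getValue().apply(out);
            }
        }
        return out;
    }

    /** For each normalizer, times `rounds` passes over texts with that normalizer left out (ms). */
    static Map<String, Long> leaveOneOut(Map<String, UnaryOperator<String>> all,
                                         List<String> texts, int rounds) {
        Map<String, Long> elapsedMs = new LinkedHashMap<>();
        for (String skip : all.keySet()) {
            long sink = 0; // consume results so the work is not optimized away
            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                for (String t : texts) {
                    sink += normalize(all, skip, t).length();
                }
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (sink < 0) {
                throw new IllegalStateException("unexpected length overflow");
            }
            elapsedMs.put(skip, ms);
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        List<String> texts = List.of("see http://example.org/page for 12345 detaaaails");
        leaveOneOut(standIns(), texts, 10_000)
            .forEach((skipped, ms) -> System.out.println(skipped + " off: " + ms + " ms"));
    }
}
```

Swapping the stand-ins for the real normalizers and the sample text for the actual test corpus would be the obvious next step; a proper benchmark harness such as JMH would give more trustworthy numbers than System.nanoTime() alone.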


> Improve speed of lang detect
> ----------------------------
>
>                 Key: OPENNLP-1265
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1265
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Over on TIKA-2790, we found that OpenNLP's language detector is far, far slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.


