Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/06/04 19:16:00 UTC

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

    [ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856039#comment-16856039 ] 

Tim Allison commented on TIKA-2790:
-----------------------------------

I was able to get a 4x improvement in speed, but that is still slower than Optimaize and far, far slower than Yalder.  IIUC, neither Optimaize nor Yalder processes the full string; rather, they sample or apply some kind of stopping criterion.  I think we can work towards that in our own wrapper of OpenNLP and, hopefully, push it upstream into OpenNLP.
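
To make the stopping-criterion idea concrete, here is a rough sketch of what such a wrapper might look like. It is only an illustration under stated assumptions: Detector and Guess are hypothetical stand-ins for whatever call the wrapper would make into the underlying detector (they are not OpenNLP's API), and the prefix-doubling schedule and the parameter names initialChars, maxChars and minConfidence are made up here; this is not how Optimaize or Yalder actually decide when to stop.

```java
final class EarlyStopLangDetect {

    /** Hypothetical result type: a language code plus a confidence in [0, 1]. */
    static final class Guess {
        final String lang;
        final double confidence;

        Guess(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /** Hypothetical stand-in for the underlying detector; not a real library interface. */
    interface Detector {
        Guess detect(CharSequence text);
    }

    /**
     * Runs the detector on a growing prefix of the text and returns as soon as
     * the confidence reaches minConfidence, or once the prefix has reached
     * min(text.length(), maxChars). The prefix doubles on each round, so the
     * total number of characters handed to the detector stays within roughly
     * twice the length of the final prefix.
     */
    static Guess detect(Detector detector, String text,
                        int initialChars, int maxChars, double minConfidence) {
        if (initialChars <= 0 || maxChars < initialChars) {
            throw new IllegalArgumentException("need 0 < initialChars <= maxChars");
        }
        int limit = Math.min(text.length(), maxChars);
        int len = Math.min(initialChars, limit);
        while (true) {
            // Note: substring may split a surrogate pair at the cut point;
            // a real wrapper would probably want to cut on a code point boundary.
            Guess guess = detector.detect(text.substring(0, len));
            if (guess.confidence >= minConfidence || len >= limit) {
                return guess;
            }
            len = (int) Math.min((long) len * 2, limit);
        }
    }
}
```

With something like this, a call such as EarlyStopLangDetect.detect(detector, text, 100, 20000, 0.75) would look at only the first few hundred characters of a long, clean document, and would fall back to a capped prefix when the detector never becomes confident. Whether confidence is the right stopping signal, and what the thresholds should be, would need to be checked against the eval data.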

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip
>
>



