Posted to dev@tika.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2011/06/27 02:15:47 UTC

[jira] [Commented] (TIKA-369) Improve accuracy of language detection

    [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055242#comment-13055242 ] 

Jan Høydahl commented on TIKA-369:
----------------------------------

Any new thoughts on this one? It seems LUCENE-826 might be better and more complete than the current LangId in Tika.
Also, there is the idea of using dictionary-based matching for short texts, perhaps based on lucene-hunspell and OpenOffice.org (OOo) dictionaries. What do you think of such a hybrid solution?

> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
> 3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.
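
To make points 1 and 3 above a bit more concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: the class name CertaintySketch, the methods llr and isClearWinner, and the minRelativeMargin parameter are hypothetical and are not part of Tika's LanguageIdentifier API. llr computes the G^2 log-likelihood ratio statistic for a 2x2 table of counts, in the entropy form commonly used for Dunning's test; how such counts would be derived from n-gram profiles is left open. isClearWinner assumes scores are distances (lower is better) and compares the best language against the runner-up instead of against a fixed threshold.

```java
// Hypothetical helper, not Tika code: illustrates an LLR statistic and a
// margin-based certainty check.
final class CertaintySketch {

    // x * ln(x), using the convention 0 * ln(0) = 0.
    static double xLogX(double x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: N*ln(N) - sum(c*ln(c)), where N is the sum of counts.
    static double entropy(double... counts) {
        double sum = 0.0;
        double parts = 0.0;
        for (double c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // G^2 log-likelihood ratio for the 2x2 contingency table
    //   k11 k12
    //   k21 k22
    // Clamped at zero to absorb floating-point rounding.
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }

    // Scores are assumed to be distances (lower is better). The result counts
    // as certain only if the best score beats the runner-up by at least the
    // given relative margin, e.g. 0.2 for 20 percent.
    static boolean isClearWinner(double best, double secondBest, double minRelativeMargin) {
        if (secondBest <= 0.0) {
            return false; // both candidates match perfectly, so there is no clear winner
        }
        return (secondBest - best) / secondBest >= minRelativeMargin;
    }
}
```

With a check like isClearWinner, two languages with identical distances are never reported as certain, regardless of how small that distance is, which is the situation point 3 describes. The margin value itself would need tuning against real data.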

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira