
Posted to dev@tika.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2009/07/15 20:50:15 UTC

[jira] Commented: (TIKA-209) Language detection is weak.

    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731621#action_12731621 ] 

Ted Dunning commented on TIKA-209:
----------------------------------


I haven't looked at the Nutch code in a long time, but as I remember it, it didn't use the best statistics for the task.  The approach described here seems to be more accurate:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958

Sadly, I don't have a Java implementation of this handy, but I can share an ancient C implementation.





> Language detection is weak.
> ---------------------------
>
>                 Key: TIKA-209
>                 URL: https://issues.apache.org/jira/browse/TIKA-209
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.3
>            Reporter: Robert Newson
>
> In org.apache.tika.utils.Utils, the getUTF8Reader method assigns a language determination without checking the confidence rating from ICU's CharsetDetector.
> Please add a configurable confidence threshold (0-100):
> if (language != null && match.getConfidence() > THRESHOLD) {
>   metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
>   metadata.set(Metadata.LANGUAGE, match.getLanguage());
> }
> Obviously, using charset to imply language is generally weak, but it would be sufficient if the confidence threshold were configurable. Today, the text "hello" is tagged as French, for example.
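
The threshold check requested above could be sketched roughly as follows. This is only an illustration, not Tika's code: the Guess class is a hypothetical stand-in for a detector result (such as the match object in the snippet, with its getConfidence() and getLanguage() calls), acceptedLanguage is an assumed helper name, and the confidence values in main are made up.

```java
import java.util.Optional;

public class LanguageThreshold {

    /** Hypothetical stand-in for a detector result; not a Tika or ICU class. */
    static final class Guess {
        final String language;  // may be null when the detector has no opinion
        final int confidence;   // 0-100, the scale used in the issue description

        Guess(String language, int confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    /** Returns the guessed language only when its confidence exceeds the threshold. */
    static Optional<String> acceptedLanguage(Guess guess, int threshold) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0-100");
        }
        if (guess.language != null && guess.confidence > threshold) {
            return Optional.of(guess.language);
        }
        // Below the threshold: leave the language metadata unset.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative values only; a real detector would supply these.
        Guess weak = new Guess("fr", 10);
        Guess strong = new Guess("en", 85);

        System.out.println(acceptedLanguage(weak, 50));   // prints Optional.empty
        System.out.println(acceptedLanguage(strong, 50)); // prints Optional[en]
    }
}
```

Returning an empty Optional rather than a default language means the caller simply skips the metadata.set calls when the guess is too weak, which is the behavior the issue asks for.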
