Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 16:31:06 UTC

[jira] [Closed] (NUTCH-144) corrupt language identifier tri files and bad language recognition for german

     [ https://issues.apache.org/jira/browse/NUTCH-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-144.
-------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> corrupt language identifier tri files and bad language recognition for german
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-144
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Bernhard Messer
>            Priority: Minor
>
> Hi,
> I had a look at the generated language guesser tri files. As far as I can tell, several of them (de.ngp, da.ngp, es.ngp) seem to be corrupt, which leads to a poor language recognition rate. For example, the German tri file should contain the German special characters "ä", "ö", "ü" with their frequencies. The text "grüne Hüte", which is typical German, is recognized as Danish. The problem may come from a wrong character encoding during training.
> Jerome, could you provide the training files so that the language identifier can be retrained?
> regards
>  Bernhard
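
The encoding hypothesis in the report can be illustrated with a small sketch. This is not Nutch's language identifier code: the trigram_counts helper and the "_" word padding are assumptions made only for illustration. It shows that if UTF-8 training text is decoded as ISO-8859-1, the trigrams containing "ü" never reach the profile and are replaced by mojibake sequences.

```python
from collections import Counter


def trigram_counts(text):
    """Count character trigrams per word, padding each word with '_'.

    Hypothetical helper for illustration; not the Nutch implementation.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


sample = "grüne Hüte"
raw = sample.encode("utf-8")  # bytes as they might sit in a training file

# Decoded with the correct charset: the umlaut trigrams are counted.
good = trigram_counts(raw.decode("utf-8"))

# Decoded with the wrong charset: each "ü" becomes two unrelated characters.
bad = trigram_counts(raw.decode("iso-8859-1"))

print("correct decoding:", sorted(g for g in good if "ü" in g))
print("wrong decoding:  ", sorted(g for g in bad if "ü" in g))
```

A profile trained this way would have no entries to match against correctly decoded German input, which is consistent with the misclassification described above, though the report itself does not confirm the cause.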
