Posted to dev@nutch.apache.org by "Bernhard Messer (JIRA)" <ji...@apache.org> on 2005/12/17 17:51:34 UTC
[jira] Created: (NUTCH-144) corrupt language identifier tri files and bad language recognition for german
corrupt language identifier tri files and bad language recognition for german
-----------------------------------------------------------------------------
Key: NUTCH-144
URL: http://issues.apache.org/jira/browse/NUTCH-144
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Bernhard Messer
Priority: Minor
Hi,
I had a look at the generated language guesser tri files. As far as I can tell, several of them (de.ngp, da.ngp, es.ngp) seem to be corrupt, which leads to a poor language recognition rate. For example, the German tri file should contain the German special characters "ä", "ö", "ü" with their frequencies. The text "grüne Hüte", which is typical German, is recognized as Danish. Maybe the problem comes from a wrong character encoding during training.
Jerome, could you provide the training files so that the language identifier can be retrained?
regards
Bernhard
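To illustrate the suspected cause, here is a minimal Python sketch (not Nutch code) of how decoding training text with the wrong charset could remove the umlauts from a trigram profile. The trigram_profile helper and its "_" word padding are hypothetical conventions chosen for this example, and the real .ngp file format is not reproduced here.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count character trigrams per word.

    Hypothetical convention for this sketch: each word is lowercased
    and padded with '_' on both sides before trigrams are extracted.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


# The sample text from the report, as UTF-8 bytes in a training file.
raw = "grüne Hüte".encode("utf-8")

# Decoded with the correct charset, the umlaut survives.
good = trigram_profile(raw.decode("utf-8"))

# Decoded as ISO-8859-1 (assumed wrong charset), each 'ü' turns into
# two unrelated characters, so no trigram contains 'ü' any more.
bad = trigram_profile(raw.decode("iso-8859-1"))

has_umlaut_good = any("ü" in gram for gram in good)
has_umlaut_bad = any("ü" in gram for gram in bad)

print("correct decoding:", sorted(good))
print("wrong decoding:  ", sorted(bad))
print("umlaut trigrams present:", has_umlaut_good, has_umlaut_bad)
```

If the training run read the corpus with a mismatched charset like this, the resulting German profile would lack trigrams such as "grü" and "hüt", which would be consistent with the misclassification described above. Whether this is what actually happened during training is an assumption.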
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-144) corrupt language identifier tri files and bad language recognition for german
Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-144?page=comments#action_12360690 ]
Jerome Charron commented on NUTCH-144:
--------------------------------------
Bernhard,
The training files used were those from the European Parliament Proceedings Parallel Corpus 1996-2003, Release v2.
http://people.csail.mit.edu/koehn/publications/europarl/
More details here: http://wiki.apache.org/nutch/LanguageIdentifierBenchs
I already have some suspicions about encoding problems. See a quick note here: http://wiki.apache.org/nutch/LanguageIdentifier
Regards
Jérôme
[jira] Commented: (NUTCH-144) corrupt language identifier tri files and bad language recognition for german
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-144?page=comments#action_12360668 ]
Stefan Groschupf commented on NUTCH-144:
----------------------------------------
A good source for such documents is:
http://www.gutenberg.org/catalog/