Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/04 17:33:42 UTC
[Nutch Wiki] Update of "NewLanguageIdentifier" by JeromeCharron
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/NewLanguageIdentifier
New page:
== Architecture ==
TODO
== NGram profile format ==
TODO
== Generating NGram profiles ==
TODO
== Open Issues ==
* ''Lab'' tests (LanguageIdentifierBenchs) give quite good results, but results in ''real life'' are not as good. In its current version, the NewLanguageIdentifier expects the text it analyzes to be UTF-8 encoded, which is not the case for many fetched documents. The NewLanguageIdentifier therefore needs to refer to a {{{content-encoding}}} metadata value. This value must be provided by a (todo) EncodingDetectorPlugin (see the [http://issues.apache.org/jira/browse/NUTCH-25 NUTCH-25] issue).
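
As a rough illustration of the issue above, here is a minimal sketch of decoding fetched bytes with a declared encoding before handing the text to a language identifier. It is not Nutch code: the class {{{EncodingAwareDecoder}}}, the method {{{decode}}} and the parameter {{{declaredEncoding}}} are hypothetical names, and the sketch assumes that the {{{content-encoding}}} value is available as a plain charset label. It falls back to UTF-8 when the label is missing or unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch: turns raw fetched bytes into a
// String using the declared encoding, so that the text passed on for
// language identification is decoded correctly.
final class EncodingAwareDecoder {

    private EncodingAwareDecoder() {
    }

    // declaredEncoding is assumed to be the content-encoding metadata value
    // (a charset label such as "ISO-8859-1"); it may be null or empty.
    static String decode(byte[] raw, String declaredEncoding) {
        Charset charset = StandardCharsets.UTF_8; // default when nothing usable is declared
        if (declaredEncoding != null && !declaredEncoding.trim().isEmpty()) {
            try {
                charset = Charset.forName(declaredEncoding.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unknown or malformed label: keep the UTF-8 fallback.
            }
        }
        return new String(raw, charset);
    }
}
```

With this sketch, bytes produced from an ISO-8859-1 page decode correctly only when the label is passed in. Decoding the same bytes with the UTF-8 default replaces the accented characters with replacement characters, which is the kind of input that degrades identification in ''real life''.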