Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2005/06/05 01:16:16 UTC

Re: language identifier

After some long nights benchmarking and profiling the language
identifier plugin, I have just attached a new patch for it to Jira
(http://issues.apache.org/jira/browse/NUTCH-60).
This patch provides some configuration options that let you specify
the size of the data to use for language analysis and the n-gram sizes
to use.
It also provides some optimizations that reduce the processing time
by 20% to 70%, depending on the configuration (size of the data to
process), with an average gain of 25%.
I will post more detailed benchmark results on the Wiki as soon as
possible (http://wiki.apache.org/nutch/LanguageIdentifierBenchs),
along with some possible improvements at
http://wiki.apache.org/nutch/NewLanguageIdentifier.
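
To give a rough idea of what the two kinds of options control, here is
a minimal sketch, not the code from the patch. The class name, method
name and parameter names (analyzeLength, minN, maxN) are made up for
illustration, and the real property names are not given in this
message. It only shows how limiting the amount of text and the n-gram
sizes bounds the work done per document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NGramSketch {

    /**
     * Counts character n-grams in the first analyzeLength characters of text.
     * All parameter names are hypothetical; they stand in for the kind of
     * settings described above (data size and n-gram sizes).
     */
    static Map<String, Integer> countNGrams(String text, int analyzeLength,
                                            int minN, int maxN) {
        if (analyzeLength < 0 || minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("invalid n-gram settings");
        }
        // Only look at a bounded prefix of the document.
        String data = text.length() > analyzeLength
                ? text.substring(0, analyzeLength)
                : text;

        // LinkedHashMap keeps first-seen order, which makes output stable.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= data.length(); i++) {
                counts.merge(data.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Analyze only the first 6 characters, with 1-grams and 2-grams.
        System.out.println(countNGrams("banana bandana", 6, 1, 2));
    }
}
```

With a smaller analyzeLength or a narrower minN..maxN range, the inner
loop runs fewer times, which is why the gain depends on the
configuration.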

Jerome

-- 
http://motrech.free.fr/
http://frutch.free.fr/