Posted to user@nutch.apache.org by Byron Miller <by...@yahoo.com> on 2005/10/27 17:27:51 UTC

Index performance with language identifier enabled

I'm still running the SVN release and trying to build a
good system with the language identifier enabled, but
I'm running into bad performance.

I get 50 records a second with the language identifier
enabled and nearly 200 without it. I know there is some
overhead for this, but a slowdown this large is drastic
at the scale I'm trying to work with.

Are there any tips or pointers for speeding this up? Has
anyone else benchmarked indexing with and without this
enabled, and if so, how did you tune it?

-byron
http://www.mozdex.com

Re: Index performance with language identifier enabled

Posted by Byron Miller <by...@yahoo.com>.

Thanks for the heads-up on this information! I'll be
sure to let you know how it goes once I've tuned these
parameters.

-byron

--- Jérôme Charron <je...@gmail.com> wrote:

> > Are there any tips or pointers for speeding this up? Has
> > anyone else benchmarked indexing with and without this
> > enabled, and if so, how did you tune it?
>
> Hi Byron,
>
> As described on the Nutch wiki page
> http://wiki.apache.org/nutch/LanguageIdentifierPlugin, there are
> three config parameters you can use to tune the language
> identifier (they are described in the nutch-default.xml file):
> lang.ngram.min.length
> lang.ngram.max.length
> lang.analyze.max.length
>
> You can find some benchmarks on the following page:
> http://wiki.apache.org/nutch/LanguageIdentifierBenchs
> By default, the language identifier uses the same config as the
> one that was previously hard-coded.
> You can improve performance by using only 3-grams for detection
> instead of the default 1-grams through 4-grams:
> lang.ngram.min.length=3
> lang.ngram.max.length=3
> You can also reduce the amount of data analyzed. The default is
> set to 2048 bytes, but experimental results show that 1024 or
> even 512 bytes could be sufficient.
>
> I would greatly appreciate any feedback and
> performance/precision measurements for the language identifier.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

Re: Index performance with language identifier enabled

Posted by Byron Miller <by...@yahoo.com>.

Before, with the Nutch 0.7 SVN defaults:

051027 135317 DONE indexing segment 20051019145225-2:
total 100155 records in 2108.737 s (47.51186 rec/s).
051027 135317 done indexing

After:

051027 142316 DONE indexing segment 20051019145225-3:
total 103838 records in 1413.624 s (73.48762 rec/s).
051027 142316 done indexing

This is using only 3-grams and analyzing 512 bytes.
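
To put the two runs side by side, the reported rates can be pulled
straight out of the log lines. This is only a sketch: the regular
expression and the parse_rates helper are hypothetical and assume the
"DONE indexing segment" line format shown above, with each wrapped log
line rejoined onto a single line.

```python
import re

# The two "DONE indexing" lines from this message, unwrapped.
LOG = """\
051027 135317 DONE indexing segment 20051019145225-2: total 100155 records in 2108.737 s (47.51186 rec/s).
051027 142316 DONE indexing segment 20051019145225-3: total 103838 records in 1413.624 s (73.48762 rec/s).
"""

# Hypothetical pattern, written only against the line format shown above.
PATTERN = re.compile(
    r"segment (?P<segment>\S+): total (?P<records>\d+) records "
    r"in (?P<seconds>[\d.]+) s \((?P<rate>[\d.]+) rec/s\)"
)


def parse_rates(log_text):
    """Return {segment: reported rec/s} for each 'DONE indexing' line."""
    return {m["segment"]: float(m["rate"]) for m in PATTERN.finditer(log_text)}


rates = parse_rates(LOG)
before = rates["20051019145225-2"]  # default settings
after = rates["20051019145225-3"]   # 3-grams only, 512 bytes
speedup = after / before

print(f"{before:.1f} -> {after:.1f} rec/s, {speedup:.2f}x")
```

The two runs cover different segments (-2 and -3), so this is an
indicative comparison rather than a controlled benchmark.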

That's a great improvement, and I will continue to tweak.
I have a few dozen more indexes to run through and merge
before I can actually test the results on this dataset to
see if it's large enough to detect the language precisely.

I was pushing 3600 open files, so I'm going to bump up
some of the other parameters to tune my config even
more.

This is on a P4 HT with 2 GB of RAM and dual SATA
drives.

Re: Index performance with language identifier enabled

Posted by Jérôme Charron <je...@gmail.com>.

> Are there any tips or pointers for speeding this up? Has
> anyone else benchmarked indexing with and without this
> enabled, and if so, how did you tune it?

Hi Byron,

As described on the Nutch wiki page
http://wiki.apache.org/nutch/LanguageIdentifierPlugin, there are three config
parameters you can use to tune the language identifier (they are
described in the nutch-default.xml file):
lang.ngram.min.length
lang.ngram.max.length
lang.analyze.max.length

You can find some benchmarks on the following page:
http://wiki.apache.org/nutch/LanguageIdentifierBenchs
By default, the language identifier uses the same config as the one that was
previously hard-coded.
You can improve performance by using only 3-grams for detection
instead of the default 1-grams through 4-grams:
lang.ngram.min.length=3
lang.ngram.max.length=3
You can also reduce the amount of data analyzed. The default is set to 2048
bytes, but experimental results show that 1024 or even 512 bytes could be
sufficient.
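
A rough way to see why these settings matter is to count the character
n-grams each configuration has to generate. The sketch below is not
Nutch's implementation: char_ngrams and ngram_count are hypothetical
helpers, and it assumes one byte per character and plain sliding-window
n-grams with no word-boundary handling.

```python
def char_ngrams(text, min_len, max_len):
    """Yield every character n-gram of text with min_len <= n <= max_len."""
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def ngram_count(text_len, min_len, max_len):
    """Number of n-grams a text of text_len characters produces."""
    return sum(max(text_len - n + 1, 0) for n in range(min_len, max_len + 1))


# Defaults named in the thread: 1- to 4-grams over 2048 bytes.
default_work = ngram_count(2048, 1, 4)

# Suggested tuning: 3-grams only, over 512 bytes.
tuned_work = ngram_count(512, 3, 3)

print(default_work, tuned_work)
```

Under those assumptions the defaults produce 8186 n-grams per document,
against 510 for 3-grams over 512 bytes. N-gram extraction is only one
part of indexing, so the end-to-end gain will be smaller than that ratio
suggests.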

I would greatly appreciate any feedback and
performance/precision measurements for the language identifier.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/