Posted to user@nutch.apache.org by Byron Miller <by...@yahoo.com> on 2005/10/28 14:38:13 UTC

Indexer Performance - up to 200+ rec/s with Lang identification enabled

051028 083415 DONE indexing segment 20051019000305:
total 100000 records in 520.156 s (192.3077 rec/s).
051028 083415 done indexing
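
As a quick sanity check on the rate in the log above, here is the arithmetic as a small sketch. Only the record count and elapsed time come from the log; the function name is made up for illustration. The exact quotient comes out slightly below the logged value, while dividing by 520 whole seconds reproduces it, so the indexer may be dropping the fractional seconds before dividing. That last part is a guess, not something checked against the Nutch source.

```python
def records_per_second(records: int, seconds: float) -> float:
    """Throughput as records divided by elapsed wall-clock seconds."""
    return records / seconds


# Values taken from the log lines above.
exact = records_per_second(100_000, 520.156)  # elapsed time as logged
whole = records_per_second(100_000, 520)      # elapsed time cut to whole seconds (assumption)

print(f"{exact:.4f} rec/s using 520.156 s")
print(f"{whole:.4f} rec/s using 520 s")  # prints 192.3077, matching the log
```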


I've been doing some testing and have pretty much peaked
at 192-200 rec/s on a 2.8 GHz machine, with lang ident
enabled on 512 bytes of data using 3-grams. After tweaking,
that even exceeds what I was getting before I tried lang ident.

I'm still not seeing any heavy I/O, so I'm going to try to
find where my limits are. After a while, increasing
max this and that makes no difference to performance,
and sometimes even degrades it. I'll try to plot
this out :)

BTW, is this something that could be done during the
fetch process, so that the db contains the language and
it could be used to control fetch list creation to begin
with?

Re: Indexer Performance - up to 200+ rec/s with Lang identification enabled

Posted by Ken Krugler <kk...@transpac.com>.
>051028 083415 DONE indexing segment 20051019000305:
>total 100000 records in 520.156 s (192.3077 rec/s).
>051028 083415 done indexing
>
>I've been doing some testing and have pretty much peaked
>at 192-200 rec/s on a 2.8 GHz machine, with lang ident
>enabled on 512 bytes of data using 3-grams. After tweaking,
>that even exceeds what I was getting before I tried lang ident.

I wonder what's going on with our fetch performance - we're at about 
50 pages/second, on a 3GHz quad CPU Xeon box with SCSI RAID 5 disks 
and a 100Mbps pipe.
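
For a rough sense of whether the pipe itself could be the limit, the sketch below works out how large the average page would have to be for 50 pages/second to fill a 100 Mbps link. It assumes decimal megabits and ignores protocol overhead, and the function name is only illustrative. If typical pages are well under the resulting figure, bandwidth is probably not what is holding the fetcher back.

```python
def bytes_per_page_budget(link_mbps: float, pages_per_second: float) -> float:
    """Average page size (in bytes) at which the given fetch rate would fill the link.

    Assumes 1 Mbps = 1,000,000 bits/s and no protocol overhead.
    """
    link_bytes_per_second = link_mbps * 1_000_000 / 8
    return link_bytes_per_second / pages_per_second


budget = bytes_per_page_budget(100, 50)
print(f"{budget / 1000:.0f} KB per page would saturate the link at 50 pages/s")
```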

>I'm still not seeing any heavy I/O, so I'm going to try to
>find where my limits are. After a while, increasing
>max this and that makes no difference to performance,
>and sometimes even degrades it. I'll try to plot
>this out :)
>
>BTW, is this something that could be done during the
>fetch process, so that the db contains the language and
>it could be used to control fetch list creation to begin
>with?

If I understand your question correctly, you want to focus on 
fetching pages for particular languages, or rather defer fetching of 
pages that aren't in a target language, right?

Once you've parsed a page & identified (to some level of confidence) 
the language, you could use the language to adjust the nextScore 
value for outlinks to pages that don't currently exist. Then in 
FetchListTool use this nextScore value, and provide some topN value 
such that the top links are going to be in your target language.
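
To make the idea concrete, here is a minimal standalone sketch. It is not Nutch code and does not use the real FetchListTool or WebDB classes: the Outlink class, the penalty factor, and the confidence threshold are all hypothetical stand-ins. It only shows the shape of the approach, which is to scale down the score of outlinks found on pages confidently identified as non-target, then take the top N by score.

```python
from dataclasses import dataclass


@dataclass
class Outlink:
    url: str
    next_score: float  # hypothetical stand-in for the nextScore value


def adjust_outlinks(outlinks, page_lang, confidence,
                    target_lang="en", penalty=0.1, min_confidence=0.8):
    """Scale down next_score for outlinks of a page that is confidently
    identified as being in a non-target language.

    penalty and min_confidence are made-up tuning knobs for illustration.
    """
    if confidence >= min_confidence and page_lang != target_lang:
        factor = penalty
    else:
        factor = 1.0  # target language, or not confident enough to penalize
    return [Outlink(o.url, o.next_score * factor) for o in outlinks]


def top_n(candidates, n):
    """Pick the n highest-scoring candidates, as a topN cut-off would."""
    return sorted(candidates, key=lambda o: o.next_score, reverse=True)[:n]


english = adjust_outlinks(
    [Outlink("http://example.com/a", 1.0), Outlink("http://example.com/b", 0.5)],
    page_lang="en", confidence=0.95)
german = adjust_outlinks(
    [Outlink("http://example.org/c", 1.0)],
    page_lang="de", confidence=0.95)
unsure = adjust_outlinks(
    [Outlink("http://example.net/d", 0.7)],
    page_lang="de", confidence=0.4)

fetch_list = top_n(english + german + unsure, 3)
print([o.url for o in fetch_list])
```

Here the link found on the confidently German page drops out of the top three, while the link from the low-confidence page keeps its score. Pages in other languages are deferred rather than excluded, since they can still make the list when topN is large enough.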

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200