
Posted to user@nutch.apache.org by "Håvard W. Kongsgård" <nu...@niap.org> on 2006/10/24 19:06:53 UTC

Nutch slow how to speed up?

I have Nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory). 
Searching with queries like 'China Nuclear Forces' takes 20 – 25 s.

My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564

The filesystem under path '/' is HEALTHY

Re: Nutch slow how to speed up?

Posted by Sami Siren <ss...@gmail.com>.

If the data to be searched lives in DFS, searching will be slow. You 
first need to copy it out to the local file system. Split your data into 
smaller slices, then distribute them evenly across your search nodes.
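
As a rough illustration of the "distribute evenly" step, here is a minimal 
sketch that assigns slices to search nodes so that the total size per node 
stays roughly balanced. Everything named in it is hypothetical and not part 
of Nutch: the function distribute_slices, the slice names, the sizes, and 
the host names search1/search2. It only produces a plan; copying the slices 
out of DFS and onto each node is a separate step.

```python
import heapq
from typing import Dict, List, Tuple


def distribute_slices(slice_sizes: Dict[str, int],
                      nodes: List[str]) -> Dict[str, List[str]]:
    """Assign each slice to the least-loaded node, largest slices first.

    slice_sizes maps a (hypothetical) slice name to its size in bytes.
    nodes lists the (hypothetical) search node host names.
    """
    if not nodes:
        raise ValueError("at least one search node is required")

    # Min-heap of (bytes assigned so far, node name).
    heap: List[Tuple[int, str]] = [(0, node) for node in nodes]
    heapq.heapify(heap)
    plan: Dict[str, List[str]] = {node: [] for node in nodes}

    # Placing the big slices first keeps the final totals closer together.
    ordered = sorted(slice_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, node = heapq.heappop(heap)
        plan[node].append(name)
        heapq.heappush(heap, (load + size, node))
    return plan


if __name__ == "__main__":
    # Illustrative sizes only, not taken from the crawl above.
    slices = {"slice-a": 40, "slice-b": 35, "slice-c": 30,
              "slice-d": 25, "slice-e": 10}
    for node, assigned in distribute_slices(slices, ["search1", "search2"]).items():
        print(node, assigned)
```

The greedy largest-first choice is just one simple heuristic; plain 
round-robin would also do if the slices are all about the same size.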

This part of the process is not that well covered, and I am hoping for a 
lot of improvement in this area from this proposal:

http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/%3c453699EA.3050501@apache.org%3e

--
  Sami Siren



Håvard W. Kongsgård wrote:
> DistributedSearch
> 2x datanodes, 2x Task Trackers
> 
> Sami Siren wrote:
>> You are using DistributedSearch? and local filesystem to store index 
>> and related data?

Re: Nutch slow how to speed up?

Posted by "Håvard W. Kongsgård" <nu...@niap.org>.

DistributedSearch
2x datanodes, 2x Task Trackers

Sami Siren wrote:
> You are using DistributedSearch? and local filesystem to store index 
> and related data?

Re: Nutch slow how to speed up?

Posted by Sami Siren <ss...@gmail.com>.

Are you using DistributedSearch, with the local filesystem storing the 
index and related data?

--
  Sami Siren

