Posted to user@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2005/09/23 19:40:40 UTC
SocketTimeoutException
In my crawl of a large number of selected sites, the number of threads
is determined automatically by the number of pages on the fetchlist in
each fetch/updatedb cycle. The maximum number of threads is set to 1000. When
using 1000 threads, I see lots of SocketTimeoutExceptions in fetching
toward the end of the fetch cycle. Any suggestions for reducing the
SocketTimeoutExceptions? I also notice that the SocketTimeoutException
errors are not counted in the error count for the segment status. Why is that?
Relevant parameters set:
http.timeout=10000
http.max.delays=100000
fetcher.threads.fetch = 10 to 1000, depending on the size of the fetchlist
Appreciate your help,
AJ
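For reference, settings like these are normally overridden in Nutch's conf/nutch-site.xml, which takes precedence over nutch-default.xml. A sketch of AJ's values might look like the following (the <nutch-conf> root element and property names match the Nutch 0.7-era configuration files; the descriptions are my own reading of them, not the official ones):

```xml
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml -->
<nutch-conf>
  <property>
    <name>http.timeout</name>
    <value>10000</value>
    <description>HTTP socket timeout, in milliseconds.</description>
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100000</value>
    <description>How many times a fetcher thread will wait for a busy
    host before giving up on a page.</description>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1000</value>
    <description>Number of fetcher threads; AJ varies this between 10
    and 1000 with the fetchlist size.</description>
  </property>
</nutch-conf>
```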
HD question for large DB
Posted by EM <em...@cpuedge.com>.
What would be a good hard-drive setup for a large DB? Once I get into
the range of 0.5-2 million pages, doing anything with the DB becomes
slow. I have a SATA disk with a sustained transfer rate of about 80 MB/s.
What would it take if the DB is to contain 500 million pages? RAID?
A RAID of 4 drives, or 16? Cheap drives, or large drives?
Can anyone who's been through this give me some pointers?
Please note that I don't need to use the same system for
indexing/searching, only for DB operations.
Regards,
EM
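A quick back-of-the-envelope calculation shows why a single disk struggles at this scale. The per-page size below is my own assumption (roughly 10 KB of stored data per page), not a figure from the thread:

```python
def full_scan_hours(pages, kb_per_page=10, mb_per_s=80):
    """Hours needed to read the whole DB once, sequentially,
    at the given sustained transfer rate."""
    total_mb = pages * kb_per_page / 1024.0
    return total_mb / mb_per_s / 3600.0

# 2 million pages: one full pass takes only minutes...
small = full_scan_hours(2_000_000)        # roughly 4 minutes
# ...but 500 million pages means many hours per pass on one 80 MB/s disk,
# and updatedb-style merges read and rewrite the whole DB.
large = full_scan_hours(500_000_000)      # roughly 17 hours

print(f"2M pages:   {small:.2f} h per full scan")
print(f"500M pages: {large:.1f} h per full scan")
```

Note that this only models sequential throughput; random access is dominated by seek time, which striping across N drives (RAID 0/10) does much less to help than it helps sequential passes.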
RE: SocketTimeoutException
Posted by Fuad Efendi <fu...@efendi.ca>.
AJ,
1000 threads sounds good, but at least with the existing J2SE from
Sun it will perform badly!
Preferable: 32 processes with 32 threads each... That is at least my
experience with "The Grinder" and 2 GB of memory (I don't know about
Nutch!)... You should really budget everything: CPU, memory per
thread, ...
1000 threads is too many...
P.S.
Increase "timeout", or decrease "threads".
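Fuad's two suggestions, cap the thread count and use an explicit timeout, can be illustrated outside of Nutch with a minimal Python sketch. This is not how Nutch's fetcher is implemented; the pool size and timeout values are arbitrary stand-ins for fetcher.threads.fetch and http.timeout:

```python
from concurrent.futures import ThreadPoolExecutor
import socket

TIMEOUT_S = 10.0   # analogous to http.timeout=10000 ms
MAX_THREADS = 32   # a bounded pool instead of 1000 raw threads

def probe(addr):
    """Attempt a TCP connect with an explicit timeout; report the outcome."""
    try:
        with socket.create_connection(addr, timeout=TIMEOUT_S):
            return (addr, "ok")
    except socket.timeout:
        return (addr, "timeout")       # the SocketTimeoutException analogue
    except OSError as e:
        return (addr, f"error: {e}")

def probe_all(addrs):
    """Probe all (host, port) pairs with a fixed-size worker pool."""
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        return list(pool.map(probe, addrs))
```

With a bounded pool, a slow host ties up at most one worker for TIMEOUT_S seconds instead of contributing to a pile-up of a thousand blocked threads.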