Posted to user@nutch.apache.org by Tomislav Poljak <tp...@gmail.com> on 2007/12/11 02:08:52 UTC

fetching 1MM pages

Hi,
I have a few questions about fetching 1MM pages. I am trying to fetch
1MM pages on a cluster of 2 machines (EC2 instances), using 4 map and 4
reduce tasks, each fetch task running 200 threads. The fetchlist is
generated with generate.max.per.host=5 and comes out at about 5000 URLs
(so it should cover at least 1000 different hosts). Fetching starts at
about 40 pages/s (4 tasks x 10 pages/s), but after a short while it
drops to 8 pages/s (4 tasks x 2 pages/s) and stays there until the
tasks finish. 8 pages/s is slow, so how can I speed it up? Bandwidth is
not the problem.
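
For context, here is roughly what my overrides look like. This is just a sketch of conf/nutch-site.xml with the numbers described above (property names as in the nutch-default.xml of this era; values are my setup, not recommendations):

```xml
<!-- Sketch of the relevant conf/nutch-site.xml overrides -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>200</value>   <!-- fetcher threads per task -->
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>5</value>     <!-- cap on URLs per host in the fetchlist -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
</configuration>
```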

Also, in hadoop.log each fetch produces a block of about 1000 lines with
this exception:

2007-12-11 00:15:23,994 ERROR http.Http - at java.util.regex.Pattern$Curly.match0(Pattern.java:3773)

it starts with:

2007-12-11 00:15:23,992 ERROR http.Http - java.lang.StackOverflowError
2007-12-11 00:15:23,994 ERROR http.Http - at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
2007-12-11 00:15:23,994 ERROR http.Http - at java.util.regex.Pattern$Curly.match0(Pattern.java:3760)
2007-12-11 00:15:23,994 ERROR http.Http - at java.util.regex.Pattern$Curly.match0(Pattern.java:3773)
2007-12-11 00:15:23,994 ERROR http.Http - at java.util.regex.Pattern$Curly.match0(Pattern.java:3773)

So how can I fix this? The regexp in regex-urlfilter.txt is pretty simple.
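
As I understand the failure mode, java.util.regex matches quantifiers recursively, so even a simple pattern with a quantified group can blow the stack when matched against a very long input (e.g. an extremely long URL). A minimal, self-contained sketch of that effect (the pattern "(a|b)*" and the input length are illustrative only, not the actual rule from regex-urlfilter.txt):

```java
import java.util.regex.Pattern;

public class RegexOverflowDemo {

    // Returns true if matching the given pattern against a string of
    // 'length' repeated 'a' characters throws StackOverflowError --
    // the same error class the Nutch url filter hits on long URLs.
    static boolean causesStackOverflow(String regex, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append('a');
        try {
            Pattern.compile(regex).matcher(sb).matches();
            return false;                  // matched (or failed) without overflow
        } catch (StackOverflowError e) {
            return true;                   // one stack frame per repetition ran out
        }
    }

    public static void main(String[] args) {
        // A quantified alternation recurses once per repetition, so a
        // long enough input exhausts the default thread stack.
        System.out.println(causesStackOverflow("(a|b)*", 1_000_000));
    }
}
```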

Thanks,
      Tomislav