Posted to user@nutch.apache.org by wynz lo <wy...@gmail.com> on 2008/06/18 00:18:26 UTC

problems with link limits

Hi everyone,

I've spent hours searching around trying to solve this and it's starting to
drive me a little nuts. You all might be my last hope in staying out of a
padded room.

I have one small site I'm trying to crawl. The site is a handful of
different JSPs that are essentially templates for people's profiles. The
different profile pages are generated by passing a uri parameter. Nutch is
actually doing a fine job of crawling the smaller pages, but the main index
is causing trouble.

The main index has a single list of 772 links in alphabetical order like
this:

http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2111&name=Adams+Rebecca
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual4421&name=Decker+Alice
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual5602&name=Lincoln+Robert
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2452&name=Small+Harry
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2431&name=Whittaker+Bob
...and so on...

Nutch fetches only about the first 90-110 of these links (usually all the
A's and B's) and then stops. I got really excited when I found that the
db.max.outlinks.per.page setting defaults to 100. However, changing it to -1
or to a high value doesn't fix the problem. When I change it to a small
value, like 15, the fetcher grabs even fewer links, so the setting is
definitely being read.
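
For reference, here is a minimal sketch of the kind of override described
above. It assumes the property is set in conf/nutch-site.xml (the usual place
for overriding values from nutch-default.xml); the file location and the
description text are assumptions, not copied from the actual setup:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location): site-specific overrides -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Default is 100; a negative value is meant to lift the limit -->
    <value>-1</value>
    <description>Maximum number of outlinks processed per page.</description>
  </property>
</configuration>
```

With the value at -1 the result is the same 90-110 links; with 15 it drops
further, as described above.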

Any suggestions? Thanks so much.

Wynz