Posted to user@nutch.apache.org by Robert Scavilla <rs...@gmail.com> on 2018/05/09 17:08:46 UTC

Nutch 1.14 not crawling all links?

 Hello and thank you for your help. I'm confused why Nutch 1.14 (I've had
the same issues with earlier versions) is not crawling full websites. I set
the number of rounds to a generous number, and the crawl quits without
crawling the whole site, with the message "No New Links Found". This happens
even when I use the sitemap.xml as a seed URL.

Any help is greatly appreciated.

Best,
...bob

Re: Nutch 1.14 not crawling all links?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Bob,

it's impossible to make any diagnosis without the full log files,
the complete configuration, and a detailed description of what is missing.

It could be a bug, of course. But it's more likely a configuration issue,
you should check the log files. Also have a look at:
- the robots.txt of the crawled sites
- your URL filters
- http.content.limit

These are often the reason why links are not found or not fetched.
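
As a minimal sketch of the http.content.limit point: pages larger than the
limit are truncated before parsing, so outlinks in the cut-off portion are
never discovered. The override below for conf/nutch-site.xml assumes the
default of 65536 bytes is too small for your pages; the chosen value is
only an example:

```xml
<!-- conf/nutch-site.xml (sketch): raise the per-page download limit
     so large pages are not truncated before their outlinks are parsed.
     Nutch's default is 65536 bytes; -1 disables the limit entirely. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```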


> even when I use the sitemap.xml as a seed url.

You need to use the SitemapProcessor
  bin/nutch sitemap
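
Roughly like this (a sketch only; the crawl/crawldb and urls/ paths are
placeholders for your own layout, and the exact options may differ by
version, so check `bin/nutch sitemap` with no arguments for the usage):

```
# Inject regular seeds into the CrawlDb first
bin/nutch inject crawl/crawldb urls/

# Then feed sitemap URLs to the SitemapProcessor, which adds the
# URLs listed in the sitemaps to the CrawlDb for the next generate round
bin/nutch sitemap crawl/crawldb -sitemapUrls urls/sitemaps/
```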

Best,
Sebastian
