Posted to user@nutch.apache.org by Tom Running <ru...@gmail.com> on 2016/03/01 05:39:41 UTC

Nutch cannot crawl entire website

Hello,

I am using Nutch 2.3.1.

I perform the commands:
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all

The problem is that the crawl data only contains the raw HTML of the first
URL/page. None of the other URLs accumulated by the generate command are
actually crawled.

I cannot get Nutch to crawl the other generated URLs, and I cannot get
Nutch to crawl the entire website either. What options do I need to use
to crawl an entire site?

Does anyone have any insights or recommendations?

Thank you so much for your help,
-T

RE: Nutch cannot crawl entire website

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - I am not familiar with 2.x, but if those are your commands, then you are missing either a parse job or fetcher.parse=true, and you are not performing an updatedb job to write discovered records back to the DB.
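
A minimal sketch of one full crawl round along those lines, assuming the 2.x CLI (-all processes every pending batch; repeat the generate/fetch/parse/updatedb loop once per round to reach pages deeper in the site):

./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
./nutch parse -all
./nutch updatedb -all

The parse step extracts outlinks from the fetched pages, and updatedb writes those discovered records back to the DB so the next generate run can schedule them.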

Markus 


Re: Nutch cannot crawl entire website

Posted by Cihad Guzel <cg...@gmail.com>.
Hi Tom

Please check some of the Nutch limit properties, such as "file.content.limit",
"http.content.limit", or "fetcher.max.crawl.delay".

If the Crawl-Delay in robots.txt is set to a value greater than
fetcher.max.crawl.delay, the fetcher will skip the page.
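
For illustration, these limits can be overridden in conf/nutch-site.xml; the property names are standard Nutch settings, but the values below are only examples:

<!-- example overrides in conf/nutch-site.xml; values are illustrative -->
<property>
  <name>http.content.limit</name>
  <!-- maximum bytes downloaded per document; -1 disables truncation -->
  <value>-1</value>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <!-- skip a page if its robots.txt Crawl-Delay exceeds this many seconds -->
  <value>30</value>
</property>

If a page is truncated by http.content.limit, links in the cut-off part are never discovered, which can also make a crawl stop short of the whole site.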
