Posted to user@nutch.apache.org by caezar <ca...@gmail.com> on 2009/06/25 16:27:18 UTC

How to tell Nutch to crawl ONLY the URLs I've injected

Hi All,

Here is the problem: I need Nutch to crawl ONLY the URLs I've injected.
Currently, by setting db.ignore.external.links to true, I've stopped Nutch
from automatically crawling URLs found as external links on crawled pages.
But it is still crawling URLs found as internal links (it seems that
db.ignore.internal.links does not affect this). I don't want to create URL
filters, because there are millions of URLs and it's not possible to write
regexps for them all. So is there a way to achieve this?
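
For reference, the relevant part of my conf/nutch-site.xml currently looks
roughly like this (just a sketch of the two properties mentioned above; they
go inside the <configuration> element):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <!-- keeps outlinks pointing to other hosts out of the crawldb -->
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
    <!-- tried this too, but it does not seem to affect which
         same-host links get fetched -->
  </property>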


Re: How to tell Nutch to crawl ONLY the URLs I've injected

Posted by "Xiangjun(XJ) Wang" <xw...@experthub.com>.
You can set db.update.additions.allowed to false; then no newly
discovered URLs will be added to the crawldb.
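
Something like this in conf/nutch-site.xml should do it (untested sketch;
the property is defined in nutch-default.xml, you override it locally):

  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
    <!-- with this set to false, updatedb only updates URLs already in
         the crawldb and never adds newly discovered outlinks -->
  </property>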

XJ

caezar wrote:
> Hi All,
>
> Here is the problem: I need Nutch to crawl ONLY the URLs I've injected.
> Currently, by setting db.ignore.external.links to true, I've stopped Nutch
> from automatically crawling URLs found as external links on crawled pages.
> But it is still crawling URLs found as internal links (it seems that
> db.ignore.internal.links does not affect this). I don't want to create URL
> filters, because there are millions of URLs and it's not possible to write
> regexps for them all. So is there a way to achieve this?
>   


Re: How to tell Nutch to crawl ONLY the URLs I've injected

Posted by caezar <ca...@gmail.com>.
That won't quite work for my case: I need to keep the information up to
date, so I'll have to keep performing crawls after indexing.

kevin chen-6 wrote:
> 
> If all you want is to crawl your own URLs, you can do the following:
> (1) inject all URLs
> (2) keep generating segments to fetch, without updating the crawldb.
> (3) after you are done with fetching, update, index, and you are done.
> 



Re: How to tell Nutch to crawl ONLY the URLs I've injected

Posted by kevin chen <ke...@bdsing.com>.
If all you want is to crawl your own URLs, you can do the following:
(1) inject all URLs
(2) keep generating segments to fetch, without updating the crawldb.
(3) after you are done with fetching, run updatedb and index, and you are
done (rough command sketch below).
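
Roughly, with the Nutch 1.x command line, that loop could look like the
sketch below. The directory names (urls, crawl/crawldb, crawl/segments,
crawl/linkdb, crawl/indexes) are just example paths, and exact usage can
differ between releases, so check the output of bin/nutch for your version:

  # (1) inject all URLs once
  bin/nutch inject crawl/crawldb urls

  # (2) generate and fetch a segment; do NOT run updatedb in between,
  #     so no newly discovered links ever enter the crawldb
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment

  # ...repeat step (2) until everything injected has been fetched...

  # (3) when fetching is done, update the crawldb, build the linkdb,
  #     and index (use solrindex instead if you index into Solr)
  bin/nutch updatedb crawl/crawldb crawl/segments/*
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*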

On Thu, 2009-06-25 at 07:27 -0700, caezar wrote:
> Hi All,
> 
> Here is the problem: I need Nutch to crawl ONLY the URLs I've injected.
> Currently, by setting db.ignore.external.links to true, I've stopped Nutch
> from automatically crawling URLs found as external links on crawled pages.
> But it is still crawling URLs found as internal links (it seems that
> db.ignore.internal.links does not affect this). I don't want to create URL
> filters, because there are millions of URLs and it's not possible to write
> regexps for them all. So is there a way to achieve this?