Posted to user@nutch.apache.org by Xiao Li <sh...@gmail.com> on 2012/02/06 06:07:08 UTC

Just fetch a specified URL list

I have compiled a URL list (1 million URLs) and want Nutch to crawl only
these URLs. How can I do it? I have tried specifying the parameters
"-depth 1 -topN 1000000", but Nutch still crawls some URLs that are not on
the list.

Re: Just fetch a specified URL list

Posted by ka...@plutoz.com.
Put this in your nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
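
To make the effect of this setting concrete, here is a rough Python sketch of the host comparison the description implies. It is a simplified model, not Nutch's actual code; the function name and the example URLs are made up for illustration. Note that a same-host link that is not on the seed list still passes, so the setting limits the crawl to the injected hosts rather than to the exact URL list.

```python
from urllib.parse import urlparse


def keep_internal_outlinks(page_url, outlinks):
    """Keep only outlinks whose host equals the host of the page they were found on.

    Hypothetical helper that mimics what db.ignore.external.links=true is
    described as doing; it is not part of Nutch.
    """
    page_host = urlparse(page_url).hostname
    return [link for link in outlinks if urlparse(link).hostname == page_host]


# Illustrative URLs only.
page = "http://example.com/a.html"
outlinks = [
    "http://example.com/b.html",  # same host: kept, even if not on the seed list
    "http://other.org/c.html",    # external host: dropped
]
print(keep_internal_outlinks(page, outlinks))
```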

-----Original Message-----
From: "Markus Jelsma" <ma...@openindex.io>
Sent: Monday, February 6, 2012 7:46am
To: user@nutch.apache.org
Subject: Re: Just fetch a specified URL list

If you have them in some text file(s) then use the freegenerator to generate a 
segment.

-- 
Markus Jelsma - CTO - Openindex
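
For reference, the freegenerator suggestion might look roughly like the sketch below. It only builds and prints the command lines; it does not run Nutch. The assumptions: the launcher is at bin/nutch, the subcommands are "freegen <inputDir> <segmentsDir>" and "fetch <segment>" as in Nutch 1.x (check the usage output of bin/nutch for your version), and the directory names and the segment name are placeholders.

```python
NUTCH = "bin/nutch"  # assumed path to the Nutch launcher script


def freegen_command(url_dir, segments_dir):
    """Command that generates a segment directly from text files of URLs.

    Assumed usage: bin/nutch freegen <inputDir> <segmentsDir>
    """
    return [NUTCH, "freegen", str(url_dir), str(segments_dir)]


def fetch_command(segment):
    """Command that fetches one generated segment.

    Assumed usage: bin/nutch fetch <segment>
    """
    return [NUTCH, "fetch", str(segment)]


# Placeholder paths: "urls" holds the text file(s) with the URL list, and
# the segment name stands in for whatever directory freegen creates.
print(" ".join(freegen_command("urls", "crawl/segments")))
print(" ".join(fetch_command("crawl/segments/20120206074600")))
```

Since the segment is built straight from the text files, only the listed URLs are scheduled for that fetch. The lists can be passed to subprocess.run if you want to drive Nutch from a script.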


