Posted to user@nutch.apache.org by kr...@adv-boeblingen.de on 2014/08/12 15:07:46 UTC
How to recrawl changing the seed.txt list
Hello,
I am using apache-nutch-1.8.
I have an application which should crawl about 50 to 100 URLs.
The problem is that the customer wants to change the URLs from time to time
(and also delete some of them).
What is the correct way to handle this?
I wanted to do it this way:
1. Change the list of URLs in seed.txt
2. Change the list in regex-urlfilter.txt (e.g. add +^http://www.rlp.de/
for every URL)
3. Delete the crawl directory and its subdirectories
4. Delete the Solr index
5. Run a cron job every night: bin/crawl urls/seed.txt crawl
http://localhost:8983/solr 5
Is this o.k.?
Thanks for your help,
Martin
Re: How to recrawl changing the seed.txt list
Posted by Julien Nioche <li...@gmail.com>.
Hi,
Yes, that should be fine. The only thing I would do differently is step 2:
> 2. Change the list in regex-urlfilter.txt (e.g. add +^http://www.rlp.de/
> for every URL)
Instead of specifying all the hostnames one by one, allow any URL in
regex-urlfilter.txt and set the following property to true in
nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
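With db.ignore.external.links doing the host restriction, regex-urlfilter.txt no longer needs one accept rule per hostname. A sketch of the tail of the file, assuming an otherwise default filter (the exclusion rules shown in the comment are the stock ones shipped with Nutch):

```
# ... default exclusions above (file:, ftp:, mailto:, image suffixes, etc.) ...

# accept anything else; db.ignore.external.links keeps the crawl
# limited to the initially injected hosts
+.
```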
This should simplify things a bit.
When calling the crawl script, maybe use a larger value than 5: this is the
number of rounds and does not necessarily correspond to the actual depth of
the URLs. The script will stop when there are no more URLs to put in a
fetchlist, so you might as well use a larger value.
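Putting the steps together, a nightly wrapper for cron could look like the sketch below. The install path, the seed directory layout, and the DRY_RUN switch are illustrative assumptions, not from this thread; adjust them to your setup.

```shell
#!/bin/sh
# recrawl.sh -- hypothetical nightly recrawl wrapper (paths are assumptions).
set -e

NUTCH_HOME=${NUTCH_HOME:-/opt/apache-nutch-1.8}
CRAWL_DIR=${CRAWL_DIR:-$NUTCH_HOME/crawl}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr}
ROUNDS=${ROUNDS:-10}     # number of rounds, not depth; the crawl script
                         # exits early once the fetchlist comes up empty
DRY_RUN=${DRY_RUN:-1}    # default to printing commands; set DRY_RUN=0 in cron

# In dry-run mode print the command instead of executing it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Delete the previous crawl directory and subdirectories (step 3)
run rm -rf "$CRAWL_DIR"

# Clear the Solr index via a delete-by-query update (step 4)
run curl -s "$SOLR_URL/update?commit=true" -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'

# Re-crawl from the (possibly edited) seed directory containing seed.txt (step 5)
run "$NUTCH_HOME/bin/crawl" "$NUTCH_HOME/urls" "$CRAWL_DIR" "$SOLR_URL" "$ROUNDS"
```

A cron entry such as `0 2 * * * DRY_RUN=0 /opt/apache-nutch-1.8/recrawl.sh` would then run the whole cycle every night.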
HTH
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble