Posted to user@nutch.apache.org by kr...@adv-boeblingen.de on 2014/08/12 15:07:46 UTC

How to recrawl after changing the seed.txt list

Hello,

I am using apache-nutch-1.8.

I have an application which should crawl about 50 to 100 URLs.

The problem is that the customer wants to change the URLs from time to time
(and also delete some of them).

What is the correct way to do this?

I wanted to do it this way:

1. Change the list of URLs in seed.txt
2. Change the list in regex-urlfilter.txt (add an entry such as
+^http://www.rlp.de/ for every URL)
3. Delete the crawl directory and its subdirectories
4. Delete the Solr index
5. Run a cron job every night:    bin/crawl urls/seed.txt crawl
http://localhost:8983/solr 5
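
A rough sketch of what such a nightly job could look like, reusing the paths
and the Solr URL from the steps above (the delete-by-query request is just one
common way to clear a Solr index and is only an illustration):

#!/bin/sh
# nightly recrawl -- illustrative sketch only
SOLR_URL=http://localhost:8983/solr

# step 3: remove the previous crawl data
rm -rf crawl

# step 4: clear the Solr index with a delete-by-query, then commit
curl "$SOLR_URL/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><query>*:*</query></delete>"

# step 5: inject the current seed list and crawl
bin/crawl urls/seed.txt crawl $SOLR_URL 5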

Is this OK?

Thanks for your help, Martin

Re: How to recrawl after changing the seed.txt list

Posted by Julien Nioche <li...@gmail.com>.
Hi,

Yes, that should be fine. The only thing I would do differently would be:

> 2. Change the list in regex-urlfilter.txt (add an entry such as
> +^http://www.rlp.de/ for every URL)


allow any URL in regex-urlfilter.txt instead of specifying all the hostnames
one by one, and set the following property to true in nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

 This should simplify things a bit.
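
For illustration, with that property set to true, the host-specific lines in
regex-urlfilter.txt are not needed; a minimal sketch of the end of the file
would keep the usual skip rules and just end with the generic accept rule:

# ... keep the standard skip rules above ...

# accept anything else; db.ignore.external.links keeps the crawl
# restricted to the hosts injected from seed.txt
+.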

When calling the crawl script, maybe use a larger value than 5: this is the
number of rounds and does not necessarily correspond to the actual depth of
the URLs. The script will stop when there aren't any more URLs to put in a
fetchlist, so you might as well use a larger value.
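
For example (10 here is just an arbitrary upper bound, not a recommendation):

bin/crawl urls/seed.txt crawl http://localhost:8983/solr 10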

HTH

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble