Posted to user@nutch.apache.org by xiao yang <ya...@gmail.com> on 2010/01/20 09:16:01 UTC

How to change url score?

I've been crawling a group of web sites for some time. Now I want to add a new
site: http://xxx.com
Here is the process:

1. Put xxx.com into a file named urls, and copy that file to Hadoop
2. Run bin/nutch crawl urls -dir crawl -depth 5 -threads 1 -topN 1000

However, the newly added site is not crawled because its score is too low.

URL: http://xxx.com/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Jan 17 14:59:08 CST 2010
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null

How can I change the score manually so this site will be included in the
next crawl round?

Thanks!
Xiao

Re: How to change url score?

Posted by Julien Nioche <li...@gmail.com>.
Hi,

The SVN version of Nutch has a new feature in the Injector that lets
you specify the score of a URL when it is injected (see
https://issues.apache.org/jira/browse/NUTCH-655).
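
As a rough illustration of what a seed entry with a score might look like, here is a minimal Python sketch. It assumes the Injector reads tab-separated key=value metadata after each URL and treats nutch.score and nutch.fetchInterval as reserved keys; that is my understanding of NUTCH-655, so check it against the Injector in your build. The helper name seed_line and the score of 10.0 are made up for the example.

```python
def seed_line(url, score=None, fetch_interval=None, **metadata):
    """Build one seed-file line: the URL followed by tab-separated key=value pairs.

    Assumption: the Injector honours the reserved keys nutch.score and
    nutch.fetchInterval (as described in NUTCH-655); verify for your version.
    """
    fields = [url]
    if score is not None:
        fields.append(f"nutch.score={score}")
    if fetch_interval is not None:
        fields.append(f"nutch.fetchInterval={fetch_interval}")
    # Any other keyword arguments are passed through as custom metadata.
    fields.extend(f"{key}={value}" for key, value in metadata.items())
    return "\t".join(fields)


# Hypothetical score; pick a value higher than the pages already in the crawldb.
print(seed_line("http://xxx.com/", score=10.0))
```

The printed line would replace the bare URL in the urls file before it is copied to Hadoop and injected again.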

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

