Posted to user@nutch.apache.org by Danicela nutch <Da...@mail.com> on 2011/10/04 12:03:05 UTC

Giving priority to seeds

Hi,

 I want to write a ScoringFilter plugin that gives priority to the URLs in my seeds file.

 I mean, I have a crawldb and a seeds file with links. I set topN=5 to test, and I want my seed links to be fetched first, before what is already in the crawldb.

 To do that, I tried implementing the ScoringFilter methods, in particular injectedScore(Text text, CrawlDatum cd), where I call 'cd.setScore(100f)'. The score is set correctly, but it does not seem to be used: my 5-page segment does not contain these links.

 Maybe I did something wrong?

 Thanks in advance.

Re: Giving priority to seeds

Posted by Tim Pease <ti...@gmail.com>.

If your goal is simply to crawl the seed list first, you can use the FreeGenerator tool to create a fetch segment containing just the URLs from the seed list. Assuming you are using Hadoop to run your crawler ...

1) hadoop jar nutch.job org.apache.nutch.tools.FreeGenerator /hdfs/path/to/seeds/dir /hdfs/path/to/segments/dir
2) hadoop jar nutch.job org.apache.nutch.fetcher.Fetcher /hdfs/path/to/segments/dir/20111004123015 -noParsing
3) hadoop jar nutch.job org.apache.nutch.parse.ParseSegment /hdfs/path/to/segments/dir/20111004123015
4) hadoop jar nutch.job org.apache.nutch.crawl.CrawlDb /hdfs/path/to/crawldb /hdfs/path/to/segments/dir/20111004123015

Apologies for being incredibly verbose there. That will fetch all your seed URLs, parse them, and update the crawl database.
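
Steps 2 to 4 above all take the same segment path, which only exists once FreeGenerator has created a timestamped directory under the segments dir. A minimal sketch that wraps the four commands could look like the following. The variable names, the function names and the DRY_RUN switch are illustrative assumptions, not part of Nutch; the paths and the timestamp 20111004123015 are the placeholders from the steps above. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical; adjust the paths to your cluster.
NUTCH_JOB="${NUTCH_JOB:-nutch.job}"
SEEDS="${SEEDS:-/hdfs/path/to/seeds/dir}"
SEGMENTS="${SEGMENTS:-/hdfs/path/to/segments/dir}"
CRAWLDB="${CRAWLDB:-/hdfs/path/to/crawldb}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, anything else = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: build a segment that contains only the seed URLs.
generate_from_seeds() {
  run hadoop jar "$NUTCH_JOB" org.apache.nutch.tools.FreeGenerator "$SEEDS" "$SEGMENTS"
}

# Steps 2-4: fetch, parse, and update the crawldb for one segment path.
process_segment() {
  segment="$1"
  run hadoop jar "$NUTCH_JOB" org.apache.nutch.fetcher.Fetcher "$segment" -noParsing &&
  run hadoop jar "$NUTCH_JOB" org.apache.nutch.parse.ParseSegment "$segment" &&
  run hadoop jar "$NUTCH_JOB" org.apache.nutch.crawl.CrawlDb "$CRAWLDB" "$segment"
}

generate_from_seeds
# In a real run, replace the timestamp with the directory FreeGenerator just created.
process_segment "$SEGMENTS/20111004123015"
```

For a real run, set DRY_RUN=0, call generate_from_seeds first, look up the new segment directory (for example with hadoop fs -ls on the segments dir), and pass that path to process_segment.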

For our crawl setup, we run FreeGenerator each time we create a new batch of segments to fetch and parse. This ensures that we always crawl the home pages of our various websites, since that is where new content is posted each day, so the latest content gets into Nutch/Solr as quickly as possible.

Great question. Hope this helps; and I especially hope it helps you avoid the work of writing your own ScoringFilter plugin!

Blessings,
TwP

Re: Giving priority to seeds

Posted by Julien Nioche <li...@gmail.com>.

You can specify the score of a seed through metadata at injection time, using
nutch.score=xxxx
See https://issues.apache.org/jira/browse/NUTCH-655
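
For illustration, a seed file carrying this metadata could be built as follows. This is a minimal sketch: the ./urls directory, the example URLs and the score of 100 are placeholders, and it assumes the seed file format described in NUTCH-655, where metadata follows the URL as tab-separated key=value pairs.

```shell
#!/bin/sh
# Hypothetical local seed directory; adjust to your setup.
SEED_DIR="${SEED_DIR:-./urls}"
mkdir -p "$SEED_DIR"

# One URL per line, followed by a TAB and key=value metadata.
# nutch.score sets the initial score of the injected entry.
printf 'http://www.example.com/\tnutch.score=100\n' >  "$SEED_DIR/seeds.txt"
printf 'http://www.example.org/\tnutch.score=100\n' >> "$SEED_DIR/seeds.txt"

# Show the result so the tab-separated layout can be checked by eye.
cat "$SEED_DIR/seeds.txt"
```

Injecting that directory afterwards (for example with org.apache.nutch.crawl.Injector, passing the crawldb and the seed directory) should then give those seeds the score set here.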

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com