Posted to dev@nutch.apache.org by MyD <my...@googlemail.com> on 2009/04/02 15:12:51 UTC

Nutch Topical / Focused Crawl

Hi @ all,

I'd like to turn Nutch into a focused / topical crawler. It's part
of my final year thesis, and I'd like others to be able to benefit
from my work. I started analyzing the code and think I found the
right piece of code; I just wanted to know if I am on the right track.
I think the right place to implement a decision on whether to fetch
further is in the output method of the Fetcher class, each time we
call the collect method of the OutputCollector object.

private ParseStatus output(Text key, CrawlDatum datum, Content content,
    ProtocolStatus pstatus, int status) {
  ...
  output.collect(...);
  ...
}
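
The decision described above could be sketched as a standalone relevance check that gates the call to output.collect(...). This is only an illustration, not Nutch API: the class name, method names, and keyword-matching heuristic are all invented here for the sketch.

```java
// Hypothetical sketch (not Nutch API): a simple keyword-based relevance
// check that could gate the output.collect(...) call above before a page's
// outlinks are followed. All names here are invented for illustration.
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TopicRelevance {
    private final List<String> topicTerms;
    private final int minMatches;

    public TopicRelevance(List<String> topicTerms, int minMatches) {
        this.topicTerms = topicTerms;
        this.minMatches = minMatches;
    }

    /** Returns true if the page text mentions enough topic terms. */
    public boolean isRelevant(String pageText) {
        String text = pageText.toLowerCase(Locale.ROOT);
        int matches = 0;
        for (String term : topicTerms) {
            if (text.contains(term)) {
                matches++;
            }
        }
        return matches >= minMatches;
    }

    public static void main(String[] args) {
        TopicRelevance r = new TopicRelevance(
            Arrays.asList("crawler", "lucene", "search"), 2);
        System.out.println(r.isRelevant("A web crawler built on Lucene search."));  // true
        System.out.println(r.isRelevant("Cooking recipes and gardening tips."));    // false
    }
}
```

In a real focused crawler the check would of course be a trained classifier rather than keyword counting, but the hook point is the same.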

Would you mind letting me know the best way to turn this decision
into a plugin? I was thinking of going a similar way to the scoring
filters. Thanks in advance.

Cheers,
MyD

Re: Nutch Topical / Focused Crawl

Posted by Ken Krugler <kk...@transpac.com>.
>I'd like to turn Nutch into a focused / topical crawler. [...]
>Would you mind letting me know the best way to turn this decision
>into a plugin? I was thinking of going a similar way to the scoring
>filters. Thanks in advance.

Don't have the code in front of me right now, but we did something 
like this for a focused tech pages crawl with Krugle a few years 
back. Our goal was to influence the OPIC scores to ensure that pages 
we thought were likely to be "good" technical pages got fetched 
sooner.

Assuming you're using the scoring-opic plugin, you'd create a 
custom ScoringFilter that gets executed after the scoring-opic plugin.
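
Activating such a plugin means adding it to the plugin.includes property in nutch-site.xml alongside scoring-opic. A possible fragment follows; "scoring-topical" is a made-up plugin id for illustration, and the rest of the value is just a typical set of plugins, not a requirement.

```xml
<!-- nutch-site.xml: hypothetical plugin.includes value activating both
     scoring-opic and a custom scoring plugin ("scoring-topical" is an
     invented id used here only as an example). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|scoring-opic|scoring-topical</value>
</property>
```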

But the actual process of hooking everything up was pretty complicated 
and error-prone, unfortunately. We had to define our own keys for 
storing our custom scores inside of the parse_data Metadata, the 
content Metadata, and the CrawlDb Metadata.

And we had to implement the following methods for our scoring plugin:

setConf()
injectScore()
initialScore()
generateSortValue()
passScoreBeforeParsing()
passScoreAfterParsing()
shouldHarvestOutlinks()
distributeScoreToOutlink()
updateDbScore()
indexerScore()
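
The custom-key bookkeeping described above follows a simple pattern: write the score under a well-known key during parsing, read it back when updating the crawl DB. A self-contained sketch of that pattern is below; it is not the Nutch ScoringFilter API (a plain Map stands in for Nutch's Metadata class, and all names are invented).

```java
// Hypothetical sketch (not the actual Nutch ScoringFilter API): the general
// pattern of stashing a custom score under a well-known metadata key during
// parsing, then reading it back when updating the crawl DB. A plain Map
// stands in for Nutch's Metadata class; all names here are invented.
import java.util.HashMap;
import java.util.Map;

public class CustomScoreKeys {
    // Key under which the custom topical score is stored.
    public static final String TOPIC_SCORE_KEY = "topical.score";

    /** Analogous to passScoreAfterParsing(): compute and store the score. */
    public static void storeScore(Map<String, String> parseMeta, float score) {
        parseMeta.put(TOPIC_SCORE_KEY, Float.toString(score));
    }

    /** Analogous to updateDbScore(): read the stored score back, with a default. */
    public static float readScore(Map<String, String> dbMeta, float defaultScore) {
        String raw = dbMeta.get(TOPIC_SCORE_KEY);
        return raw == null ? defaultScore : Float.parseFloat(raw);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        storeScore(meta, 0.85f);
        System.out.println(readScore(meta, 0.0f));  // prints 0.85
    }
}
```

The error-prone part Ken mentions is keeping the key names and value formats consistent across all the places (parse_data, content, CrawlDb) where the score gets copied.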

-- Ken
-- 
Ken Krugler
+1 530-210-6378