Posted to dev@nutch.apache.org by MyD <My...@googlemail.com> on 2009/03/29 11:39:39 UTC

Nutch Topical / Focused Crawl

Hi @ all,

I'd like to turn Nutch into a focused / topical crawler. It's part of my
final-year thesis. Further, I'd like others to be able to benefit from my
work. I started to analyze the code and think I have found the right piece
of code. I just wanted to know if I am on the right track. I think the right
place to implement the decision whether to fetch further is the output method
of the Fetcher class, every time we call the collect method of the
OutputCollector object.

private ParseStatus output(Text key, CrawlDatum datum, Content content,
ProtocolStatus pstatus, int status) {
...
output.collect(...);
...
}
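
As a minimal sketch of what such a fetch-further decision could look like as
a pluggable component: the code below is plain Java and uses no Nutch classes.
The names FetchDecisionFilter, KeywordTopicFilter and shouldFollowOutlinks, the
keyword-ratio relevance measure and the 0.2 threshold are all hypothetical and
chosen only for illustration. How such a filter would be wired into the Fetcher
or registered as a Nutch extension point is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Demo entry point; nothing here depends on Nutch. */
class FocusedCrawlSketch {
    public static void main(String[] args) {
        // Hypothetical topic description: three keywords and a 20% threshold.
        FetchDecisionFilter filter =
            new KeywordTopicFilter(Arrays.asList("crawler", "search", "index"), 0.2);

        String onTopic = "A focused crawler feeds the search index";
        String offTopic = "Recipes for apple pie and other desserts";

        System.out.println(filter.shouldFollowOutlinks("http://example.org/a", onTopic));
        System.out.println(filter.shouldFollowOutlinks("http://example.org/b", offTopic));
    }
}

/** Hypothetical extension point: decide whether a fetched page's outlinks are worth following. */
interface FetchDecisionFilter {
    boolean shouldFollowOutlinks(String url, String pageText);
}

/** Toy topical filter: the share of tokens that are topic keywords must reach a threshold. */
final class KeywordTopicFilter implements FetchDecisionFilter {
    private final Set<String> keywords = new HashSet<>();
    private final double threshold;

    KeywordTopicFilter(Iterable<String> topicKeywords, double threshold) {
        for (String k : topicKeywords) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
        this.threshold = threshold;
    }

    /** Fraction of word tokens in the text that are topic keywords (0.0 for empty text). */
    double relevance(String pageText) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("\\W+");
        int total = 0;
        int hits = 0;
        for (String t : tokens) {
            if (t.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (keywords.contains(t)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    @Override
    public boolean shouldFollowOutlinks(String url, String pageText) {
        // The URL is unused in this toy version; a real filter might also inspect it.
        return relevance(pageText) >= threshold;
    }
}
```

With the sample texts above, the first call prints true (3 of 7 tokens are
keywords) and the second prints false. Keeping the decision behind a small
interface is what would allow different relevance measures to be swapped in,
in the same spirit as the scoring filters.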

Would you mind letting me know the best way to turn this decision into a
plugin? I was thinking of going a similar way to the scoring filters. Thanks
in advance.

Cheers,
MyD
-- 
View this message in context: http://www.nabble.com/Nutch-Topical---Focused-Crawl-tp22765848p22765848.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Nutch Topical / Focused Crawl

Posted by MyD <My...@googlemail.com>.
I just found an interesting thesis that explains how to modify Nutch into a
focused / topical crawler. It helped me a lot and may be useful to others:

http://wing.comp.nus.edu.sg/publications/theses/2009/markusHaenseThesis.pdf



MyD wrote:
> 
> Hi @ all,
> 
> I'd like to turn Nutch into a focused / topical crawler. I started to
> analyze the code and think I have found the right piece of code. I just
> wanted to know if I am on the right track. I think the right place to
> implement the decision whether to fetch further is the output method of
> the Fetcher class, every time we call the collect method of the
> OutputCollector object.
> 
> private ParseStatus output(Text key, CrawlDatum datum, Content content,
> ProtocolStatus pstatus, int status) {
> ...
> output.collect(...);
> ...
> }
> 
> Would you mind letting me know the best way to turn this decision into
> a plugin? I was thinking of going a similar way to the scoring filters.
> Thanks in advance.
> 
> Cheers,
> MyD
> 

-- 
View this message in context: http://www.nabble.com/Nutch-Topical---Focused-Crawl-tp22765848p25764131.html