Posted to dev@nutch.apache.org by Santiago Pérez <el...@gmail.com> on 2009/12/18 12:50:50 UTC

Creating an alternative Linkdb with part of the outlinks

Hi,

I am using Nutch to index websites, and it works well (most of the
time).

I've seen that Nutch extracts the outlinks from the raw HTML of each
parsed page in order to expand the crawl.

I would like to keep this behaviour, but I would also like to extract the
outlinks from a specific part of the web page (for example, only from the
body of a news article) and store them in an alternative LinkDB, so that I
can see how news articles link to, and are linked from, other news articles
within their content.
Can anybody suggest where and how I could add this feature?
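
To make the idea concrete, here is a minimal sketch of the extraction step
only, written against the plain JDK DOM API rather than Nutch itself. It
assumes the page is well-formed XHTML and that the article body sits in an
element with a known id; the class name ScopedOutlinks, the id "news-body"
and the example.com URLs are all made up for illustration. Where this logic
would hook into Nutch, and how the resulting links would be stored in a
second LinkDB, is exactly the open question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects links from one page region.
public class ScopedOutlinks {

    // Depth-first search for the first element whose "id" attribute matches.
    static Element findById(Node node, String id) {
        if (node instanceof Element && id.equals(((Element) node).getAttribute("id"))) {
            return (Element) node;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element hit = findById(children.item(i), id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Returns the href of every <a> below the container; links elsewhere are ignored.
    static List<String> extract(String xhtml, String containerId) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> links = new ArrayList<>();
        Element container = findById(doc.getDocumentElement(), containerId);
        if (container == null) {
            return links; // no such region on this page
        }
        NodeList anchors = container.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            String href = ((Element) anchors.item(i)).getAttribute("href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"menu\"><a href=\"http://example.com/home\">Home</a></div>"
                + "<div id=\"news-body\"><p>See <a href=\"http://example.com/news/2\">"
                + "this story</a>.</p></div>"
                + "</body></html>";
        // Only the link inside the news body is reported, not the menu link.
        System.out.println(extract(page, "news-body")); // prints [http://example.com/news/2]
    }
}
```

Real pages are rarely well-formed XML, so inside Nutch the same walk would
presumably have to run over the DOM that the HTML parser has already built,
not over a re-parsed string.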

Thanks in advance from a newbie ;)