Posted to user@nutch.apache.org by Ismael <kr...@gmail.com> on 2008/07/07 18:01:03 UTC
Hello. I need to get the links that Nutch followed to reach a page: something
like the anchors, but with the full markup of the link instead of only the
text of the link.
I don't know whether this can be done by building a plugin, or whether I must
modify the Nutch code to get this information. I have gone through the Nutch
code and have not yet found where this information is collected, but I am
still looking.
As an example, what I need is this. Given the following link:
<a href="/main.html" title="Title"><img src="/src.gif" border=0
style="background-position:bottom;"> </a>
when I access the anchor field of the fetched "/main.html" page in the
Nutch index, the text should be the entire <a href...></a> link.
I really only need the <img> tag, so if that is easier to get, that
solution would also help me.
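To make the requirement concrete, here is a rough sketch in plain Java of the
output I am after. It is not Nutch code and uses no Nutch API: the class and
method names (LinkMarkupExtractor, anchorsFor, imagesIn) are made up for
illustration, and the regular expressions are an assumption that only holds
for simple, non-nested markup like the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: pulls raw link markup out of an HTML string.
class LinkMarkupExtractor {

    // Matches a whole <a ...>...</a> element; group 1 is the quoted href value.
    // Assumption: anchors are not nested and href is quoted.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>.*?</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches a single <img ...> tag, including one that spans several lines.
    private static final Pattern IMG = Pattern.compile(
            "<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the complete markup of every link in html whose href equals targetHref.
    static List<String> anchorsFor(String html, String targetHref) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            if (m.group(1).equals(targetHref)) {
                result.add(m.group());
            }
        }
        return result;
    }

    // Returns only the <img> tags found inside one link's markup.
    static List<String> imagesIn(String anchorMarkup) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG.matcher(anchorMarkup);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Calling anchorsFor(pageHtml, "/main.html") on the page that contains the link
would return the whole <a href...></a> element, and imagesIn(...) on that
result would return just the <img> tag. What I am missing is the place in
Nutch where something equivalent could be hooked in.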
Any help would be appreciated; thanks for reading.