Posted to user@nutch.apache.org by Ismael <kr...@gmail.com> on 2008/07/07 18:01:03 UTC

Help to get the entire link in the anchor field instead of the anchor to a fetched page.

Hello. I need to get the links that Nutch followed to reach a page: something
like the anchors, but with all of the markup inside the link instead of only
the text of the link.

I don't know whether this can be done by building a plugin, or whether I must
modify the Nutch code to get this information. I have gone through the Nutch
code and have not yet found where this information is collected, but I am
still looking.


As an example, what I need is that, given the following link:

<a href="/main.html" title="Title"><img src="/src.gif" border=0
style="background-position:bottom;"> </a>

when I access the anchor field of the fetched "/main.html" page in the
Nutch index, the value should be the entire <a href...></a> element.


I really only need the <img> tag, so if that is easier to get, that
solution would also help me.
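
To make the desired result concrete, here is a minimal standalone sketch. It
is plain Java using regular expressions, not Nutch code: the class
AnchorMarkup and both method names are hypothetical, and a real solution
inside Nutch would presumably work on the parsed DOM rather than on raw text.
It only illustrates the mapping wanted, from a link target to the full link
markup, and from that markup to its <img> tag.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Nutch.
class AnchorMarkup {
    // Matches a whole <a ...>...</a> element; group 1 captures the href value.
    // Assumes the href is quoted and that anchors are not nested.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>.*?</a\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches a single <img ...> tag.
    private static final Pattern IMG = Pattern.compile(
        "<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Maps each link target to the full markup of the first link pointing at it.
    public static Map<String, String> anchorMarkupByHref(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.putIfAbsent(m.group(1), m.group());
        }
        return result;
    }

    // Returns the first <img> tag inside the given link markup, or null if none.
    public static String firstImgTag(String anchorMarkup) {
        Matcher m = IMG.matcher(anchorMarkup);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/main.html\" title=\"Title\">"
            + "<img src=\"/src.gif\" border=0\n"
            + "style=\"background-position:bottom;\"> </a></p>";
        String markup = anchorMarkupByHref(html).get("/main.html");
        System.out.println(markup);              // the entire <a href...></a> link
        System.out.println(firstImgTag(markup)); // only the <img> tag
    }
}
```

For the example link above, the value stored for "/main.html" would be either
the first string printed (the whole link) or the second (just the <img> tag).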

Any help would be appreciated; thanks for reading.