Posted to user@nutch.apache.org by "Bolle, Jeffrey F." <jb...@mitre.org> on 2007/12/19 16:31:23 UTC

Anchor links

All,
Is there a way to have Nutch (sorry for not being more specific in
terms of the crawler, indexer, parser, etc.) ignore anchor links
internal to the page (but not ignore pages internal to the site)?  I
have some pages being indexed, archives of mailing lists, that have a
whole ton of anchors, and Nutch re-fetches and re-parses the same page
countless times, each time for a different anchor link.  I know there
is the property to ignore internal links, but I want other pages on the
same host to be included, just not self-referencing links within a
page.

Any help would be appreciated.  Thanks.

Jeff

Re: Anchor links

Posted by Brian Whitman <br...@variogr.am>.
On Dec 19, 2007, at 10:31 AM, Bolle, Jeffrey F. wrote:

> All,
> Is there a way to have Nutch (sorry for not being more specific in
> terms of the crawler, indexer, parser, etc.) ignore anchor links
> internal to the page (but not ignore pages internal to the site)?  I
> have some pages being indexed, archives of mailing lists, that have a
> whole ton of anchors, and Nutch re-fetches and re-parses the same page
> countless times, each time for a different anchor link.  I know there
> is the property to ignore internal links, but I want other pages on the
> same host to be included, just not self-referencing links within a
> page.
> page.



In your urlnormalizer regex conf file (regex-normalize.xml) you can
strip the # symbol and everything after it like so:

	<!-- remove anchors, who needs em -->
	<regex>
	   <pattern>\#(.*)</pattern>
	   <substitution></substitution>
	</regex>
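
To see what that rule does to a URL, here is a minimal standalone sketch
that applies the same pattern and an empty substitution with plain
java.util.regex.  It is not Nutch's own normalizer code; the class name,
method name and example.org URLs are made up for illustration.

```java
import java.util.regex.Pattern;

public class AnchorStripDemo {
    // Same expression as the <pattern> element in the rule above.
    // Hypothetical helper: this only mimics the rule, it is not Nutch code.
    private static final Pattern ANCHOR = Pattern.compile("\\#(.*)");

    // Replace the '#' and everything after it with the empty string,
    // which is what an empty <substitution> amounts to.
    public static String stripAnchor(String url) {
        return ANCHOR.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for a mailing list archive page.
        String[] urls = {
            "http://example.org/archive/2007-12.html#msg00123",
            "http://example.org/archive/2007-12.html#msg00456",
            "http://example.org/archive/2007-12.html"
        };
        for (String url : urls) {
            System.out.println(stripAnchor(url));
        }
    }
}
```

All three inputs come out as the same URL, so the page should only be
seen once, while links to other pages on the same host are left alone.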