Posted to user@nutch.apache.org by Eran Zinman <zz...@gmail.com> on 2010/01/18 12:03:04 UTC

Boost urls to crawl by anchor text

Hi all,

I've created a custom scoring filter plugin which implements the
ScoringFilter interface.

My main goal is this: once a certain page is fetched and parsed, I wish to
analyze its outlinks and decide which links to follow next. One of the
criteria that helps me decide is the link's anchor text.

For example, if a certain link from the current page has anchor text that
contains the word "Games", I wish to boost it so it will be fetched in the
next round.
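
To make the rule concrete, here is a rough sketch in plain Java with no Nutch
classes involved. The class name, the method names and the boost factor are
all placeholders made up for illustration, and the case-insensitive match is
an assumption, not something Nutch provides:

```java
import java.util.Locale;

public class AnchorBoost {
    // Hypothetical multiplier chosen for illustration; not a Nutch setting.
    public static final float BOOST_FACTOR = 2.0f;

    // Returns true if the anchor text contains the keyword, ignoring case.
    // A missing anchor or an empty keyword never matches.
    public static boolean matches(String anchor, String keyword) {
        if (anchor == null || keyword == null || keyword.isEmpty()) {
            return false;
        }
        return anchor.toLowerCase(Locale.ROOT)
                     .contains(keyword.toLowerCase(Locale.ROOT));
    }

    // Multiplies the score by BOOST_FACTOR when the anchor matches,
    // otherwise returns the score unchanged.
    public static float boostedScore(float score, String anchor, String keyword) {
        return matches(anchor, keyword) ? score * BOOST_FACTOR : score;
    }

    public static void main(String[] args) {
        System.out.println(boostedScore(1.0f, "Free Games Online", "Games"));
        System.out.println(boostedScore(1.0f, "Contact us", "Games"));
    }
}
```

The part still missing is where inside the scoring filter the anchor text is
available so that a rule like this can be applied.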

From what I've seen, the *updateDbScore(Text url, CrawlDatum old, CrawlDatum
datum, List<CrawlDatum> inlinked)* function receives only the URL text, and I
have no access to the URL's anchor text.

Any idea how I can get the anchor text of a certain URL in the
"updateDbScore" function?

Thanks,
Eran