Posted to user@nutch.apache.org by Eran Zinman <zz...@gmail.com> on 2010/01/18 12:03:04 UTC
Boost urls to crawl by anchor text
Hi all,
I've created a custom scoring filter plugin that implements the
ScoringFilter interface.
My main goal is this: once a page has been fetched and parsed, I want to
analyze its outlinks and decide which links to follow next. One of the
criteria that helps me decide is the link's anchor text.
For example, if a link on the current page has anchor text that contains
the word "Games", I wish to boost it so that it will be fetched in the
next round.
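To make the rule concrete, here is a minimal sketch in plain Java. It does not use any Nutch classes, and the names AnchorBoost, boost, keyword and factor are made up purely for illustration; the multiplier of 5 is an arbitrary assumption.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: it only illustrates the boost rule.
public class AnchorBoost {

    /**
     * Returns score * factor when the anchor text contains the keyword
     * (case-insensitive); otherwise returns the score unchanged.
     * A null anchor or a null/empty keyword never triggers a boost.
     */
    public static float boost(float score, String anchor, String keyword, float factor) {
        if (anchor == null || keyword == null || keyword.isEmpty()) {
            return score;
        }
        String normalizedAnchor = anchor.toLowerCase(Locale.ROOT);
        String normalizedKeyword = keyword.toLowerCase(Locale.ROOT);
        return normalizedAnchor.contains(normalizedKeyword) ? score * factor : score;
    }

    public static void main(String[] args) {
        // Anchor mentions "Games", so the score is multiplied by the factor.
        System.out.println(boost(1.0f, "Free Online Games", "games", 5.0f));
        // No match, so the score is returned as-is.
        System.out.println(boost(1.0f, "Contact us", "games", 5.0f));
    }
}
```

The sketch covers only the decision itself and assumes the anchor text is already at hand.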
From what I've seen, the *updateDbScore(Text url, CrawlDatum old, CrawlDatum
datum, List<CrawlDatum> inlinked)* method receives only the URL itself, so I
have no access to the link's anchor text there.
Any idea how I can get the anchor text for a given URL inside the
"updateDbScore" method?
Thanks,
Eran