Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/05/15 17:00:34 UTC
modifying inbound link text calc
I'm trying to get rid of some spammy sites in our index.
First, I wonder if anyone has any suggestions on changes to the default
install config of Nutch that will help drive better sites to the top and
spammier sites down.
Secondly, I boosted the inbound anchor text config - but if anything
that made things worse. A lot of the spammier sites heavily use search
terms in their internal anchors. So I'm wondering - is there any easy
way to distinguish between anchor text from within the same domain vs.
anchor text from external domains, and give them different weightings?
I expect this isn't the case currently - anyone have any opinions on how
difficult this would be to change?
Thanks,
g.
Re: modifying inbound link text calc
Posted by Andrzej Bialecki <ab...@getopt.org>.
Insurance Squared Inc. wrote:
> I'm trying to get rid of some spammy sites in our index.
> First, I wonder if anyone has any suggestions on changes to the
> default install config of Nutch that will help drive better sites to
> the top and spammier sites down.
What is a "better" site? Depending on how you define this, and how
precise your definition is, you should get clear indications of how to
improve the quality.
>
> Secondly, I boosted the inbound anchor text config - but if anything
> that made things worse. A lot of the spammier sites heavily use
> search terms in their internal anchors. So I'm wondering - is there
> any easy way to distinguish between anchor text from within the same
> domain vs. anchor text from external domains, and give them different
> weightings? I expect this isn't the case currently - anyone have any
> opinions on how difficult this would be to change?
The scoring API (just committed) gives you this option. Please see
ScoringFilter's method indexerScore.
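
As a rough illustration of the weighting idea only, here is a minimal, standalone Java sketch. It does not use the Nutch API, and every name in it (AnchorWeighting, Inlink, anchorBoost, the two weight constants) is hypothetical. It also assumes that "same domain" can be approximated by comparing host names, so www.example.com and blog.example.com would count as different sites; a real implementation would probably want a proper domain comparison. How such a value would be fed into indexerScore is left open here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AnchorWeighting {
    // Hypothetical weights: anchors from the page's own host count for less.
    public static final float INTERNAL_WEIGHT = 0.25f;
    public static final float EXTERNAL_WEIGHT = 1.0f;

    // Hypothetical holder for one inbound link: source URL plus anchor text.
    public static class Inlink {
        final String fromUrl;
        final String anchor;

        public Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Lower-cased host of a URL, or "" if it cannot be parsed.
    public static String host(String url) {
        try {
            String h = URI.create(url).getHost();
            return h == null ? "" : h.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    // True when both URLs parse and share the same host (a simplification
    // of "same domain", as noted above).
    public static boolean sameHost(String a, String b) {
        String ha = host(a);
        String hb = host(b);
        return !ha.isEmpty() && ha.equals(hb);
    }

    // Sums one weight per inlink that has non-empty anchor text, using the
    // internal weight for same-host links and the external weight otherwise.
    public static float anchorBoost(String pageUrl, List<Inlink> inlinks) {
        float boost = 0f;
        for (Inlink in : inlinks) {
            if (in.anchor == null || in.anchor.trim().isEmpty()) {
                continue; // no anchor text, nothing to weight
            }
            boost += sameHost(pageUrl, in.fromUrl) ? INTERNAL_WEIGHT : EXTERNAL_WEIGHT;
        }
        return boost;
    }

    public static void main(String[] args) {
        List<Inlink> inlinks = Arrays.asList(
            new Inlink("http://example.com/b", "cheap insurance"),  // internal
            new Inlink("http://other.org/x", "insurance quotes"),   // external
            new Inlink("http://other.org/y", ""));                  // skipped
        System.out.println(anchorBoost("http://example.com/a", inlinks)); // prints 1.25
    }
}
```

Keeping the two weights as separate constants makes it easy to experiment, for example by setting the internal weight to zero to ignore same-host anchors entirely.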
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com