Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/05/15 17:00:34 UTC

modifying inbound link text calc

I'm trying to get rid of some spammy sites in our index. 

First, I wonder if anyone has any suggestions on changes to the default 
install config of Nutch that will help drive better sites to the top and 
spammier sites down.

Secondly, I boosted the inbound anchor text config - but if anything 
that made things worse.  A lot of the spammier sites heavily use search 
terms in their internal anchors.  So I'm wondering - is there any easy 
way to distinguish between anchor text from within the same domain vs. 
anchor text from external domains, and give them different weightings?  
I expect this isn't the case currently - anyone have any opinions on how 
difficult this would be to change?

Thanks,
g.


Re: modifying inbound link text calc

Posted by Andrzej Bialecki <ab...@getopt.org>.
Insurance Squared Inc. wrote:
> I'm trying to get rid of some spammy sites in our index.
> First, I wonder if anyone has any suggestions on changes to the 
> default install config of Nutch that will help drive better sites to 
> the top and spammier sites down.

What is a "better" site? Depending on how you define this, and how 
precise your definition is, you should get a clear indication of how to 
improve the quality.

>
> Secondly, I boosted the inbound anchor text config - but if anything 
> that made things worse.  A lot of the spammier sites heavily use 
> search terms in their internal anchors.  So I'm wondering - is there 
> any easy way to distinguish between anchor text from within the same 
> domain vs. anchor text from external domains, and give them different 
> weightings?  I expect this isn't the case currently - anyone have any 
> opinions on how difficult this would be to change?

The scoring API (just committed) gives you this option. Please see 
the indexerScore method of ScoringFilter.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
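
To make the internal vs. external anchor idea from this thread concrete, here is a minimal standalone sketch. It does not call any Nutch classes and does not reproduce the actual signature of ScoringFilter's indexerScore; the class AnchorWeighting, the Inlink holder, and the two weight constants are hypothetical names and values chosen for illustration. It also compares hosts rather than registered domains, so www.example.com and example.com would count as different sites. Logic along these lines could presumably be applied inside a scoring plugin, but that wiring is an assumption and is not shown.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class AnchorWeighting {
    // Hypothetical weights for illustration only; these are not Nutch defaults.
    static final float INTERNAL_WEIGHT = 0.2f;
    static final float EXTERNAL_WEIGHT = 1.0f;

    /** One inbound link: the URL it comes from and its anchor text. */
    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    /** Returns the lower-cased host of a URL, or "" if it cannot be parsed. */
    static String host(String url) {
        try {
            String h = URI.create(url).getHost();
            return h == null ? "" : h.toLowerCase();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    /** True when the link source and the target page share the same host. */
    static boolean isInternal(String pageUrl, String fromUrl) {
        String pageHost = host(pageUrl);
        return !pageHost.isEmpty() && pageHost.equals(host(fromUrl));
    }

    /** Sums a per-link weight, counting same-host anchors less than external ones. */
    static float anchorWeight(String pageUrl, List<Inlink> inlinks) {
        float total = 0f;
        for (Inlink in : inlinks) {
            total += isInternal(pageUrl, in.fromUrl) ? INTERNAL_WEIGHT : EXTERNAL_WEIGHT;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Inlink> inlinks = Arrays.asList(
            new Inlink("http://spam.example/a.html", "cheap insurance"),
            new Inlink("http://other.example/links.html", "insurance quotes"));
        System.out.println(anchorWeight("http://spam.example/index.html", inlinks));
    }
}
```

With weights like these, a site that repeats search terms in its own navigation anchors gains far less than one that is linked with those terms from other hosts. The right ratio is something to tune against your own index.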