Posted to dev@nutch.apache.org by karthik085 <ka...@gmail.com> on 2007/11/07 21:32:14 UTC

db.ignore.internal.links and ranking algorithms

Hi,

I was wondering how db.ignore.internal.links affects rankings under the
PageRank and OPIC algorithms. I searched the forum but couldn't find a
clear-cut answer.

I am using Nutch 0.7.2 to crawl and index a handful of sites. One site has a
lot of pages and interlinks (around 1/3 of my total pages are from this site),
so when I search for something and hit 'Show All Hits', the first 5-10
pages are from this site before any results from other sites are shown.
How will db.ignore.internal.links help in this case?

Of course, I will have to recrawl with nutch-0.9 to use the OPIC algorithm...:-(

Thanks.
-- 
View this message in context: http://www.nabble.com/db.ignore.internal.links-and-ranking-algorithms-tf4767180.html#a13635422
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: db.ignore.internal.links and ranking algorithms

Posted by karthik085 <ka...@gmail.com>.
That clears some of my problems. BTW, I am merging all the segments together.
Thanks.





Re: db.ignore.internal.links and ranking algorithms

Posted by Dennis Kubes <ku...@apache.org>.
I don't believe that you can set it for a single site. Setting the 
variable would affect all sites, but it should increase relevancy if you 
are doing web crawls and not constrained crawls.  This is because, by
default, Nutch scores internal and external links equally.

Are you merging segments or merging segment indexes?  Either way I don't 
think scores are recalculated.  Scores are calculated primarily during the 
parse process (distributing score to outlinks) and during the update db 
process (gathering inlink scores and updating the CrawlDatum).

Dennis Kubes
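
[Illustration] The second of the two steps described above (gathering inlink
scores during the db update) can be sketched roughly as follows. This is a
simplified model, not Nutch code: the function name update_db, the plain
dict standing in for the crawl db, and the (url, score) pairs are all
assumptions made for the example.

```python
from collections import defaultdict


def update_db(crawl_db, contributions):
    """Return a new url -> score mapping after one update pass.

    crawl_db:      dict mapping url -> current score (hypothetical stand-in
                   for the crawl db)
    contributions: iterable of (target_url, score) pairs produced when
                   parsed pages distributed their score to outlinks
    """
    # Sum all inlink contributions per target URL.
    gathered = defaultdict(float)
    for url, score in contributions:
        gathered[url] += score

    # Add the gathered inlink score to each page's existing score;
    # pages seen for the first time start from 0.0 in this sketch.
    updated = dict(crawl_db)
    for url, inlink_score in gathered.items():
        updated[url] = updated.get(url, 0.0) + inlink_score
    return updated


if __name__ == "__main__":
    db = {"a": 1.0}
    print(update_db(db, [("a", 0.5), ("b", 0.25), ("a", 0.25)]))
```

In this model the scores only change when update_db runs, which matches the
point above that merging segments or indexes would not by itself recalculate
them.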


Re: db.ignore.internal.links and ranking algorithms

Posted by karthik085 <ka...@gmail.com>.
Thanks for the quick reply. 

Is there any way I can set this score for one specific site? As I said
earlier, I crawl a handful of sites. One site has a lot of search results
because its pages have high scores (many internal links and, possibly,
anchor pollution). Pages from the other sites do not have many incoming
internal links, and their anchor text is either useless or empty.

At the end, I merge all the crawled segments into one for faster searching.
Won't the scores be recalculated at that point? Setting the
db.score.link.internal variable would then affect all sites, wouldn't it?

Please correct me if I am wrong.





Re: db.ignore.internal.links and ranking algorithms

Posted by Dennis Kubes <ku...@apache.org>.
Well, the short answer is that it doesn't.  Even if you set internal links 
to be ignored, they are still counted in the OPIC scoring, and this 
negatively affects search relevancy.  The way to handle this is to set 
the db.score.link.internal variable to 0.0.  This way only external 
links are counted in OPIC.

I will post a wiki entry about this process soon.

Dennis Kubes
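
[Illustration] To see why setting db.score.link.internal to 0.0 leaves only
external links contributing, here is a minimal sketch of an OPIC-style score
split. It is a simplified model under stated assumptions, not the Nutch
implementation: distribute_score, internal_factor and external_factor are
hypothetical names, and "internal" is assumed to mean "same host as the
source page".

```python
from urllib.parse import urlparse


def distribute_score(page_url, page_score, outlinks,
                     internal_factor=1.0, external_factor=1.0):
    """Split a page's score evenly across its outlinks, then weight each
    share by whether the link stays on the same host (assumption)."""
    if not outlinks:
        return {}

    share = page_score / len(outlinks)
    src_host = urlparse(page_url).hostname

    contributions = {}
    for link in outlinks:
        same_host = urlparse(link).hostname == src_host
        # internal_factor plays the role of db.score.link.internal here.
        factor = internal_factor if same_host else external_factor
        contributions[link] = contributions.get(link, 0.0) + share * factor
    return contributions


if __name__ == "__main__":
    links = ["http://a.example/q", "http://b.example/r"]
    # Default weighting: both links receive the same share.
    print(distribute_score("http://a.example/p", 1.0, links))
    # Internal factor 0.0: only the external link receives any score.
    print(distribute_score("http://a.example/p", 1.0, links,
                           internal_factor=0.0))
```

With the internal factor at 0.0, a site with many pages linking to each
other no longer boosts its own pages in this model; only links from other
hosts pass score along.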
