
Posted to user@nutch.apache.org by Vangelis karv <ka...@hotmail.com> on 2014/05/22 17:59:16 UTC

Importance of Score

(Apache Nutch 2.2.1)

Hi again!
GeneratorJob marks the best topN sites for fetching. Does it choose URLs with the highest score or random URLs? If it chooses randomly, then what's the point of the score field?
Thank you!

Re: Importance of Score

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Vangelis,

> Cons: Scoring is not used for selection.
> Domains (hosts) at the start of a region (mapper input) have the highest chance to get selected.
>
> I guess that the first line is wrong and should be updated.

As far as I can see, that belongs to the section "Things for future development", or rather "Suggestions".
Unless I missed something, it is not relevant for the current 2.x.

Sebastian


On 05/23/2014 09:46 AM, Vangelis karv wrote:
> Thanks Sebastian for your trouble! 
> In http://wiki.apache.org/nutch/Nutch2Crawling , just before the Fetch procedure, it says: 
> 
> Cons: Scoring is not used for selection. Domains (hosts) at the start of a region (mapper input) have the highest chance to get selected.
> 
> I guess that the first line is wrong and should be updated.
> 
> 
> 
>> Date: Thu, 22 May 2014 21:28:10 +0200
>> From: wastl.nagel@googlemail.com
>> To: user@nutch.apache.org
>> Subject: Re: Importance of Score
>>
>> Hi Vangelis,
>>
>>> Does it choose Urls with the highest score
>> Yes, it does. Have a look at generatorSortValue(...) in one of the scoring filter plugins.
>> In case of scoring-opic (activated by default), URLs/docs are simply ranked by score
>> taken from CrawlDb. But other scoring filters may use different strategies to rank
>> and select URLs for fetching. And of course, you are able to adapt it to your own needs
>> by writing a new scoring filter. Finally, scoring filters can be combined by chaining:
>> the initSort parameter is the value returned by the preceding scoring filter.
>>
>> Sebastian
>>
>> On 05/22/2014 05:59 PM, Vangelis karv wrote:
>>> (Apache Nutch 2.2.1)
>>>
>>> Hi again!
>>> GeneratorJob marks the best topN sites for fetching. Does it choose URLs with the highest score or random URLs? If it chooses randomly, then what's the point of the score field?
>>> Thank you!

RE: Importance of Score

Posted by Vangelis karv <ka...@hotmail.com>.

Thanks Sebastian for your trouble! 
In http://wiki.apache.org/nutch/Nutch2Crawling , just before the Fetch procedure, it says: 

Cons: Scoring is not used for selection. Domains (hosts) at the start of a region (mapper input) have the highest chance to get selected.

I guess that the first line is wrong and should be updated.




Re: Importance of Score

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Vangelis,

> Does it choose Urls with the highest score
Yes, it does. Have a look at generatorSortValue(...) in one of the scoring filter plugins.
In case of scoring-opic (activated by default), URLs/docs are simply ranked by score
taken from CrawlDb. But other scoring filters may use different strategies to rank
and select URLs for fetching. And of course, you are able to adapt it to your own needs
by writing a new scoring filter. Finally, scoring filters can be combined by chaining:
the initSort parameter is the value returned by the preceding scoring filter.
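
To illustrate the chaining described above, here is a minimal sketch in plain Java.
It does not use the Nutch classes: the SortFilter interface, the two sample filters
and the starting value of 1.0 are assumptions made for the example, not the actual
plugin API. Only the idea is taken from the text: each filter receives the value
returned by the preceding one as initSort.

```java
import java.util.Arrays;
import java.util.List;

public class SortValueChainSketch {

    /** Hypothetical stand-in for a scoring filter; NOT the real Nutch interface. */
    public interface SortFilter {
        float generatorSortValue(String url, float storedScore, float initSort);
    }

    /** Assumed OPIC-like behaviour: rank by the stored score, scaled by the incoming value. */
    public static final SortFilter SCORE_FILTER =
        (url, storedScore, initSort) -> storedScore * initSort;

    /** Hypothetical custom filter: halves the sort value of URLs with a query string. */
    public static final SortFilter QUERY_PENALTY =
        (url, storedScore, initSort) -> url.contains("?") ? initSort * 0.5f : initSort;

    /** The example chain: score filter first, then the penalty filter. */
    public static List<SortFilter> defaultChain() {
        return Arrays.asList(SCORE_FILTER, QUERY_PENALTY);
    }

    /** Feeds the result of each filter into the next one as initSort. */
    public static float chain(List<SortFilter> filters, String url, float storedScore) {
        float sort = 1.0f; // assumed neutral starting value
        for (SortFilter filter : filters) {
            sort = filter.generatorSortValue(url, storedScore, sort);
        }
        return sort;
    }

    public static void main(String[] args) {
        System.out.println(chain(defaultChain(), "http://example.org/", 2.0f));
        System.out.println(chain(defaultChain(), "http://example.org/?id=1", 2.0f));
    }
}
```

With a stored score of 2.0, the plain URL keeps a sort value of 2.0, while the URL
with a query string drops to 1.0 and would therefore be ranked lower.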

Sebastian


Re: Importance of Score

Posted by Talat Uyarer <ta...@uyarer.com>.

Hi Vangelis,

In Nutch 2.x we use a partitioner for distributing URLs. In the reduce phase
of GeneratorJob, each reducer takes only topN / (number of reducers) URLs. We
don't choose randomly by default, but we don't take the highest scores either.
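
The arithmetic above can be sketched as follows. This is plain Java, not Nutch
code: the class and method names are made up for the example, and it assumes that
each reducer sees its own partition in descending score order and keeps the first
topN / (number of reducers) entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TopNSplitSketch {

    /** Per-reducer quota: topN divided by the number of reduce tasks (integer division). */
    public static long perReducerLimit(long topN, int numReduceTasks) {
        return topN / numReduceTasks;
    }

    /** Keeps the `limit` highest scores of a single partition. */
    public static List<Float> select(List<Float> partitionScores, long limit) {
        List<Float> sorted = new ArrayList<>(partitionScores);
        sorted.sort(Collections.reverseOrder()); // highest score first
        int keep = (int) Math.min(limit, sorted.size());
        return new ArrayList<>(sorted.subList(0, keep));
    }

    public static void main(String[] args) {
        long limit = perReducerLimit(4, 2); // topN = 4, two reducers
        List<Float> first = select(Arrays.asList(9f, 8f, 7f), limit);
        List<Float> second = select(Arrays.asList(3f, 2f, 1f), limit);
        System.out.println(first + " " + second);
    }
}
```

Under these assumptions the two partitions yield [9.0, 8.0] and [3.0, 2.0]. The
score 7.0 is dropped although it belongs to the four highest scores overall, so
the selection is by score within each partition rather than across all URLs.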

Am I wrong, Sebastian?
Talat