Posted to user@nutch.apache.org by Eric Osgood <er...@lakemeadonline.com> on 2009/10/06 21:33:03 UTC

Targeting Specific Links

Is there a way to inspect the list of links that Nutch finds per page 
and then, at that point, choose which links I want to include or exclude? 
That would be the ideal remedy to my problem.

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering
Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu
eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/eosgood
www.lakemeadonline.com


Re: Targeting Specific Links

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Andrzej,
> 
> Based on what you suggested below, I have begun to write my own scoring 
> plugin:

Great!

> 
> In distributeScoreToOutlinks(), if the link contains the string I'm 
> looking for, I set its score to kept_score and add a flag to the 
> metaData in parseData ("KEEP", "true"). How do I check for this flag in 
> generatorSortValue()? I only see a way to check the score, not a flag.

The flag should have been automagically added to the target CrawlDatum 
metadata after you have updated your crawldb (see the details in 
CrawlDbReducer). Then, in generatorSortValue(), you can check for the 
presence of this flag via datum.getMetaData().
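
Something along these lines - a rough sketch only, assuming the 1.x 
ScoringFilter signature; "KEEP" is just the key name from your example, 
not anything Nutch defines:

// inside your ScoringFilter implementation
// (uses org.apache.hadoop.io.Text, org.apache.hadoop.io.MapWritable,
//  org.apache.nutch.crawl.CrawlDatum)
private static final Text KEEP_KEY = new Text("KEEP");

public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  MapWritable meta = datum.getMetaData();
  if (meta != null && meta.containsKey(KEEP_KEY)) {
    return initSort;        // flagged link: keep its normal sort value
  }
  // unflagged link - but see below, the Generator needs a small patch
  // before Float.MIN_VALUE actually prevents fetching
  return Float.MIN_VALUE;
}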

BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any 
special way ... I thought it did. It's easy to add, though - in 
Generator.java:161 just add this:

if (sort == Float.MIN_VALUE) {
	return; // drop this URL from the segment so it is never generated
}
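
For context, that check would go right after the scoring filters are 
consulted in Selector.map() - roughly like this (quoted from memory, 
variable names approximate):

float sort = 1.0f;
try {
  sort = scfilters.generatorSortValue(key, crawlDatum, sort);
} catch (ScoringFilterException sfe) {
  // keep the default sort value if a scoring filter fails
}
if (sort == Float.MIN_VALUE) {
	return; // skip this entry entirely
}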


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Targeting Specific Links

Posted by Eric Osgood <er...@lakemeadonline.com>.
Also,

In the scoring-links plugin, I set the return value of 
ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all URLs and 
it still fetched everything - maybe Float.MIN_VALUE isn't the correct 
value to return if I want a link never to be fetched?

Thanks,

Eric

On Oct 22, 2009, at 1:10 PM, Eric Osgood wrote:

> Andrzej,
>
> Based on what you suggested below, I have begun to write my own  
> scoring plugin:
>
> In distributeScoreToOutlinks(), if the link contains the string I'm 
> looking for, I set its score to kept_score and add a flag to the 
> metaData in parseData ("KEEP", "true"). How do I check for this flag 
> in generatorSortValue()? I only see a way to check the score, not a 
> flag.
>
> Thanks,
>
> Eric
>
>
> On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:
>
>> Eric Osgood wrote:
>>> Andrzej,
>>> How would I check for a flag during fetch?
>>
>> You would check for a flag during generation - please check  
>> ScoringFilter.generatorSortValue(), that's where you can check for  
>> a flag and set the sort value to Float.MIN_VALUE - this way the  
>> link will never be selected for fetching.
>>
>> And you would put the flag in CrawlDatum metadata when  
>> ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().
>>
>>> Maybe this explanation can shed some light:
>>> Ideally, I would like to check the list of links for each page, 
>>> but I still need a total of X links per page: if I find the links 
>>> I want, I add them to the list up to X; if I don't reach X, I 
>>> add other links until X is reached. This way, I don't waste crawl 
>>> time on non-relevant links.
>>
>> You can modify the collection of target links passed to  
>> distributeScoreToOutlinks() - this way you can affect both which  
>> links are stored and what kind of metadata each of them gets.
>>
>> As I said, you can also use just plain URLFilters to filter out  
>> unwanted links, but that API gives you much less control because  
>> it's a simple yes/no that considers just the URL string. The advantage 
>> is that it's much easier to implement than a ScoringFilter.
>>
>>
>> -- 
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>
> Eric Osgood
> ---------------------------------------------
> Cal Poly - Computer Engineering, Moon Valley Software
> ---------------------------------------------
> eosgood@calpoly.edu, eric@lakemeadonline.com
> ---------------------------------------------
> www.calpoly.edu/~eosgood, www.lakemeadonline.com
>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Targeting Specific Links

Posted by Eric Osgood <er...@lakemeadonline.com>.
Andrzej,

Based on what you suggested below, I have begun to write my own  
scoring plugin:

In distributeScoreToOutlinks(), if the link contains the string I'm 
looking for, I set its score to kept_score and add a flag to the 
metaData in parseData ("KEEP", "true"). How do I check for this flag 
in generatorSortValue()? I only see a way to check the score, not a 
flag.

Thanks,

Eric


On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:

> Eric Osgood wrote:
>> Andrzej,
>> How would I check for a flag during fetch?
>
> You would check for a flag during generation - please check  
> ScoringFilter.generatorSortValue(), that's where you can check for a  
> flag and set the sort value to Float.MIN_VALUE - this way the link  
> will never be selected for fetching.
>
> And you would put the flag in CrawlDatum metadata when  
> ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().
>
>> Maybe this explanation can shed some light:
>> Ideally, I would like to check the list of links for each page, but 
>> I still need a total of X links per page: if I find the links I 
>> want, I add them to the list up to X; if I don't reach X, I add 
>> other links until X is reached. This way, I don't waste crawl time 
>> on non-relevant links.
>
> You can modify the collection of target links passed to  
> distributeScoreToOutlinks() - this way you can affect both which  
> links are stored and what kind of metadata each of them gets.
>
> As I said, you can also use just plain URLFilters to filter out  
> unwanted links, but that API gives you much less control because  
> it's a simple yes/no that considers just the URL string. The advantage 
> is that it's much easier to implement than a ScoringFilter.
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Targeting Specific Links

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Andrzej,
> 
> How would I check for a flag during fetch?

You would check for a flag during generation - please check 
ScoringFilter.generatorSortValue(), that's where you can check for a 
flag and set the sort value to Float.MIN_VALUE - this way the link will 
never be selected for fetching.

And you would put the flag in CrawlDatum metadata when ParseOutputFormat 
calls ScoringFilter.distributeScoreToOutlinks().
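
For example, roughly like this - a sketch against the 1.x interface 
only; wantedString, keptScore and the "KEEP" key are your plugin's own 
names, not anything defined by Nutch (imports: java.util.Collection, 
java.util.Map.Entry, org.apache.hadoop.io.Text):

public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
    ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
    CrawlDatum adjust, int allCount) throws ScoringFilterException {
  for (Entry<Text, CrawlDatum> target : targets) {
    if (target.getKey().toString().contains(wantedString)) {
      // mark the outlink so generatorSortValue() can recognize it later
      target.getValue().setScore(keptScore);
      target.getValue().getMetaData().put(new Text("KEEP"), new Text("true"));
    }
  }
  return adjust;
}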

> 
> Maybe this explanation can shed some light:
> Ideally, I would like to check the list of links for each page, but 
> I still need a total of X links per page: if I find the links I want, I 
> add them to the list up to X; if I don't reach X, I add other links 
> until X is reached. This way, I don't waste crawl time on non-relevant 
> links.

You can modify the collection of target links passed to 
distributeScoreToOutlinks() - this way you can affect both which links 
are stored and what kind of metadata each of them gets.
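
For instance, to keep at most X links per page, preferring the wanted 
ones - a sketch only, where maxLinks and isWanted() stand for your own 
limit and matching logic:

// split targets into wanted and other links, then keep up to maxLinks
List<Entry<Text, CrawlDatum>> wanted = new ArrayList<Entry<Text, CrawlDatum>>();
List<Entry<Text, CrawlDatum>> rest = new ArrayList<Entry<Text, CrawlDatum>>();
for (Entry<Text, CrawlDatum> e : targets) {
  if (isWanted(e.getKey().toString())) wanted.add(e); else rest.add(e);
}
List<Entry<Text, CrawlDatum>> keep = new ArrayList<Entry<Text, CrawlDatum>>();
for (Entry<Text, CrawlDatum> e : wanted) {
  if (keep.size() < maxLinks) keep.add(e);   // wanted links first, up to X
}
for (Entry<Text, CrawlDatum>) e : rest) {
  if (keep.size() < maxLinks) keep.add(e);   // fill the remainder with others
}
targets.clear();
targets.addAll(keep);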

As I said, you can also use just plain URLFilters to filter out unwanted 
links, but that API gives you much less control because it's a simple 
yes/no that considers just the URL string. The advantage is that it's much 
easier to implement than a ScoringFilter.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Targeting Specific Links

Posted by Eric Osgood <er...@lakemeadonline.com>.
Andrzej,

How would I check for a flag during fetch?

Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but 
I still need a total of X links per page: if I find the links I want, 
I add them to the list up to X; if I don't reach X, I add other 
links until X is reached. This way, I don't waste crawl time on 
non-relevant links.

Thanks,

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@calpoly.edu, eric@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/eosgood, www.lakemeadonline.com


On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:

> Eric Osgood wrote:
> Is there a way to inspect the list of links that Nutch finds per 
> page and then, at that point, choose which links I want to include or 
> exclude? That would be the ideal remedy to my problem.
>
> Yes, look at ParseOutputFormat, you can make this decision there.  
> There are two standard extension points where you can hook up - 
> URLFilters and ScoringFilters.
>
> Please note that if you use URLFilters to filter out URL-s too early  
> then they will be rediscovered again and again. A better method to  
> handle this, but also more complicated, is to still include such  
> links but give them a special flag (in metadata) that prevents  
> fetching. This requires that you implement a custom scoring plugin.
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>




Re: Targeting Specific Links

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric Osgood wrote:
> Is there a way to inspect the list of links that Nutch finds per page 
> and then, at that point, choose which links I want to include or exclude? 
> That would be the ideal remedy to my problem.

Yes, look at ParseOutputFormat, you can make this decision there. There 
are two standard extension points where you can hook up - URLFilters and 
ScoringFilters.

Please note that if you use URLFilters to filter out URL-s too early 
then they will be rediscovered again and again. A better method to 
handle this, but also more complicated, is to still include such links 
but give them a special flag (in metadata) that prevents fetching. This 
requires that you implement a custom scoring plugin.
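
If you go the simple URLFilter route, the urlfilter-regex plugin reads 
its rules from conf/regex-urlfilter.txt: each line is '+' (accept) or 
'-' (reject) followed by a regex, and the first matching rule wins. The 
pattern below is only a made-up example:

# accept only links under the section we care about (hypothetical pattern)
+^https?://www\.example\.com/products/
# reject everything else
-.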


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com