Posted to user@nutch.apache.org by Carl Cerecke <ca...@nzs.com> on 2007/07/09 01:39:42 UTC

Restricting crawl to a certain topic

Hi,

I'm wondering what the best approach is to restrict a crawl to a certain 
topic. I know that I can restrict what is crawled by a regex on the URL, 
but I also need to restrict pages based on their content (whether they 
are on topic or not).

For example, say I wanted to crawl pages about Antarctica. First I start 
off with a handful of pages and inject them into the crawldb, and I 
generate a fetchlist, and can start sucking the pages down. I update the 
crawldb with links from what has just been sucked down, and then during 
the next fetch (and subsequent fetches), I want to filter which pages 
end up in the segment based on their content (using, perhaps some sort 
of antarctica-related-keyword score). Somehow I also need to tell the 
crawldb about the URLs which I've sucked down but aren't 
antarctica-related pages (so we don't suck them down again).

This seems like the sort of problem other people have solved. Any 
pointers? Am I on the right track here? Using Nutch 0.9.

Cheers,
Carl.

Re: Restricting crawl to a certain topic

Posted by Andrzej Bialecki <ab...@getopt.org>.
Carl Cerecke wrote:
> Carl Cerecke wrote:
>> Andrzej Bialecki wrote:
>>> Carl Cerecke wrote:

>> I've given this a crack and it mostly seems to work, except I'm not 
>> sure how to get the score back into the crawldb. After reading the 
>> Javadoc, I figured that passScoreAfterParsing() was the method I need 
>> to implement. All others are just simple one-liners for this case. 
>> Unfortunately, passScoreAfterParsing() is alone in not having a 
>> CrawlDatum argument, so I can't call datum.setScore(); I did notice 
>> that OPICScoringFilter does this in passScoreAfterParsing: 
>> parse.getData().getContentMeta().set(Nutch.SCORE_KEY,  ...); and I 
>> tried that in my own scoring filter, but just get the zero from 
>> datum.setScore(0.0f) in initialScore().


Nutch.SCORE_KEY is only used to pass the score value to outlinks.


>>
>> Couple of questions then:
>> 1. Does it make sense to put the relevancy scoring code into 
>> passScoreAfterParsing()
>> 2. If so, how do I get the score into the crawldb?
>>
>> I'm a bit vague on how all these bits connect together under the hood 
>> at the moment.....
> 
> Spent all day on this, but no luck. I'm sure I'm missing something 
> obvious. Glad for any pointers in the right direction.

The somewhat awkward API for ScoringFilter comes from the fact that 
different data is available at different steps, and similarly, different 
output data is updated at different steps. When passScoreAfterParsing 
executes, we don't update the db.

The only way to update the original db entry is a bit indirect: first, 
you need to create an "adjust" value (using CrawlDatum.STATUS_LINKED) in 
distributeScoreToOutlinks, and then detect this "adjust" value in 
updateDbScore among the other inlinks, and update the CrawlDatum with 
the new score value.
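
Schematically, that round trip might look like the rough, untested 
sketch below. The signatures are approximated from the Nutch 0.9 
ScoringFilter javadoc (0.9 has the per-outlink distributeScoreToOutlink; 
later versions replace it with a collection-based 
distributeScoreToOutlinks), and the "topic.score" metadata key is 
invented here. It assumes passScoreAfterParsing has already stored a 
topical relevance score under that key in the parse metadata:

import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.scoring.ScoringFilterException;

// Rough sketch only; check the ScoringFilter javadoc of your version.
public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
    ParseData parseData, CrawlDatum target, CrawlDatum adjust,
    int allCount, int validCount) throws ScoringFilterException {
  String topicScore = parseData.getParseMeta().get("topic.score");
  if (topicScore == null) return adjust;     // page was never scored
  if (adjust == null) adjust = new CrawlDatum();
  adjust.setStatus(CrawlDatum.STATUS_LINKED);
  // Tag the datum so updateDbScore can tell it apart from real inlinks.
  adjust.getMetaData().put(new Text("topic.score"), new Text(topicScore));
  return adjust;  // the framework writes this back under fromUrl's key
}

public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
    List inlinked) throws ScoringFilterException {
  for (Iterator it = inlinked.iterator(); it.hasNext();) {
    CrawlDatum linked = (CrawlDatum) it.next();
    Text marker = (Text) linked.getMetaData().get(new Text("topic.score"));
    if (marker != null) {
      // This is our own "adjust" entry: push the topical score into the db.
      datum.setScore(Float.parseFloat(marker.toString()));
    }
  }
}

The metadata marker is what lets updateDbScore distinguish the page's 
own "adjust" entry from genuine inlinks, which also arrive with 
STATUS_LINKED.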

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


FW: Restricting crawl to a certain topic

Posted by Milan Krendzelak <mk...@mtld.mobi>.
Hi Carl,
 
how about this:
 
create a new plugin which will run during indexing. Its extension point will be ScoringFilter (this should be set up in plugin.xml).
 
Your plugin will implement the ScoringFilter interface. In your case, I guess, it will be enough to implement indexerScore.
 
A simple implementation of this function looks like this:
 
String scoreFactorStr = parse.getData().getParseMeta().get("Score_Factor");
if (!isOk(scoreFactorStr)) {  // isOk() is your own topical check
  // content is NOT on topic: boost the document down to suppress its score
  return doc.getBoost() * 0.0001f;
}
 
return doc.getBoost();
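
For reference, the plugin.xml mentioned above would look much like the 
descriptor of the bundled scoring-opic plugin. In this rough sketch the 
plugin id, jar name, and class names are placeholders for your own:

<?xml version="1.0" encoding="UTF-8"?>
<!-- sketch of a plugin descriptor; id, library and class are placeholders -->
<plugin id="scoring-topic" name="Topic Scoring Filter"
        version="1.0.0" provider-name="example.org">
   <runtime>
      <library name="scoring-topic.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.example.nutch.scoring.topic"
              name="Topic Scoring Filter"
              point="org.apache.nutch.scoring.ScoringFilter">
      <implementation id="TopicScoringFilter"
                      class="org.example.nutch.scoring.TopicScoringFilter"/>
   </extension>
</plugin>

Remember to add the plugin id to the plugin.includes property in 
conf/nutch-site.xml, otherwise Nutch won't load it.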
 
 
Milan Krendzelak 


________________________________

From: Carl Cerecke [mailto:carl@nzs.com]
Sent: Thu 12/07/2007 05:24
To: nutch-user@lucene.apache.org
Subject: Re: Restricting crawl to a certain topic



Carl Cerecke wrote:
> Andrzej Bialecki wrote:
>> Carl Cerecke wrote:
>>> Hi,
>>>
>>> I'm wondering what the best approach is to restrict a crawl to a
>>> certain topic. I know that I can restrict what is crawled by a regex
>>> on the URL, but I also need to restrict pages based on their content
>>> (whether they are on topic or not).
>>>
>>> For example, say I wanted to crawl pages about Antarctica. First I
>>> start off with a handful of pages and inject them into the crawldb,
>>> and I generate a fetchlist, and can start sucking the pages down. I
>>> update the crawldb with links from what has just been sucked down,
>>> and then during the next fetch (and subsequent fetches), I want to
>>> filter which pages end up in the segment based on their content
>>> (using, perhaps some sort of antarctica-related-keyword score).
>>> Somehow I also need to tell the crawldb about the URLs which I've
>>> sucked down but aren't antarctica-related pages (so we don't suck
>>> them down again).
>>>
>>> This seems like the sort of problem other people have solved. Any
>>> pointers? Am I on the right track here? Using Nutch 0.9.
>>
>> The easiest way to do this is to implement a ScoringFilter plugin,
>> which promotes wanted pages and demotes unwanted ones. Please see
>> Javadoc for the ScoringFilter for details.
>
> I've given this a crack and it mostly seems to work, except I'm not sure
> how to get the score back into the crawldb. After reading the Javadoc, I
> figured that passScoreAfterParsing() was the method I need to implement.
> All others are just simple one-liners for this case. Unfortunately,
> passScoreAfterParsing() is alone in not having a CrawlDatum argument, so
> I can't call datum.setScore(); I did notice that OPICScoringFilter does
> this in passScoreAfterParsing:
> parse.getData().getContentMeta().set(Nutch.SCORE_KEY,  ...); and I tried
> that in my own scoring filter, but just get the zero from
> datum.setScore(0.0f) in initialScore().
>
> Couple of questions then:
> 1. Does it make sense to put the relevancy scoring code into
> passScoreAfterParsing()
> 2. If so, how do I get the score into the crawldb?
>
> I'm a bit vague on how all these bits connect together under the hood at
> the moment.....

Spent all day on this, but no luck. I'm sure I'm missing something
obvious. Glad for any pointers in the right direction.

Cheers,
Carl.



Re: Restricting crawl to a certain topic

Posted by Carl Cerecke <ca...@nzs.com>.
Carl Cerecke wrote:
> Andrzej Bialecki wrote:
>> Carl Cerecke wrote:
>>> Hi,
>>>
>>> I'm wondering what the best approach is to restrict a crawl to a 
>>> certain topic. I know that I can restrict what is crawled by a regex 
>>> on the URL, but I also need to restrict pages based on their content 
>>> (whether they are on topic or not).
>>>
>>> For example, say I wanted to crawl pages about Antarctica. First I 
>>> start off with a handful of pages and inject them into the crawldb, 
>>> and I generate a fetchlist, and can start sucking the pages down. I 
>>> update the crawldb with links from what has just been sucked down, 
>>> and then during the next fetch (and subsequent fetches), I want to 
>>> filter which pages end up in the segment based on their content 
>>> (using, perhaps some sort of antarctica-related-keyword score). 
>>> Somehow I also need to tell the crawldb about the URLs which I've 
>>> sucked down but aren't antarctica-related pages (so we don't suck 
>>> them down again).
>>>
>>> This seems like the sort of problem other people have solved. Any 
>>> pointers? Am I on the right track here? Using Nutch 0.9.
>>
>> The easiest way to do this is to implement a ScoringFilter plugin, 
>> which promotes wanted pages and demotes unwanted ones. Please see 
>> Javadoc for the ScoringFilter for details.
> 
> I've given this a crack and it mostly seems to work, except I'm not sure 
> how to get the score back into the crawldb. After reading the Javadoc, I 
> figured that passScoreAfterParsing() was the method I need to implement. 
> All others are just simple one-liners for this case. Unfortunately, 
> passScoreAfterParsing() is alone in not having a CrawlDatum argument, so 
> I can't call datum.setScore(); I did notice that OPICScoringFilter does 
> this in passScoreAfterParsing: 
> parse.getData().getContentMeta().set(Nutch.SCORE_KEY,  ...); and I tried 
> that in my own scoring filter, but just get the zero from 
> datum.setScore(0.0f) in initialScore().
> 
> Couple of questions then:
> 1. Does it make sense to put the relevancy scoring code into 
> passScoreAfterParsing()
> 2. If so, how do I get the score into the crawldb?
> 
> I'm a bit vague on how all these bits connect together under the hood at 
> the moment.....

Spent all day on this, but no luck. I'm sure I'm missing something 
obvious. Glad for any pointers in the right direction.

Cheers,
Carl.

Re: Restricting crawl to a certain topic

Posted by Carl Cerecke <ca...@nzs.com>.
Andrzej Bialecki wrote:
> Carl Cerecke wrote:
>> Hi,
>>
>> I'm wondering what the best approach is to restrict a crawl to a 
>> certain topic. I know that I can restrict what is crawled by a regex 
>> on the URL, but I also need to restrict pages based on their content 
>> (whether they are on topic or not).
>>
>> For example, say I wanted to crawl pages about Antarctica. First I 
>> start off with a handful of pages and inject them into the crawldb, 
>> and I generate a fetchlist, and can start sucking the pages down. I 
>> update the crawldb with links from what has just been sucked down, and 
>> then during the next fetch (and subsequent fetches), I want to filter 
>> which pages end up in the segment based on their content (using, 
>> perhaps some sort of antarctica-related-keyword score). Somehow I also 
>> need to tell the crawldb about the URLs which I've sucked down but 
>> aren't antarctica-related pages (so we don't suck them down again).
>>
>> This seems like the sort of problem other people have solved. Any 
>> pointers? Am I on the right track here? Using Nutch 0.9.
> 
> The easiest way to do this is to implement a ScoringFilter plugin, which 
> promotes wanted pages and demotes unwanted ones. Please see Javadoc for 
> the ScoringFilter for details.

I've given this a crack and it mostly seems to work, except I'm not sure 
how to get the score back into the crawldb. After reading the Javadoc, I 
figured that passScoreAfterParsing() was the method I need to implement. 
All others are just simple one-liners for this case. Unfortunately, 
passScoreAfterParsing() is alone in not having a CrawlDatum argument, so 
I can't call datum.setScore(); I did notice that OPICScoringFilter does 
this in passScoreAfterParsing: 
parse.getData().getContentMeta().set(Nutch.SCORE_KEY,  ...); and I tried 
that in my own scoring filter, but just get the zero from 
datum.setScore(0.0f) in initialScore().

Couple of questions then:
1. Does it make sense to put the relevancy scoring code into 
passScoreAfterParsing()
2. If so, how do I get the score into the crawldb?

I'm a bit vague on how all these bits connect together under the hood at 
the moment.....

Cheers,
Carl.

Re: Restricting crawl to a certain topic

Posted by Andrzej Bialecki <ab...@getopt.org>.
Carl Cerecke wrote:
> Hi,
> 
> I'm wondering what the best approach is to restrict a crawl to a certain 
> topic. I know that I can restrict what is crawled by a regex on the URL, 
> but I also need to restrict pages based on their content (whether they 
> are on topic or not).
> 
> For example, say I wanted to crawl pages about Antarctica. First I start 
> off with a handful of pages and inject them into the crawldb, and I 
> generate a fetchlist, and can start sucking the pages down. I update the 
> crawldb with links from what has just been sucked down, and then during 
> the next fetch (and subsequent fetches), I want to filter which pages 
> end up in the segment based on their content (using, perhaps some sort 
> of antarctica-related-keyword score). Somehow I also need to tell the 
> crawldb about the URLS which I've sucked down but aren't 
> antarctica-related pages (so we don't suck them down again).
> 
> This seems like the sort of problem other people have solved. Any 
> pointers? Am I on the right track here? Using Nutch 0.9.

The easiest way to do this is to implement a ScoringFilter plugin, which 
promotes wanted pages and demotes unwanted ones. Please see Javadoc for 
the ScoringFilter for details.
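
For instance, the content side of such a filter could compute a crude 
keyword score at parse time and stash it in the parse metadata for the 
later steps to pick up. A rough, untested sketch (the "topic.score" key 
and the keyword list are placeholders, and the method belongs to a class 
implementing org.apache.nutch.scoring.ScoringFilter):

public void passScoreAfterParsing(Text url, Content content, Parse parse)
    throws ScoringFilterException {
  // Score = fraction of our topic keywords found in the parsed text.
  String text = parse.getText().toLowerCase();
  String[] keywords = { "antarctica", "antarctic", "south pole" };
  int hits = 0;
  for (int i = 0; i < keywords.length; i++) {
    if (text.indexOf(keywords[i]) >= 0) hits++;
  }
  float score = (float) hits / keywords.length;  // 0.0 off topic, 1.0 on topic
  parse.getData().getParseMeta().set("topic.score", Float.toString(score));
}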


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Restricting crawl to a certain topic

Posted by Brian Whitman <br...@variogr.am>.
We've been trying to get this done -- check here for a start:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200702.mbox/%3c44E30A14-A697-412F-9C14-BD7731544821@variogr.am%3e


On Jul 8, 2007, at 7:39 PM, Carl Cerecke wrote:

> Hi,
>
> I'm wondering what the best approach is to restrict a crawl to a  
> certain topic. I know that I can restrict what is crawled by a  
> regex on the URL, but I also need to restrict pages based on their  
> content (whether they are on topic or not).
>
> For example, say I wanted to crawl pages about Antarctica. First I  
> start off with a handful of pages and inject them into the crawldb,  
> and I generate a fetchlist, and can start sucking the pages down. I  
> update the crawldb with links from what has just been sucked down,  
> and then during the next fetch (and subsequent fetches), I want to  
> filter which pages end up in the segment based on their content  
> (using, perhaps some sort of antarctica-related-keyword score).  
> Somehow I also need to tell the crawldb about the URLs which I've  
> sucked down but aren't antarctica-related pages (so we don't suck  
> them down again).
>
> This seems like the sort of problem other people have solved. Any  
> pointers? Am I on the right track here? Using Nutch 0.9.
>
> Cheers,
> Carl.