Posted to user@nutch.apache.org by Vinsil <ga...@gmail.com> on 2006/09/06 15:48:35 UTC

Custom outlink scoring (0.8)

Hi,

I'd like to implement a custom "topical crawler" using Nutch 0.8. After
digging *a bit* into the source code, I'm stuck when it comes to Hadoop.
Unfortunately, I don't have time now to dive into it.

The main idea would be to score the Outlinks based on their topical
relevance to a given subject so they can be ordered by this "relevance
score" during the next fetchlist generation  (using a custom ScoringFilter).
This would lead to a "best-first" fetching strategy with "best" meaning
"more relevant". As the scoring of an outlink would be partly based upon
local textual context around the outlink, the ideal place to compute this
score should (??) be during the parsing of the surrounding page.

A way to do this might be to:
  - compute and add a score metadata to the Outlinks during parsing.
  - retrieve that score in a custom ScoringFilter during fetchlist
generation.
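
To make the "best-first" idea concrete, here is a minimal in-memory sketch, assuming each candidate URL already carries a relevance score. It is not how the Nutch generator works (that is a Hadoop job over the CrawlDB); the class BestFirstFrontier and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of best-first ordering; not part of Nutch.
class BestFirstFrontier {
    // A URL paired with its (assumed, precomputed) topical relevance.
    static class ScoredUrl {
        final String url;
        final float relevance;
        ScoredUrl(String url, float relevance) {
            this.url = url;
            this.relevance = relevance;
        }
    }

    // Max-heap: the most relevant URL is polled first.
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
        Comparator.comparingDouble((ScoredUrl s) -> s.relevance).reversed());

    void add(String url, float relevance) {
        queue.add(new ScoredUrl(url, relevance));
    }

    // Returns up to n URLs, most relevant first.
    List<String> nextFetchlist(int n) {
        List<String> out = new ArrayList<>();
        while (out.size() < n && !queue.isEmpty()) {
            out.add(queue.poll().url);
        }
        return out;
    }

    // Small demo with made-up URLs and scores.
    static List<String> demo() {
        BestFirstFrontier frontier = new BestFirstFrontier();
        frontier.add("http://example.org/a", 0.2f);
        frontier.add("http://example.org/b", 0.9f);
        frontier.add("http://example.org/c", 0.5f);
        return frontier.nextFetchlist(2);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```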

From what I've understood, the first step doesn't seem possible in Nutch 0.8
(??).
What would be the right way to implement such a behaviour?
Is it possible by creating a pair of custom HtmlParseFilter/ScoringFilter? 

Thanks a lot for your answers

Please accept my apologies...
  - ...if I'm missing the point here, but I'm new to Nutch.
  - ...for my poor English

Vinsil
-- 
View this message in context: http://www.nabble.com/Custom-outlink-scoring-%280.8%29-tf2227127.html#a6171789
Sent from the Nutch - User forum at Nabble.com.

Re: Custom outlink scoring (0.8)

Posted by Vinsil <ga...@gmail.com>.

> I don't know your requirements - it's up to you to decide what you 
> want to achieve.
One metadata per link is exactly what I need. I just asked as a sanity
check.

> Well, if you add kilobytes of metadata per CrawlDatum, then yes, it will 
> considerably slow down the processing, because of the increased amount 
> of data to transfer and process. Other than that - no.
I was thinking more about "Nutch-related" consequences in parts of the code I
don't know. As the temporary metadata will be numerous but light, the
solution you proposed is just perfect for what I need to do.

Thanks a lot for this workaround and for the additional information; it is
of *great* help.

Best regards,
vinsil

Re: Custom outlink scoring (0.8)

Posted by Andrzej Bialecki <ab...@getopt.org>.

Vinsil wrote:
> Thanks for this sweet workaround and all my apologies for the delay.
>
>   
>> You can use a workaround: prepare necessary metadata in a HtmlParseFilter,
>> which has access to the full DOM tree, and put it into ParseData.metadata.
>>     
> Using the surrounding page's metadata to pass data about the outlinks sounds
> like a *very* nice workaround to me.
> Would adding one metadata per Outlink make sense? 
>   

I don't know your requirements - it's up to you to decide what you want 
to achieve.

> These metadata could be removed in ScoringFilter.passScoreAfterParsing (I
> guess...). Their number should also be limited using
> db.max.outlinks.per.page.
> Wouldn't there be ugly consequences of adding "so many" metadata even
> temporarily?
>   

Well, if you add kilobytes of metadata per CrawlDatum, then yes, it will 
considerably slow down the processing, because of the increased amount 
of data to transfer and process. Other than that - no.

>   
>> ... HtmlParseFilter which has access to the full DOM tree
>>     
> Is it through the DocumentFragment object that is passed to
> HtmlParseFilter.parse? 
>   

Yes.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Custom outlink scoring (0.8)

Posted by Vinsil <ga...@gmail.com>.

Thanks for this sweet workaround and all my apologies for the delay.

> You can use a workaround: prepare necessary metadata in a HtmlParseFilter,
> which has access to the full DOM tree, and put it into ParseData.metadata.
Using the surrounding page's metadata to pass data about the outlinks sounds
like a *very* nice workaround to me.
Would adding one metadata per Outlink make sense? 
These metadata could be removed in ScoringFilter.passScoreAfterParsing (I
guess...). Their number should also be limited using
db.max.outlinks.per.page.
Wouldn't there be ugly consequences of adding "so many" metadata even
temporarily?

> ... HtmlParseFilter which has access to the full DOM tree
Is it through the DocumentFragment object that is passed to
HtmlParseFilter.parse? 

Thanks,
Best regards,

vinsil

Re: Custom outlink scoring (0.8)

Posted by Andrzej Bialecki <ab...@getopt.org>.

Vinsil wrote:
> Hi Andrzej,
>
> Thanks *a lot* for your prompt answer.
>
>   
>> You can use ScoringFilter.distributeScoreToOutlink to also modify the
>> target CrawlDatum, e.g. store some metadata.
>
> It sounds like a really nice solution but it seems like I cannot access the
> surrounding page textual content (and ideally the associated DOM tree)
> inside ScoringFilter.distributeScoreToOutlink, only its parsedata...  
> I would need that information to be able to compute the "relevance score"
> for the outlinks. But maybe I am missing the point here?
>   

You are correct, DOM tree is not available through this API. You can use 
a workaround: prepare necessary metadata in a HtmlParseFilter, which has 
access to the full DOM tree, and put it into ParseData.metadata.
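
A minimal sketch of the parse-time half of this workaround, under stated assumptions: it walks a standard org.w3c.dom tree, scores each link from the text of its parent element, and stores one entry per link in a plain Map that stands in for the page's parse metadata. The class name, the "outlink.score." key prefix, the term-matching relevance function and the maxLinks cap are all hypothetical; wiring this into an actual HtmlParseFilter is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper; the Map stands in for the page's parse metadata.
class OutlinkContextScorer {
    // Assumed key prefix, not defined by Nutch.
    static final String KEY_PREFIX = "outlink.score.";

    // Fraction of topic terms that occur in the text (case-insensitive).
    static float relevance(String text, String[] topicTerms) {
        if (topicTerms.length == 0) return 0f;
        String lower = text.toLowerCase();
        int hits = 0;
        for (String term : topicTerms) {
            if (lower.contains(term.toLowerCase())) hits++;
        }
        return (float) hits / topicTerms.length;
    }

    // Recursively visits the tree; for each <a href>, scores the text of
    // its parent element and records it. Stops adding once the map holds
    // maxLinks entries (the map is assumed to hold only these entries).
    static void scoreLinks(Node node, String[] topicTerms,
                           Map<String, String> metadata, int maxLinks) {
        if (metadata.size() >= maxLinks) return;
        if (node instanceof Element) {
            Element el = (Element) node;
            if (el.getTagName().equalsIgnoreCase("a") && el.hasAttribute("href")) {
                Node parent = el.getParentNode();
                String context = (parent instanceof Element)
                    ? parent.getTextContent() : el.getTextContent();
                float score = relevance(context, topicTerms);
                metadata.put(KEY_PREFIX + el.getAttribute("href"), Float.toString(score));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            scoreLinks(children.item(i), topicTerms, metadata, maxLinks);
        }
    }

    // Demo on a small, well-formed, made-up page.
    static Map<String, String> demo() {
        String page = "<div><p>Read about <a href=\"http://example.org/nutch\">"
            + "crawler scoring</a> in search engines.</p>"
            + "<p><a href=\"http://example.org/cats\">cat pictures</a></p></div>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> metadata = new LinkedHashMap<>();
            scoreLinks(doc, new String[] {"crawler", "search"}, metadata, 100);
            return metadata;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping one short string per link keeps the per-page metadata small, and the cap plays the same role as a per-page outlink limit.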

-- 
Best regards,
Andrzej Bialecki     <><

Re: Custom outlink scoring (0.8)

Posted by Vinsil <ga...@gmail.com>.

Hi Andrzej,

Thanks *a lot* for your prompt answer.

> You can use ScoringFilter.distributeScoreToOutlink to also modify the
> target CrawlDatum, e.g. store some metadata.

It sounds like a really nice solution but it seems like I cannot access the
surrounding page textual content (and ideally the associated DOM tree)
inside ScoringFilter.distributeScoreToOutlink, only its parsedata...  
I would need that information to be able to compute the "relevance score"
for the outlinks. But maybe I am missing the point here?



Re: Custom outlink scoring (0.8)

Posted by Andrzej Bialecki <ab...@getopt.org>.

Vinsil wrote:
> Hi,
>
> I'd like to implement a custom "topical crawler" using Nutch 0.8. After
> digging *a bit* into the source code, I'm stuck when it comes to Hadoop.
> Unfortunately, I don't have time now to dive into it.
>
> The main idea would be to score the Outlinks based on their topical
> relevance to a given subject so they can be ordered by this "relevance
> score" during the next fetchlist generation  (using a custom ScoringFilter).
> This would lead to a "best-first" fetching strategy with "best" meaning
> "more relevant". As the scoring of an outlink would be partly based upon
> local textual context around the outlink, the ideal place to compute this
> score should (??) be during the parsing of the surrounding page.
>
> A way to do this might be to:
>   - compute and add a score metadata to the Outlinks during parsing.
>   - retrieve that score in a custom ScoringFilter during fetchlist
> generation.
>
> From what I've understood, the first step doesn't seem possible in Nutch 0.8
> (??).
> What would be the right way to implement such a behaviour?
> Is it possible by creating a pair of custom HtmlParseFilter/ScoringFilter? 
>   

You can use ScoringFilter.distributeScoreToOutlink to also modify the 
target CrawlDatum, e.g. store some metadata. Then, in 
ScoringFilter.updateDbScore you can use this metadata to modify the 
output datum based on the metadata collected from inlinked datums 
(coming from outlinks, and containing your metadata). This output datum 
is then stored in CrawlDB, so you can use its metadata in the next 
round, via ScoringFilter.generatorSortValue.
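
The three steps can be sketched roughly as follows. This is a simplified, self-contained model and not the Nutch API: Datum is a hypothetical stand-in for CrawlDatum, the method signatures are reduced to the bare minimum, the "topic.relevance" key is made up, and taking the maximum over inlinks is just one possible way to combine them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the metadata flow; names mirror the ScoringFilter
// methods mentioned above, but the signatures are NOT the real ones.
class TopicalScoringSketch {
    // Assumed metadata key, not defined by Nutch.
    static final String RELEVANCE_KEY = "topic.relevance";

    // Hypothetical stand-in for CrawlDatum: a score plus string metadata.
    static class Datum {
        float score;
        final Map<String, String> metadata = new HashMap<>();
    }

    // Missing or absent values count as zero relevance.
    static float parse(String value) {
        return value == null ? 0f : Float.parseFloat(value);
    }

    // Step 1: when an outlink is emitted, attach its relevance to the
    // datum created for the link target.
    static Datum distributeScoreToOutlink(float relevance) {
        Datum target = new Datum();
        target.metadata.put(RELEVANCE_KEY, Float.toString(relevance));
        return target;
    }

    // Step 2: when the DB is updated, fold the relevance of all inlinked
    // datums into the datum that gets stored (here: keep the maximum).
    static void updateDbScore(Datum stored, List<Datum> inlinked) {
        float best = parse(stored.metadata.get(RELEVANCE_KEY));
        for (Datum in : inlinked) {
            best = Math.max(best, parse(in.metadata.get(RELEVANCE_KEY)));
        }
        stored.metadata.put(RELEVANCE_KEY, Float.toString(best));
    }

    // Step 3: in the next round, use the stored relevance to weight the
    // sort value that orders the fetchlist.
    static float generatorSortValue(Datum datum, float initSort) {
        return parse(datum.metadata.get(RELEVANCE_KEY)) * initSort;
    }

    // Demo: one page reached through two links with different relevance.
    static float demo() {
        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(distributeScoreToOutlink(0.3f));
        inlinked.add(distributeScoreToOutlink(0.8f));
        Datum stored = new Datum();
        updateDbScore(stored, inlinked);
        return generatorSortValue(stored, 1.0f);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```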

-- 
Best regards,
Andrzej Bialecki     <><