Posted to user@nutch.apache.org by Jorge Luis Betancourt Gonzalez <jl...@uci.cu> on 2012/12/22 16:46:39 UTC

CrawlDatum parameter in ScoringFilters and IndexingFilters

Hi:

I've been looking into the ScoringFilter interface and I have a question. The distributeScoreToOutlinks function receives a parameter called targets, which is a collection of URL and CrawlDatum pairs corresponding to the outlinks of the URL that is currently being analyzed. The filter function of the IndexingFilter interface, on the other hand, also receives a CrawlDatum object, but that one corresponds only to the URL -> NutchDocument that is about to be indexed. My question is whether the CrawlDatum object passed to a ScoringFilter as an outlink is the same one that the IndexingFilter receives when that particular outlink is about to be indexed. I've done some tests locally and it is, but I'm worried about the distributed case: does this still hold there?

For instance, I have this:

test.html has 2 outlinks:
test.html ----> test2.html
          ----> test3.html

So, when any scoring plugin implementing ScoringFilter is called on test.html, the targets parameter has one item in the collection for every outlink in test.html, so I can modify something in the CrawlDatum objects inside the targets collection. But when test2.html is indexed, will those changes be passed to the indexing filters? I've done this locally and it works, but in a distributed environment running Nutch on top of Hadoop, will the behavior be the same?
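
To make the scenario concrete, here is a minimal Java sketch of what I am doing. It deliberately does not use the real Nutch classes: Datum is a hypothetical stand-in for CrawlDatum, tagOutlinks stands in for the body of distributeScoreToOutlinks, readTag stands in for the indexing filter, and the "tagged.by" key is made up. It only models the single-JVM case, where the same object instance is handed from one step to the other, and says nothing about what actually happens on Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// Stand-in types only: these are NOT the real Nutch classes.
class OutlinkDatumSketch {

    /** Hypothetical stand-in for CrawlDatum: nothing but a metadata map. */
    static class Datum {
        final Map<String, String> meta = new HashMap<>();
    }

    /** Mimics the scoring step: tag the datum of every outlink of fromUrl. */
    static void tagOutlinks(String fromUrl, Collection<Entry<String, Datum>> targets) {
        for (Entry<String, Datum> target : targets) {
            // "tagged.by" is a made-up metadata key used only for illustration.
            target.getValue().meta.put("tagged.by", fromUrl);
        }
    }

    /** Mimics the indexing step: read the tag back from the datum it is given. */
    static String readTag(Datum datum) {
        return datum.meta.get("tagged.by");
    }

    /** Runs the test.html example in one JVM and returns what indexing sees for test2.html. */
    static String runLocalScenario() {
        List<Entry<String, Datum>> targets = new ArrayList<>();
        targets.add(new SimpleEntry<>("test2.html", new Datum()));
        targets.add(new SimpleEntry<>("test3.html", new Datum()));

        tagOutlinks("test.html", targets);

        // Assumption under question: the same object instance reaches the
        // indexing step. That is trivially true here, inside a single process.
        return readTag(targets.get(0).getValue());
    }

    public static void main(String[] args) {
        System.out.println(runLocalScenario());
    }
}
```

In this toy version the indexing step sees the tag because both steps share one object in memory. What I cannot tell is whether the modified datum survives the same way once the steps run as separate jobs.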

Greetings! 
