Posted to user@nutch.apache.org by Nicolás Lichtmaier <ni...@reloco.com.ar> on 2007/02/22 17:12:06 UTC

Creating a new scoring filter.

Hi, I'm working on a fixed set of URLs and I'd like to replace the 
standard OPIC scoring plugin with something different. I'd like to create 
a scoring plugin which bases its score entirely on the document's parsed 
data (yes, I will trust the document text itself to decide its relevance).

I've been reading the code and the ScoringFilter interface seems to be 
targeted for use by OPIC-like algorithms. For example, the step invoked 
after parsing is called "passScoreAfterParsing()", which tells me what I 
am supposed to do in that method, and the method that sets the scores is 
called "distributeScoreToOutlink()". All of this scares me... would it 
be safe to use these methods differently and, e.g., modify the document 
score in "passScoreAfterParsing()" instead of just "passing it"?

Should I post this kind of question to the dev list instead?

Thanks!


Re: Creating a new scoring filter.

Posted by Nicolás Lichtmaier <ni...@reloco.com.ar>.
> I didn't understand the point of creating abstract base classes for
> plugins. I am not strictly opposing it or anything, I just don't see
> why it would make things simpler/more flexible. AFAICS, there is not
> much an abstract base class can do but to pass the arguments of
> assignScores to calculateScore/distributeScoreToOutlinks. I mean, here
> is how I envision a ContentBasedScoringFilter class (or a
> DistributingScoringFilter):
>
> abstract class ContentBasedScoringFilter implements ScoringFilter {
>   assignScores(args) { return calculateScore(args);  }
>   protected abstract calculateScore(args);
> }
>
> Or do you have something else in mind?

Yes, something like that. But I also thought that if you don't want to 
repeat the logic of traversing the outlinks (all the logic that is now 
in ParseOutputFormat), that logic could live in an abstract class which 
would just traverse them and call an abstract method for each one.
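
Something along these lines, maybe (only a sketch against the 
hypothetical assignScores()-style interface from this discussion; none 
of these names or signatures exist in Nutch today):

   import java.util.List;
   import org.apache.hadoop.io.Text;
   import org.apache.nutch.crawl.CrawlDatum;
   import org.apache.nutch.parse.ParseData;

   // Hypothetical base class holding the outlink-traversal logic that
   // currently lives in ParseOutputFormat.
   public abstract class DistributingScoringFilter {

     public CrawlDatum assignScores(Text fromUrl, List<String> toUrls,
         List<CrawlDatum> datums, ParseData parseData, CrawlDatum adjust) {
       // Walk the outlinks once and delegate the per-link decision.
       for (int i = 0; i < toUrls.size(); i++) {
         distributeScoreToOutlink(fromUrl, new Text(toUrls.get(i)),
             parseData, datums.get(i), toUrls.size());
       }
       return adjust;
     }

     // Subclasses decide what each individual outlink gets.
     protected abstract void distributeScoreToOutlink(Text fromUrl,
         Text toUrl, ParseData parseData, CrawlDatum target, int outlinkCount);
   }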


Re: Creating a new scoring filter.

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 2/27/07, Nicolás Lichtmaier <ni...@reloco.com.ar> wrote:
[snip]
>
> I think that good API design here means not assuming so many things
> about the plugin behaviour. You are right about this
> "distributeScoreToOutlinks()", but IMO it should be called something
> like assignScores(). Then you could add an abstract class
> DistributingScorePlugin (implementing the interface) which overrides
> assignScores() and calls an "abstract protected" method called
> distributeScoreToOutlink(). So the code for traversing the outlinks
> would be in DistributingScorePlugin.
>
> I would need another class, called ContentBasedScorePlugin. That class
> could call an abstract protected method called calculateScore() which
> would receive the parsed data and return the score.
>
> What do you think?
>
>

I didn't understand the point of creating abstract base classes for
plugins. I am not strictly opposing it or anything, I just don't see
why it would make things simpler/more flexible. AFAICS, there is not
much an abstract base class can do but to pass the arguments of
assignScores to calculateScore/distributeScoreToOutlinks. I mean, here
is how I envision a ContentBasedScoringFilter class (or a
DistributingScoringFilter):

abstract class ContentBasedScoringFilter implements ScoringFilter {
   assignScores(args) { return calculateScore(args);  }
   protected abstract calculateScore(args);
}

Or do you have something else in mind?

-- 
Doğacan Güney

Re: Creating a new scoring filter.

Posted by Nicolás Lichtmaier <ni...@reloco.com.ar>.
>> It doesn't seem like a good way to do it. What if there are no outlinks?
>> This method won't be called at all. And anyway, it would be called once
>> per outlink, which would multiply the work.
>
> The duplicated work is easy to avoid, but you are right that it won't
> work if there are no outlinks.
>
> Maybe the scoring filter API should change? A distributeScoreToOutlinks
> method like the following may be more useful than the current one (it
> would be called even if there are no outlinks):
>
> CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList,
>     List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)
>
> This method gives more control to the plugin, since knowing all the
> outlinks it can make more informed decisions. For example, right now
> there is no way a scoring filter can be sure that it has distributed all
> of its cash (e.g. if db.score.internal.link is 0.5 and
> db.score.external.link is 1.0, the filter will almost always distribute
> less than its cash).
>
> This will also work for your case, since you will just ignore the
> outlinks and return the adjust datum based on information in the parse
> metadata.
>
> What do you (and others) think?

I think that good API design here means not assuming so many things 
about the plugin behaviour. You are right about this 
"distributeScoreToOutlinks()", but IMO it should be called something 
like assignScores(). Then you could add an abstract class 
DistributingScorePlugin (implementing the interface) which overrides 
assignScores() and calls an "abstract protected" method called 
distributeScoreToOutlink(). So the code for traversing the outlinks 
would be in DistributingScorePlugin.

I would need another class, called ContentBasedScorePlugin. That class 
could call an abstract protected method called calculateScore() which 
would receive the parsed data and return the score.

What do you think?


Re: Creating a new scoring filter.

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 2/27/07, Nicolás Lichtmaier <ni...@reloco.com.ar> wrote:

[snip]

>
> It doesn't seem like a good way to do it. What if there are no outlinks?
> This method won't be called at all. And anyway, it would be called once
> per outlink, which would multiply the work.

The duplicated work is easy to avoid, but you are right that it won't
work if there are no outlinks.

Maybe the scoring filter API should change? A distributeScoreToOutlinks
method like the following may be more useful than the current one (it
would be called even if there are no outlinks):

CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList,
    List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)

This method gives more control to the plugin, since knowing all the
outlinks it can make more informed decisions. For example, right now
there is no way a scoring filter can be sure that it has distributed all
of its cash (e.g. if db.score.internal.link is 0.5 and
db.score.external.link is 1.0, the filter will almost always distribute
less than its cash).

This will also work for your case, since you will just ignore the
outlinks and return the adjust datum based on information in the parse
metadata.
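For example, something like this (only a sketch against the proposed
method above; "content.score" is a made-up parse-metadata key, and error
handling is omitted):

   // Sketch: a content-based filter implemented on top of the *proposed*
   // distributeScoreToOutlinks(). The outlinks are ignored entirely.
   public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
       List<String> toUrlList, List<CrawlDatum> datumList,
       ParseData parseData, CrawlDatum adjust) {
     String score = parseData.getParseMeta().get("content.score");
     if (score == null) return adjust;
     CrawlDatum result = new CrawlDatum();
     result.setStatus(CrawlDatum.STATUS_LINKED);
     result.setScore(Float.parseFloat(score));
     return result;
   }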

What do you (and others) think?

>
> Thanks!
>
>


-- 
Doğacan Güney

Re: Creating a new scoring filter.

Posted by Nicolás Lichtmaier <ni...@reloco.com.ar>.
>> Yeah, but there I don't have the parse data for those new pages. What I
>> would like to do is override "passScoreAfterParsing()" and not pass
>> anything: just analyze the parsed data and decide a score. The problem
>> is that that function doesn't get passed the CrawlDatum... it seems I'll
>> need to modify Nutch itself.... =(
> Can you be a bit more specific about your problem?

I'm indexing a fixed set of URLs that I think are a specific type of 
document. I don't care about links (I'm using -noAdditions to prevent 
links from being added to the crawldb; I've backported that to 0.8.x and 
it's waiting for somebody to commit it =) 
https://issues.apache.org/jira/browse/NUTCH-438 ).

I just want to replace the scoring algorithm with one which tests whether 
each URL really is that specific type of document. I want to use the 
parse data of a document to calculate its relevance.

> Anyway, without the details, here is my guess on how you can do it:
> 1) In passScoreAfterParsing(), analyze the content and parse text, and
> put the relevant score information in the parse data's metadata.
> 2) In distributeScoreToOutlink(), ignore the outlinks (just give them
> initialScore()), but check your parse data and return an adjust datum
> with the status STATUS_LINKED and a score extracted from the parse data.
> This adjust datum will update the score of the original datum in updatedb.
>
> Does this work for you?

It doesn't seem like a good way to do it. What if there are no outlinks? 
This method won't be called at all. And anyway, it would be called once 
per outlink, which would multiply the work.

Thanks!


Re: Re: Creating a new scoring filter.

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 2/24/07, Nicolás Lichtmaier <ni...@reloco.com.ar> wrote:
>
> >> Hi, I'm working on a fixed set of URLs and I'd like to replace the
> >> standard OPIC scoring plugin with something different. I'd like to
> >> create a scoring plugin which bases its score entirely on the
> >> document's parsed data (yes, I will trust the document text itself to
> >> decide its relevance).
> >>
> >> I've been reading the code and the ScoringFilter interface seems to
> >> be targeted for use by OPIC-like algorithms. For example, the step
> >> invoked after parsing is called "passScoreAfterParsing()", which tells
> >> me what I am supposed to do in that method, and the method that sets
> >> the scores is called "distributeScoreToOutlink()". All of this scares
> >> me... would it be safe to use these methods differently and, e.g.,
> >> modify the document score in "passScoreAfterParsing()" instead of
> >> just "passing it"?
> >
> > You can modify whichever way you want - it's up to you. These methods
> > simply ensure that the score data (not just the CrawlDatum.getScore(),
> > but possibly a multitude of metadata collected on the way) is passed
> > to appropriate segment parts.
> >
> > E.g. in distributeScoreToOutlink() you could simply set the default
> > score for new pages to a fixed value, without actually using the score
> > information from the source page.
> >
>
> Yeah, but there I don't have the parse data for those new pages. What I
> would like to do is override "passScoreAfterParsing()" and not pass
> anything: just analyze the parsed data and decide a score. The problem
> is that that function doesn't get passed the CrawlDatum... it seems I'll
> need to modify Nutch itself.... =(

Can you be a bit more specific about your problem?

Anyway, without the details, here is my guess on how you can do it:
1) In passScoreAfterParsing(), analyze the content and parse text, and
put the relevant score information in the parse data's metadata.
2) In distributeScoreToOutlink(), ignore the outlinks (just give them
initialScore()), but check your parse data and return an adjust datum
with the status STATUS_LINKED and a score extracted from the parse data.
This adjust datum will update the score of the original datum in updatedb.

Does this work for you?
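
In code, the idea is roughly this (only a sketch: I am writing the hook
signatures from memory, "content.score" and myRelevanceHeuristic() are
made up, and the other ScoringFilter methods are omitted):

   // Step 1: stash the content-based score in the parse metadata.
   public void passScoreAfterParsing(Text url, Content content, Parse parse) {
     float relevance = myRelevanceHeuristic(parse.getText());
     parse.getData().getParseMeta().set("content.score",
         Float.toString(relevance));
   }

   // Step 2: leave the outlink's initial score alone, but return an
   // adjust datum that updatedb folds back into the source page's score.
   public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
       ParseData parseData, CrawlDatum target, CrawlDatum adjust,
       int allCount, int validCount) {
     String score = parseData.getParseMeta().get("content.score");
     if (score == null) return adjust;
     if (adjust == null) adjust = new CrawlDatum();
     adjust.setStatus(CrawlDatum.STATUS_LINKED);
     adjust.setScore(Float.parseFloat(score));
     return adjust;
   }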

>
> Thanks!
>
>


-- 
Doğacan Güney

Re: Re: Creating a new scoring filter.

Posted by Nicolás Lichtmaier <ni...@reloco.com.ar>.
>> Hi, I'm working on a fixed set of URLs and I'd like to replace the 
>> standard OPIC scoring plugin with something different. I'd like to 
>> create a scoring plugin which bases its score entirely on the 
>> document's parsed data (yes, I will trust the document text itself to 
>> decide its relevance).
>>
>> I've been reading the code and the ScoringFilter interface seems to 
>> be targeted for use by OPIC-like algorithms. For example, the step 
>> invoked after parsing is called "passScoreAfterParsing()", which tells 
>> me what I am supposed to do in that method, and the method that sets 
>> the scores is called "distributeScoreToOutlink()". All of this scares 
>> me... would it be safe to use these methods differently and, e.g., 
>> modify the document score in "passScoreAfterParsing()" instead of 
>> just "passing it"?
>
> You can modify whichever way you want - it's up to you. These methods 
> simply ensure that the score data (not just the CrawlDatum.getScore(), 
> but possibly a multitude of metadata collected on the way) is passed 
> to appropriate segment parts.
>
> E.g. in distributeScoreToOutlink() you could simply set the default 
> score for new pages to a fixed value, without actually using the score 
> information from the source page.
>

Yeah, but there I don't have the parse data for those new pages. What I 
would like to do is override "passScoreAfterParsing()" and not pass 
anything: just analyze the parsed data and decide a score. The problem 
is that that function doesn't get passed the CrawlDatum... it seems I'll 
need to modify Nutch itself.... =(

Thanks!


Re: Creating a new scoring filter.

Posted by Andrzej Bialecki <ab...@getopt.org>.
(moved from nutch-user)

Nicolás Lichtmaier wrote:

>
> Should I post this kind of question to the dev list instead?

Yes :)

> Hi, I'm working on a fixed set of URLs and I'd like to replace the 
> standard OPIC scoring plugin with something different. I'd like to 
> create a scoring plugin which bases its score entirely on the document's 
> parsed data (yes, I will trust the document text itself to decide its 
> relevance).
>
> I've been reading the code and the ScoringFilter interface seems to be 
> targeted for use by OPIC-like algorithms. For example, the step invoked 
> after parsing is called "passScoreAfterParsing()", which tells me what 
> I am supposed to do in that method, and the method that sets the scores 
> is called "distributeScoreToOutlink()". All of this scares me... would it 
> be safe to use these methods differently and, e.g., modify the 
> document score in "passScoreAfterParsing()" instead of just "passing it"?

You can modify whichever way you want - it's up to you. These methods 
simply ensure that the score data (not just the CrawlDatum.getScore(), 
but possibly a multitude of metadata collected on the way) is passed to 
appropriate segment parts.

E.g. in distributeScoreToOutlink() you could simply set the default 
score for new pages to a fixed value, without actually using the score 
information from the source page.
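
For example, something along these lines (just a sketch; I am quoting the 
method signature from memory and the fixed value is arbitrary):

   // Sketch: give every newly discovered page the same fixed score,
   // ignoring the score of the page that linked to it.
   public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
       ParseData parseData, CrawlDatum target, CrawlDatum adjust,
       int allCount, int validCount) {
     target.setScore(0.5f);  // fixed default, not derived from fromUrl
     return adjust;          // no adjustment to the source page
   }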

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com