Posted to dev@nutch.apache.org by Sami Siren <ss...@gmail.com> on 2007/10/18 18:21:32 UTC

Re: Scoring API issues (LONG)

Andrzej Bialecki wrote:
> Hi all,
> 
> I've been working recently on a custom scoring plugin, and I found
> some issues with the scoring API that severely limit the way we can
> calculate static page scores. I'd like to restart the discussion about
> this API, and propose some changes. Any comments or suggestions are
> welcome!

Hi,

In practice I have found that it is sometimes just easier (and even
more efficient) to write a custom MR job (yes, an additional phase in
the process) to calculate the scores for URLs.

This strategy would give users more freedom in selecting the data (and
algorithm) required, and at the same time keep the other parts of the
process slimmer.

-- 
 Sami Siren

Re: Scoring API issues (LONG)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Hi all,
>>
>> I've been working recently on a custom scoring plugin, and I found
>> some issues with the scoring API that severely limit the way we can
>> calculate static page scores. I'd like to restart the discussion about
>> this API, and propose some changes. Any comments or suggestions are
>> welcome!
> 
> Hi,
> 
> In practice I have found that it is sometimes just easier (and even
> more efficient) to write a custom MR job (yes, an additional phase in
> the process) to calculate the scores for URLs.

Same here. For example, the PageRank calculation requires running a
separate job. Other scoring techniques that use a post-processed link
graph also require running a separate MR job.
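
To illustrate why a calculation like PageRank fits naturally into its own
job, here is a minimal sketch of one iteration split into a map phase and
a reduce phase. It is plain Python standing in for an MR job, not Nutch or
Hadoop code. The names (map_phase, reduce_phase, pagerank), the damping
factor of 0.85, and the in-memory dict used as the link graph are all
assumptions made for the example, and every link target is assumed to be
a key of the graph.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor, not taken from the thread


def map_phase(graph, scores):
    """Emit (target_url, contribution) pairs, one per outlink."""
    for url, outlinks in graph.items():
        if not outlinks:
            continue  # pages without outlinks are handled in reduce_phase
        share = scores[url] / len(outlinks)
        for target in outlinks:
            yield target, share


def reduce_phase(graph, scores, pairs):
    """Sum the contributions per URL and apply damping."""
    n = len(graph)
    incoming = defaultdict(float)
    for target, share in pairs:
        incoming[target] += share
    # Score held by pages without outlinks is spread evenly over all pages.
    dangling = sum(scores[u] for u, out in graph.items() if not out)
    return {
        url: (1.0 - DAMPING) / n + DAMPING * (incoming[url] + dangling / n)
        for url in graph
    }


def pagerank(graph, iterations=20):
    """Run a fixed number of map/reduce rounds from a uniform start."""
    scores = {url: 1.0 / len(graph) for url in graph}
    for _ in range(iterations):
        scores = reduce_phase(graph, scores, map_phase(graph, scores))
    return scores
```

Each round needs the complete link graph and the previous round's scores,
which is why this kind of scoring tends to end up as a separate pass
rather than a side effect of fetching or updating.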

> This strategy would give users more freedom in selecting the data (and
> algorithm) required, and at the same time keep the other parts of the
> process slimmer.

Right, except that the main (supposed) benefit of OPIC was that it would
avoid an additional analysis step: the scores were supposed to be
recalculated online as part of other steps. As we know, it has not
worked out this way, but this was the main motivation for introducing
the scoring API. It seems more and more that this API is just a
glorified OPIC, and that it is not sufficiently reusable to benefit
other scoring algorithms.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com