Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/03/28 22:21:23 UTC

[jira] Created: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Scoring API: extension point, scoring filters and an OPIC plugin
----------------------------------------------------------------

         Key: NUTCH-240
         URL: http://issues.apache.org/jira/browse/NUTCH-240
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Andrzej Bialecki 


This patch refactors all places where Nutch manipulates page scores into a plugin-based API. Using this API, it's possible to implement different scoring algorithms. It also makes it much easier to understand how scoring works.

Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters.
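
As a rough illustration of running plugins in sequence, here is a minimal Java sketch. The `ScoreStep` interface and `ScoreChain` class are hypothetical stand-ins and are not taken from the patch; the only assumption is that each plugin receives the result of the previous one.

```java
import java.util.List;

// Hypothetical stand-in for a scoring plugin; not the interface from the patch.
interface ScoreStep {
    float apply(String url, float score);
}

final class ScoreChain {
    private final List<ScoreStep> steps;

    ScoreChain(List<ScoreStep> steps) {
        this.steps = List.copyOf(steps);
    }

    // Runs every step in order; each step sees the previous step's result.
    float run(String url, float initialScore) {
        float score = initialScore;
        for (ScoreStep step : steps) {
            score = step.apply(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        ScoreChain chain = new ScoreChain(List.of(
            (url, s) -> s * 0.5f,                            // damp every score
            (url, s) -> url.endsWith("/") ? s + 1.0f : s));  // boost directory-style URLs
        System.out.println(chain.run("http://example.com/", 2.0f));
    }
}
```

Because the steps are applied in list order, swapping two steps can change the final score, which is the same ordering concern that applies to chained URL filters.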

Also included is an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API, it provides fully backward-compatible scoring.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]

Andrzej Bialecki  updated NUTCH-240:
------------------------------------

    Attachment: patch1.txt

Updated patch, includes the Generator.patch.txt. Changes:

* reduce creation of new Objects in CrawlDbReducer

* simplify API by removing the need to set/restore score value in Generator.




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ] 

Doug Cutting commented on NUTCH-240:
------------------------------------

First, I hope my critical remarks were not taken personally.  I am thankful for this and all of your contributions.

> Initially, I did as you suggest, i.e. I created a method to calculate one float value for the purpose of selecting topN. However, I wanted to avoid changing CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score it seemed to me we should store its earlier value, and then possibly restore it - as the value for selecting topN may have nothing to do with the "real" score.

In Generate.java, can't we just change the key type in the first pass to be a FloatWritable holding the score, and the value to be <CrawlDatum,Url>?  Then we'd never alter the CrawlDatum and there'd be no need to restore it.
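
To make the suggestion concrete, here is a minimal Java sketch of selecting topN by a separate float key. It uses plain Java collections instead of Hadoop types, and `Datum`, `Keyed` and `TopNSelector` are hypothetical names that do not come from Generator or from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in; the real CrawlDatum is not reproduced here.
final class Datum {
    final float score;
    Datum(float score) { this.score = score; }
}

// One first-pass record: the sort key sits beside the datum, not inside it.
final class Keyed {
    final float sortKey;
    final String url;
    final Datum datum;

    Keyed(float sortKey, String url, Datum datum) {
        this.sortKey = sortKey;
        this.url = url;
        this.datum = datum;
    }
}

final class TopNSelector {
    // Returns up to n records with the highest sort keys, highest first.
    static List<Keyed> select(List<Keyed> records, int n) {
        List<Keyed> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingDouble((Keyed k) -> k.sortKey).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        Datum d = new Datum(0.3f);
        List<Keyed> picked = select(List.of(
            new Keyed(0.2f, "http://a.example/", d),
            new Keyed(0.9f, "http://b.example/", d)), 1);
        System.out.println(picked.get(0).url);
    }
}
```

Since the key is carried next to the datum, the datum's own score is never overwritten and nothing has to be restored afterwards.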

> passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, but that's what we do at the moment, I just extracted it into an interface. I'd love to skip this altogether, if there is a way.

I think we should spend a little more time thinking about how to make this a nice API before we start having folks implement it.  Once an interface is added, it's much harder to change.  I don't have much time to spend on this today, but might next week.




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ] 

Doug Cutting commented on NUTCH-240:
------------------------------------

+1 for committing Generator.patch.txt now.

0 for committing the rest until I've had more time to think about it.  I'm not against it, but, at a glance, I'm still hopeful we can do better.




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ] 

Doug Cutting commented on NUTCH-240:
------------------------------------

The generator store/restore score stuff seems ugly.  And it is not used by OPIC.  Could we instead have a method that computes and returns a score to be used by the generator?  Then it is up to the generator to use this w/o modifying the CrawlDatum.

The passScoreBeforeParsing/passScoreAfterParsing/distributeScoreToOutlink protocol also seems awkward, although I don't yet have a suggestion for how to improve it.




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373264 ] 

Andrzej Bialecki  commented on NUTCH-240:
-----------------------------------------

Oops, sorry, that was a last moment change ... I fixed it now, thanks for spotting this.




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ] 

Doug Cutting commented on NUTCH-240:
------------------------------------

Also, note that we can now extend Hadoop's new MapReduceBase to implement configure() and close() for many Mappers and Reducers, including the ones in this patch.
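
The general pattern can be sketched as follows. This is only an illustration with hypothetical stand-in types (`SketchMapper`, `SketchBase`, `UpperCaseMapper`); it does not use or reproduce Hadoop's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins that mirror the shape of the idea, not Hadoop's real types.
interface SketchMapper {
    void configure(Map<String, String> conf);
    void map(String key, String value, List<String> out);
    void close();
}

// Base class with empty lifecycle methods, so subclasses override only what they need.
abstract class SketchBase {
    public void configure(Map<String, String> conf) { }
    public void close() { }
}

// The base class supplies configure() and close(); only map() is written here.
final class UpperCaseMapper extends SketchBase implements SketchMapper {
    @Override
    public void map(String key, String value, List<String> out) {
        out.add(key + "=" + value.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SketchMapper mapper = new UpperCaseMapper();
        List<String> out = new ArrayList<>();
        mapper.configure(Map.of());
        mapper.map("k", "v", out);
        mapper.close();
        System.out.println(out);
    }
}
```

In this sketch the base class provides only the lifecycle methods, so the subclass still has to declare the mapper interface itself.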




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Shawn Gervais (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373256 ] 

Shawn Gervais commented on NUTCH-240:
-------------------------------------

This change seems to have caused an error to be thrown:

060405 034711 Generator: Partitioning selected urls by host, for politeness.
Exception in thread "main" java.lang.RuntimeException: class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper
        at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:262)
        at org.apache.hadoop.mapred.JobConf.setMapperClass(JobConf.java:249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:263)
        at org.apache.nutch.crawl.Generator.main(Generator.java:317)

Just FYI.




[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]

Andrzej Bialecki  updated NUTCH-240:
------------------------------------

    Attachment: patch2.txt

Minor refactoring: passScore* methods now allow access to more data. I found this useful when implementing a different scoring plugin.




[jira] Assigned: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]

Andrzej Bialecki  reassigned NUTCH-240:
---------------------------------------

    Assign To: Andrzej Bialecki 




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372580 ] 

Andrzej Bialecki  commented on NUTCH-240:
-----------------------------------------

> First, I hope my critical remarks were not taken personally. I am thankful for this and all of your contributions. 

Not at all, we're not arguing but argumenting - we both want to find the best solution.

Re: generate. Yes, that's a nice way out, it would satisfy the requirement I described above, without this awkward step.

Re: passScore* : let me explain a bit the requirements that led me to this. In some cases there will be multiple metadata (not just a single primitive value) that drive the score, i.e. the final "score" and its distribution may depend on many values in CrawlDatum metadata (e.g. URL classification, expert evaluation, users' feedback, white/black-lists, etc). The passScore* API allows you to copy this arbitrary metadata from CrawlDatum-s (coming from CrawlDb -> crawl_generate) down to the parsing process and the score distribution step to outlinks. The distributeScore API would pick up this value (or these values) and base its score distribution decisions on them.

This API just mimics what was already there (only now you can use arbitrary metadata for scoring), and now we can plainly see it's an ugly way to do this. :) But the proper solution should allow passing arbitrary metadata from CrawlDb to the page scoring steps after parsing, and to the outlink score distribution process.
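
A minimal Java sketch of the metadata flow described above, under loud assumptions: plain maps stand in for the CrawlDatum, content and parse metadata, the key names are made up, and the method names and the "trusted" weighting rule are hypothetical rather than taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for CrawlDatum, content and parse metadata.
final class MetadataPassing {
    static final String SCORE_KEY = "sketch.score";      // assumed key name
    static final String CLASS_KEY = "sketch.url.class";  // assumed key name

    // Before parsing: copy scoring-related entries from the datum to the fetched content.
    static void passBeforeParsing(Map<String, String> datum, Map<String, String> content) {
        copy(datum, content);
    }

    // After parsing: carry the same entries on to the parse result.
    static void passAfterParsing(Map<String, String> content, Map<String, String> parse) {
        copy(content, parse);
    }

    private static void copy(Map<String, String> from, Map<String, String> to) {
        for (String key : new String[] {SCORE_KEY, CLASS_KEY}) {
            String value = from.get(key);
            if (value != null) {
                to.put(key, value);
            }
        }
    }

    // Outlink scoring can then consult more than one value (made-up weighting rule).
    static float outlinkScore(Map<String, String> parse, int outlinkCount) {
        if (outlinkCount == 0) {
            return 0.0f;
        }
        float score = Float.parseFloat(parse.getOrDefault(SCORE_KEY, "0"));
        float weight = "trusted".equals(parse.get(CLASS_KEY)) ? 1.0f : 0.5f;
        return weight * score / outlinkCount;
    }

    public static void main(String[] args) {
        Map<String, String> datum = new HashMap<>();
        datum.put(SCORE_KEY, "4.0");
        datum.put(CLASS_KEY, "trusted");
        Map<String, String> content = new HashMap<>();
        Map<String, String> parse = new HashMap<>();
        passBeforeParsing(datum, content);
        passAfterParsing(content, parse);
        System.out.println(outlinkScore(parse, 4));
    }
}
```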

Another issue: the reason for returning an "adjust" value from distributeScoreToOutlink is that in some algorithms (among others OPIC - but we don't implement this part now...) the fact that a certain score was distributed to an outlink should affect the score of the page that is the source of this link.
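
One way to picture the "adjust" value is the sketch below. It assumes, purely for illustration, that a page splits its score equally among its outlinks and that the adjustment is the negative of the total given away; `AdjustSketch` and `distribute` are hypothetical names, and this is not the behaviour of the OPIC plugin in the patch.

```java
final class AdjustSketch {
    // Gives each outlink an equal share of the page's score and returns the
    // adjustment to apply to the source page (assumed: minus the total given away).
    static float distribute(float pageScore, float[] outlinkScores) {
        if (outlinkScores.length == 0) {
            return 0.0f;
        }
        float share = pageScore / outlinkScores.length;
        float adjust = 0.0f;
        for (int i = 0; i < outlinkScores.length; i++) {
            outlinkScores[i] = share;
            adjust -= share;
        }
        return adjust;
    }

    public static void main(String[] args) {
        float pageScore = 2.0f;
        float[] outlinks = new float[4];
        float adjust = distribute(pageScore, outlinks);
        // The caller decides how to apply the adjustment to the source page.
        System.out.println("share=" + outlinks[0] + " newPageScore=" + (pageScore + adjust));
    }
}
```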




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12377200 ] 

Andrzej Bialecki  commented on NUTCH-240:
-----------------------------------------

If there are no further suggestions or objections, I'd like to move forward on this patch. I know the passScore* methods are a bit awkward, but that's what we do anyway, we just do it under the carpet :)

If folks have ideas how to improve this part, please speak up.




[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]

Andrzej Bialecki  updated NUTCH-240:
------------------------------------

    Attachment: patch.txt




[jira] Closed: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]
     
Andrzej Bialecki  closed NUTCH-240:
-----------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

Patches applied. Any further API improvements are welcome; the current API is less than ideal but allows experimenting with various scoring strategies, which is IMHO more important at this moment than API purity.




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ] 

Jerome Charron commented on NUTCH-240:
--------------------------------------

+1




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372379 ] 

Andrzej Bialecki  commented on NUTCH-240:
-----------------------------------------

Yes, one of the reasons I wanted to discuss these patches is that they uncovered some of the underlying ugliness... ;)

The reason for generator store/restore is that scoring plugins could take into account many more variables than just the score recorded in CrawlDatum.score. They could also have different strategies for prioritizing pages to be included in topN.

So, it's true this is not currently used by OPIC but I think without this it's not possible for plugins to affect the choice of topN.

Initially, I did as you suggest, i.e. I created a method to calculate one float value for the purpose of selecting topN. However, I wanted to avoid changing CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score it seemed to me we should store its earlier value, and then possibly restore it - as the value for selecting topN may have nothing to do with the "real" score.

passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, but that's what we do at the moment, I just extracted it into an interface. I'd love to skip this altogether, if there is a way.




[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]

Andrzej Bialecki  updated NUTCH-240:
------------------------------------

    Attachment: Generator.patch.txt

This patch is an intermediate step towards the simplification of the scoring API. It changes Generator to use an arbitrary FloatWritable for selecting topN records.

If there are no objections, I'd like to commit this patch first, and then refactor the scoring API to use this new Generator.

