Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2015/10/26 19:00:30 UTC

[jira] [Commented] (NUTCH-2147) MetadataScoringFilter for Nutch

    [ https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974675#comment-14974675 ] 

Markus Jelsma commented on NUTCH-2147:
--------------------------------------

boolean CrawlDatum.evaluate(Expression expr) is what you need. Make sure you avoid creating new instances of Expression on every call; create it once and reuse it. In evaluate(), CrawlDatum already exposes most metadata fields (Long and BooleanWritable are missing so far), but not the status and other class attributes.
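
A rough sketch of the "create once, reuse" pattern, for illustration only. It is self-contained and does not use the Nutch or JEXL classes: Expr, Datum, compile(), score(), and the "lang" metadata key are all hypothetical stand-ins, and only the shape of boolean evaluate(Expression expr) is taken from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are the real Nutch or JEXL classes.
class MetadataScoringSketch {

    // Stand-in for a compiled expression (NOT org.apache.commons.jexl2.Expression).
    interface Expr {
        boolean test(Map<String, Object> context);
    }

    // Stand-in for CrawlDatum: models only a metadata map and evaluate().
    static class Datum {
        private final Map<String, Object> metadata = new HashMap<>();

        void putMeta(String key, Object value) {
            metadata.put(key, value);
        }

        // Same shape as boolean evaluate(Expression expr): the expression
        // sees the metadata fields, but not the status or other attributes.
        boolean evaluate(Expr expr) {
            return expr.test(metadata);
        }
    }

    // Counts how often an expression is built, to make the reuse visible.
    static int compileCount = 0;

    // Hypothetical "compile" step; a real filter would parse an expression string here.
    static Expr compile(String preferredLang) {
        compileCount++;
        return ctx -> preferredLang.equals(ctx.get("lang"));
    }

    private final Expr expr;   // built once, reused for every datum
    private final float boost;

    MetadataScoringSketch(String preferredLang, float boost) {
        this.expr = compile(preferredLang);
        this.boost = boost;
    }

    // Boosts the score when the datum's metadata matches the expression.
    float score(Datum datum, float initScore) {
        return datum.evaluate(expr) ? initScore * boost : initScore;
    }

    public static void main(String[] args) {
        MetadataScoringSketch filter = new MetadataScoringSketch("nl", 2.0f);

        Datum dutch = new Datum();
        dutch.putMeta("lang", "nl");
        Datum other = new Datum();
        other.putMeta("lang", "en");

        System.out.println("nl page score: " + filter.score(dutch, 1.0f));
        System.out.println("en page score: " + filter.score(other, 1.0f));
        System.out.println("expressions built: " + compileCount);
    }
}
```

The expression is built in the constructor rather than in score(), so scoring many entries does not allocate a new expression each time.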

> MetadataScoringFilter for Nutch
> -------------------------------
>
>                 Key: NUTCH-2147
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2147
>             Project: Nutch
>          Issue Type: New Feature
>          Components: plugin, scoring
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>
> This issue originally envisioned an implementation of a LanguagePreferenceScoringFilter, so that Nutch could easily be made into a directed crawler based on the crawl administrator's ranking of the languages we wish to crawl.
> Right now this is not possible.
> We already detect and index language within the language-identifier plugin as well as within parse-tika IIRC; however, currently the presence of a language does not affect the scoring of pages.
> The scope of this issue has changed to make it more generally applicable for a wider variety of use cases. This will therefore take advantage of NUTCH-1980 by pulling (amongst other things) Language entries from the CrawlDB Metadata.


