Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2015/07/01 09:38:04 UTC

[jira] [Updated] (NUTCH-1980) Jexl expressions for CrawlDbReader

     [ https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1980:
---------------------------------
    Description: 
Jexl expression support for the CrawlDbReader. This allows you to read items from the database based on their metadata, with flexibility and boolean logic. Some examples:

* Get all English pages
-expr "lang=en"

* Get all English pages that have a low response time
-expr "lang=en && _rs_ > 5000"

  was:
We are already using Jexl expressions to filter records from HostDb dumps. This is really helpful when your CrawlDb is stuffed with metadata generated by parser filters; in our case these are mostly scores generated by classification plugins that run on text or structure.

In the case of the HostDb, it operates on hosts only, so it is easy to collect a set of sites that host mostly a specific language, pornographic content, or just host topics that your classifiers are trained for.

By adding this magic to the CrawlDbReader, you can get lists of actual records that contain the stuff you are looking for.

Most work is already in the HostDb patch so it is easy to translate to individual records. Patch tomorrow, probably...


> Jexl expressions for CrawlDbReader
> ----------------------------------
>
>                 Key: NUTCH-1980
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1980
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch, NUTCH-1980.patch
>
>
> Jexl expression support for the CrawlDbReader. This allows you to read items from the database based on their metadata, with flexibility and boolean logic. Some examples:
> * Get all English pages
> -expr "lang=en"
> * Get all English pages that have a low response time
> -expr "lang=en && _rs_ > 5000"
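
For readers unfamiliar with this kind of metadata filtering, the two expressions quoted above can be pictured as boolean predicates applied to each record's metadata. The following is only a rough sketch in plain Python, not Jexl and not Nutch code: the record layout, the sample URLs and values, the function names, and the choice to treat a missing key as a non-match are all assumptions made for illustration.

```python
# Hypothetical stand-in for CrawlDb records: a URL plus a metadata dict.
# The real CrawlDb stores records differently; this layout is illustrative only.
records = [
    {"url": "http://example.org/a", "meta": {"lang": "en", "_rs_": 7200}},
    {"url": "http://example.org/b", "meta": {"lang": "en", "_rs_": 310}},
    {"url": "http://example.org/c", "meta": {"lang": "nl", "_rs_": 9000}},
    {"url": "http://example.org/d", "meta": {}},
]


def is_english(meta):
    """Rough analogue of -expr "lang=en"."""
    return meta.get("lang") == "en"


def is_english_and_rs_above_5000(meta):
    """Rough analogue of -expr "lang=en && _rs_ > 5000".

    Assumption: a record without an _rs_ value does not match.
    """
    rs = meta.get("_rs_")
    return meta.get("lang") == "en" and rs is not None and rs > 5000


def select(records, predicate):
    """Return the URLs of all records whose metadata satisfies the predicate."""
    return [r["url"] for r in records if predicate(r["meta"])]


english_urls = select(records, is_english)
filtered_urls = select(records, is_english_and_rs_above_5000)
```

With the sample data above, `english_urls` holds the first two URLs and `filtered_urls` holds only the first. The point of the sketch is that each expression is evaluated once per record against that record's metadata keys, and the boolean operators combine the individual conditions.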


