Posted to dev@nutch.apache.org by "Julien Nioche (Updated) (JIRA)" <ji...@apache.org> on 2012/02/06 17:37:59 UTC

[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-metadata)

     [ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1264:
---------------------------------

    Description: 
Several plugins, already distributed or proposed, currently do very similar things:
- parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them
- headings [NUTCH-1005] to generate headings fields in parse-metadata and index them
- index-extra [NUTCH-422] to index configurable fields 
- urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them
- index-static [NUTCH-940] to generate configurable static fields 

All of these plugins extract information from various sources and generate fields from it, so they are largely redundant. Instead, this issue proposes to have a single plugin that generates configurable fields from:
- static values
- parse metadata
- content metadata
- crawldb metadata

and to let the other plugins focus on parsing and extracting the values to index. This will make adding new fields simpler, since they can rely on one stable common plugin instead of duplicating code across several plugins.

This plugin will replace index-extra [NUTCH-422] and will serve as a basis for further improvements.
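
To make the idea concrete, here is a minimal sketch of how one configurable component could assemble fields from the four sources listed above. It is an illustration only, under stated assumptions: the class and method names (ConfigurableFieldSketch, buildFields, copyConfigured) are hypothetical, the metadata sources are modelled as plain Java maps, and nothing here is the Nutch plugin API or the code in the attached patches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch, NOT the Nutch IndexingFilter API: metadata sources
 * are plain maps and the "configuration" is a list of keys per source.
 */
public class ConfigurableFieldSketch {

    /** Copies only the configured keys from one metadata source into the output fields. */
    static void copyConfigured(List<String> keys, Map<String, String> source,
                               Map<String, List<String>> fields) {
        for (String key : keys) {
            String value = source.get(key);
            if (value != null) {
                // Fields are multi-valued, so append rather than overwrite.
                fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }
        }
    }

    /** Builds the fields to index from static values plus three metadata sources. */
    static Map<String, List<String>> buildFields(
            Map<String, String> staticFields,
            List<String> parseKeys, Map<String, String> parseMeta,
            List<String> contentKeys, Map<String, String> contentMeta,
            List<String> dbKeys, Map<String, String> crawlDbMeta) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        // Static values are added as-is for every document.
        staticFields.forEach((name, value) ->
                fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value));
        copyConfigured(parseKeys, parseMeta, fields);
        copyConfigured(contentKeys, contentMeta, fields);
        copyConfigured(dbKeys, crawlDbMeta, fields);
        return fields;
    }

    public static void main(String[] args) {
        // All keys and values below are made-up sample data.
        Map<String, List<String>> fields = buildFields(
                Map.of("source", "nutch"),
                List.of("description"), Map.of("description", "A test page"),
                List.of("Content-Type"), Map.of("Content-Type", "text/html"),
                List.of("category"), Map.of("category", "news"));
        System.out.println(fields);
    }
}
```

In this sketch, adding a new field only requires adding a key to one of the configured lists; the plugins that parse and extract the values need no indexing code of their own.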




  was:
Several plugins, already distributed or proposed, currently do very similar things:
- parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them
- headings [NUTCH-1005] to generate headings fields in parse-metadata and index them
- index-extra [NUTCH-422] to index configurable fields 
- urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them
- index-static [NUTCH-940] to generate configurable static fields 

All of these plugins extract information from various sources and generate fields from it, so they are largely redundant. Instead, this issue proposes to have a single plugin that generates configurable fields from:
- static values
- parse metadata
- content metadata
- crawldb metadata

and to let the other plugins focus on parsing and extracting the values to index. This will make adding new fields simpler, since they can rely on one stable common plugin instead of duplicating code across several plugins.

This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] and will serve as a basis for further improvements.




        Summary: Configurable indexing plugin (index-metadata)   (was: Configurable indexing plugin (index-extra) )
    
> Configurable indexing plugin (index-metadata) 
> ----------------------------------------------
>
>                 Key: NUTCH-1264
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1264
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.5
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch
>
>
> Several plugins, already distributed or proposed, currently do very similar things:
> - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them
> - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them
> - index-extra [NUTCH-422] to index configurable fields 
> - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them
> - index-static [NUTCH-940] to generate configurable static fields 
> All of these plugins extract information from various sources and generate fields from it, so they are largely redundant. Instead, this issue proposes to have a single plugin that generates configurable fields from:
> - static values
> - parse metadata
> - content metadata
> - crawldb metadata
> and to let the other plugins focus on parsing and extracting the values to index. This will make adding new fields simpler, since they can rely on one stable common plugin instead of duplicating code across several plugins.
> This plugin will replace index-extra [NUTCH-422] and will serve as a basis for further improvements.
