Posted to dev@nutch.apache.org by "Jeroen van Vianen (JIRA)" <ji...@apache.org> on 2010/06/23 15:20:50 UTC

[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

     [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeroen van Vianen updated NUTCH-831:
------------------------------------

    Attachment: LuceneWriter.patch

Here's the patch to LuceneWriter.

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: LuceneWriter.patch
>
>
> Currently, it is impossible to change the way Nutch stores / indexes / tokenizes the fields it creates while crawling and indexing URLs.
> I wanted to be able to *store* the content field so I could use my own Lucene code and highlighting code to work on the stored content field. Currently, content is only tokenized.
> See nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf) for the current settings.
> There's already code in Nutch to configure how fields are stored / indexed / tokenized from conf/nutch-site.xml:
> <property>
>   <name>lucene.field.store.content</name>
>   <value>YES</value>
> </property>
> (content is the name of the field)
> However, the BasicIndexer overrides these settings with its own. Attached is a patch which makes sure that BasicIndexer's own settings are only applied when none have been specified in nutch-site.xml.
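
The "only apply a default when nothing has been configured" behaviour described above can be sketched roughly as follows. This is an illustration of the idea, not the contents of LuceneWriter.patch: a plain Map stands in for the Hadoop Configuration object Nutch uses, the class and method names (FieldOptionDefaults, setIfUnset) are hypothetical, and the default value "NO" is an assumption based on the statement that content is currently only tokenized.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for the Hadoop Configuration used by Nutch.
// FieldOptionDefaults and setIfUnset are hypothetical names, not Nutch API.
public class FieldOptionDefaults {

    // Writes defaultValue under key only if the site configuration left it unset.
    // Returns the value that is in effect afterwards.
    public static String setIfUnset(Map<String, String> conf, String key, String defaultValue) {
        String existing = conf.get(key);
        if (existing == null) {
            conf.put(key, defaultValue);
            return defaultValue;
        }
        // A value from nutch-site.xml is already present, so leave it alone.
        return existing;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Value supplied by the user, as in the nutch-site.xml snippet above.
        conf.put("lucene.field.store.content", "YES");

        // The indexer's default ("NO" is assumed here) must not override it.
        setIfUnset(conf, "lucene.field.store.content", "NO");

        System.out.println(conf.get("lucene.field.store.content"));
    }
}
```

With the real Configuration class the same check should work, since Configuration.get(String) returns null for a property that has not been set.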

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.