Posted to dev@nutch.apache.org by "Jeroen van Vianen (JIRA)" <ji...@apache.org> on 2010/06/23 15:20:50 UTC
[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
[ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeroen van Vianen updated NUTCH-831:
------------------------------------
Attachment: LuceneWriter.patch
Here's the patch to LuceneWriter.
> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
> Key: NUTCH-831
> URL: https://issues.apache.org/jira/browse/NUTCH-831
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Jeroen van Vianen
> Priority: Minor
> Fix For: 1.1
>
> Attachments: LuceneWriter.patch
>
>
> Currently, it is impossible to change the way Nutch stores / indexes / tokenizes the fields it creates while crawling and indexing URLs.
> I wanted to be able to *store* the content field so I could use my own Lucene code and highlighting code to work on the stored content field. Currently, content is only tokenized.
> See nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf) for the current settings.
> There's already code in Nutch to configure how fields are stored / indexed / tokenized from conf/nutch-site.xml:
> <property>
>   <name>lucene.field.store.content</name>
>   <value>YES</value>
> </property>
> (content is the name of the field)
> However, the BasicIndexer overrides these settings with its own. Attached is a patch which makes sure the BasicIndexer's own settings are only applied when none have been specified in nutch-site.xml.
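The behaviour described above, where a plugin default applies only if the site configuration left the key unset, can be sketched roughly as follows. This is a minimal illustration and not the contents of LuceneWriter.patch: it uses java.util.Properties as a stand-in for the Hadoop Configuration object, and the helper name setIfUnset, the default value "NO", and the key lucene.field.store.title are assumptions made for the example.

```java
import java.util.Properties;

public class FieldStoreDefaults {

    // Hypothetical helper: apply a plugin default only when the key
    // has not already been set (for example from nutch-site.xml).
    static void setIfUnset(Properties conf, String key, String defaultValue) {
        if (conf.getProperty(key) == null) {
            conf.setProperty(key, defaultValue);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the configuration loaded from nutch-site.xml.
        Properties conf = new Properties();
        conf.setProperty("lucene.field.store.content", "YES");

        // Assumed plugin defaults; the real BasicIndexer values may differ.
        setIfUnset(conf, "lucene.field.store.content", "NO");
        setIfUnset(conf, "lucene.field.store.title", "YES");

        // The site setting for content is kept; title falls back to the default.
        System.out.println(conf.getProperty("lucene.field.store.content"));
        System.out.println(conf.getProperty("lucene.field.store.title"));
    }
}
```

With this ordering, a value set by the user always wins, and the indexer's defaults only fill in keys the user did not mention.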