Posted to dev@nutch.apache.org by "Jeroen van Vianen (JIRA)" <ji...@apache.org> on 2010/06/23 15:17:52 UTC

[jira] Created: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
-----------------------------------------------------------------------------------

                 Key: NUTCH-831
                 URL: https://issues.apache.org/jira/browse/NUTCH-831
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: Jeroen van Vianen
            Priority: Minor
             Fix For: 1.1


Currently, it is impossible to change the way Nutch stores / indexes / tokenizes the fields it creates while crawling and indexing URLs.

I wanted to be able to *store* the content field so I could use my own Lucene code and highlighting code to work on the stored content field. Currently, content is only tokenized.

See nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf) for the current settings.

There's already code in Nutch to configure how fields are stored / indexed / tokenized from conf/nutch-site.xml:

<property>
  <name>lucene.field.store.content</name>
  <value>YES</value>
</property>

(content is the name of the field)
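As a rough illustration of this naming convention, the field name is whatever follows the "lucene.field.store." prefix in the property name. The sketch below is plain Java with a Map standing in for the Hadoop Configuration object; the class name FieldStoreOptions and the method storeOptions are hypothetical and are not Nutch's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldStoreOptions {
    // Prefix taken from the property name shown above.
    public static final String STORE_PREFIX = "lucene.field.store.";

    /** Collects fieldName -> store option for every property under STORE_PREFIX. */
    public static Map<String, String> storeOptions(Map<String, String> conf) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String key = entry.getKey();
            // Skip unrelated properties and a bare prefix with no field name.
            if (key.startsWith(STORE_PREFIX) && key.length() > STORE_PREFIX.length()) {
                String fieldName = key.substring(STORE_PREFIX.length());
                result.put(fieldName, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the properties loaded from conf/nutch-site.xml.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("lucene.field.store.content", "YES");
        conf.put("some.other.property", "ignored");
        System.out.println(storeOptions(conf));
    }
}
```

Only the "content" entry survives the filter, because it is the only key under the prefix.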

However, the BasicIndexer overrides these settings with its own. Attached is a patch that makes sure BasicIndexer's own settings are applied only when none have been specified in nutch-site.xml.
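
The idea behind the patch (apply the indexer's default only when the user has not configured a value) can be sketched as follows. This is a minimal illustration, not the contents of the attached patch: a Map stands in for the Hadoop Configuration, the class IndexerDefaults and method setIfUnset are hypothetical names, the "title" property is invented for the example, and the default value "NO" for content is an assumption based on the statement that content is currently not stored.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexerDefaults {
    /** Sets key to defaultValue only when the configuration has no value for it yet. */
    public static void setIfUnset(Map<String, String> conf, String key, String defaultValue) {
        if (conf.get(key) == null) {
            conf.put(key, defaultValue);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Value the user put in conf/nutch-site.xml.
        conf.put("lucene.field.store.content", "YES");

        // Indexer defaults: the first is skipped because the user already chose a value.
        setIfUnset(conf, "lucene.field.store.content", "NO");  // assumed default
        setIfUnset(conf, "lucene.field.store.title", "YES");   // hypothetical property

        System.out.println("content: " + conf.get("lucene.field.store.content"));
        System.out.println("title:   " + conf.get("lucene.field.store.title"));
    }
}
```

With an unconditional put, the user's YES for content would be overwritten by the default, which is the behavior described in this issue.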

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-831:
------------------------------------

    Fix Version/s: 1.2

- applied to 1.2 branch

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: LuceneWriter.patch


[jira] Resolved: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-831.
-------------------------------------

         Assignee: Chris A. Mattmann
    Fix Version/s:     (was: 2.0)
       Resolution: Fixed

- fixed in r958828 and applied to branch-1.2. Jeroen, you can always use that version until we get the 2.0 version stable in trunk.

Thanks for your contribution!

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LuceneWriter.patch


[jira] Commented: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884300#action_12884300 ] 

Andrzej Bialecki  commented on NUTCH-831:
-----------------------------------------

In the future a maintenance patch like this could be applied to branch-1.2, especially since NUTCH-837 will remove this code completely from trunk.

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: LuceneWriter.patch


[jira] Commented: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884302#action_12884302 ] 

Chris A. Mattmann commented on NUTCH-831:
-----------------------------------------

Hey Andrzej,

Exactly, I applied the patch to branch-1.2 but not to the trunk. Looks like we're building up a few patches there. If we get a few more, I will gladly spin up a 1.2 release to push it out the door...

Cheers,
Chris


> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: LuceneWriter.patch


[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-831:
--------------------------------

    Fix Version/s: 1.2
                       (was: 1.1)

Moved Fix Version to 1.2.
Since 1.1 has already been released, it is not likely to contain this fix. As for 2.0, it will delegate the indexing to SOLR and won't contain any Lucene-related code.

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: LuceneWriter.patch


[jira] Commented: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883415#action_12883415 ] 

Chris A. Mattmann commented on NUTCH-831:
-----------------------------------------

I applied this patch to the Nutch 1.2 branch and all tests passed:

test:
     [echo] Testing plugin: urlnormalizer-regex
    [junit] Running org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.28 sec
    [junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.209 sec

test:

BUILD SUCCESSFUL
Total time: 10 minutes 50 seconds

I'll commit the patch there so you can have it in SVN and use it, but I'll set the fix version to nil since the movement is towards Solr in the trunk. Thanks for the contribution, regardless, Jeroen!

Cheers,
Chris


> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: LuceneWriter.patch


[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Jeroen van Vianen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeroen van Vianen updated NUTCH-831:
------------------------------------

    Attachment: LuceneWriter.patch

Here's the patch to LuceneWriter.

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: LuceneWriter.patch


[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-831:
------------------------------------

    Fix Version/s: 2.0
                       (was: 1.2)

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: LuceneWriter.patch
