Posted to dev@nutch.apache.org by "Ernesto De Santis (JIRA)" <ji...@apache.org> on 2006/10/14 22:49:35 UTC
[jira] Created: (NUTCH-386) Plugin to index categories by url rules
Plugin to index categories by url rules
---------------------------------------
Key: NUTCH-386
URL: http://issues.apache.org/jira/browse/NUTCH-386
Project: Nutch
Issue Type: New Feature
Components: indexer, searcher
Reporter: Ernesto De Santis
Priority: Minor
The compressed zip contains an install_notes.txt file with instructions.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Beaucarnea (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666122#action_12666122 ]
Beaucarnea commented on NUTCH-386:
----------------------------------
Did you activate the plugin on the searcher side as well as on the crawler side? That is, did you include the plugin in the nutch-site.xml of your Nutch web application in Tomcat?
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "abdessalem dridi (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636578#action_12636578 ]
abdessalem dridi commented on NUTCH-386:
----------------------------------------
How can I use this plugin?
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "martin lopez (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710044#action_12710044 ]
martin lopez commented on NUTCH-386:
------------------------------------
Newbie question: Why do you update nutch-conf and not nutch-site?
Do you have a large rules.properties list with sensible rules?
Kind Regards
Martin Lopez
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Stefano Tauriello (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666576#action_12666576 ]
Stefano Tauriello commented on NUTCH-386:
-----------------------------------------
Can someone help me?
It's very urgent, please.
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Stefano Tauriello (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666129#action_12666129 ]
Stefano Tauriello commented on NUTCH-386:
-----------------------------------------
I've only modified the nutch-default.xml files on the crawler side and on the searcher side, in this way:
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-url-category</value>
I've not modified nutch-site.xml.
What do you suggest?
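The value above is a regular expression over plugin ids. As a minimal sketch, assuming Nutch requires the whole plugin id to match this expression, the following checks which ids the quoted value would select. PluginIncludesCheck and isIncluded are hypothetical helpers, not part of Nutch, and index-more is used only as an example of an id that is absent from the list.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks which plugin ids a
// plugin.includes value would select, assuming the whole id must match.
class PluginIncludesCheck {

    // The value quoted in the comment above, split only for line length.
    static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-basic"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-url-category";

    /** Returns true if the plugin id is matched in full by the includes expression. */
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Print whether each sample id would be activated by the expression.
        for (String id : new String[] {"index-url-category", "parse-html", "index-more"}) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }
    }
}
```

Nutch's usual convention is to place such overrides in nutch-site.xml rather than to edit nutch-default.xml, so the same value would normally go under the plugin.includes property there, on both the crawler and the searcher side.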
[jira] Issue Comment Edited: (NUTCH-386) Plugin to index categories by url rules
Posted by "martin lopez (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710044#action_12710044 ]
martin lopez edited comment on NUTCH-386 at 5/15/09 6:07 PM:
-------------------------------------------------------------
Newbie question: Why do you update nutch-default and not nutch-site?
Do you have a large rules.properties list with sensible rules?
Kind Regards
Martin Lopez
was (Author: mlop49):
Newbie question: Why do you update nutch-conf and not nutch-site?
Do you have a large rules.properties list with sensible rules?
Kind Regards
Martin Lopez
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Andrey (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490189 ]
Andrey commented on NUTCH-386:
------------------------------
Hi, I'm trying to add your plugin to Nutch 0.9. When I execute the command below, I get the following error:
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
2007-04-20 02:32:24,850 INFO indexer.IndexingFilters - Adding org.b2b.nutch.indexer.UrlCategoryIndexFilter
2007-04-20 02:32:24,857 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2007-04-20 02:32:25,296 WARN mapred.LocalJobRunner - job_o8ys6
java.lang.AbstractMethodError: org.b2b.nutch.indexer.UrlCategoryIndexFilter.filter(Lorg/apache/lucene/document/Document;Lorg/apache/nutch/parse/Parse;Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;Lorg/apache/nutch/crawl/Inlinks;)Lorg/apache/lucene/document/Document;
at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:110)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:215)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2007-04-20 02:32:26,109 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
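The AbstractMethodError above prints the method that Nutch tried to invoke in JVM descriptor form, which is hard to read. As a minimal sketch, the descriptor can be decoded into plain class names. DescriptorDecoder and its methods are hypothetical helpers, not part of Nutch or Hadoop, and only object types of the form Lpkg/Name; are handled, which is all this trace contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decodes a JVM method descriptor made of object types,
// such as the one printed in the AbstractMethodError above.
class DescriptorDecoder {

    /** Returns the parameter class names, in order, e.g. "(La/B;Lc/D;)Le/F;" gives [a.B, c.D]. */
    static List<String> parameterTypes(String descriptor) {
        int close = descriptor.indexOf(')');
        return objectTypes(descriptor.substring(1, close));
    }

    /** Returns the return type class name, e.g. "(La/B;)Le/F;" gives e.F. */
    static String returnType(String descriptor) {
        int close = descriptor.indexOf(')');
        return objectTypes(descriptor.substring(close + 1)).get(0);
    }

    // Splits a run of "Lpkg/Name;" entries into dotted class names.
    private static List<String> objectTypes(String part) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < part.length()) {
            int end = part.indexOf(';', i);
            if (part.charAt(i) != 'L' || end < 0) {
                throw new IllegalArgumentException("only object types are supported: " + part);
            }
            names.add(part.substring(i + 1, end).replace('/', '.'));
            i = end + 1;
        }
        return names;
    }

    public static void main(String[] args) {
        // Descriptor copied from the stack trace above.
        String descriptor = "(Lorg/apache/lucene/document/Document;"
            + "Lorg/apache/nutch/parse/Parse;"
            + "Lorg/apache/hadoop/io/Text;"
            + "Lorg/apache/nutch/crawl/CrawlDatum;"
            + "Lorg/apache/nutch/crawl/Inlinks;)"
            + "Lorg/apache/lucene/document/Document;";
        for (String type : parameterTypes(descriptor)) {
            System.out.println("parameter: " + type);
        }
        System.out.println("returns:   " + returnType(descriptor));
    }
}
```

Run on the descriptor from the trace, this lists org.apache.hadoop.io.Text as the third parameter. An AbstractMethodError usually indicates that a class was compiled against a different version of an interface than the one present at run time, so comparing these types with the filter(...) signature in the plugin source is a reasonable first check.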
[jira] Updated: (NUTCH-386) Plugin to index categories by url rules
Posted by "Ernesto De Santis (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-386?page=all ]
Ernesto De Santis updated NUTCH-386:
------------------------------------
Attachment: index-url-category-0.1.zip
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Ernesto De Santis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490385 ]
Ernesto De Santis commented on NUTCH-386:
-----------------------------------------
Hi Andrey
I don't know what could cause that error.
I have never used this plugin with Nutch 0.9.
I'm not currently working with Nutch, sorry.
Ernesto.
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Agnieszka Zbrzezny (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696051#action_12696051 ]
Agnieszka Zbrzezny commented on NUTCH-386:
------------------------------------------
Hello,
I'm trying to add your plugin to Nutch 1.0. After running bin/nutch crawl urls/ -dir crawl -depth 3, hadoop.log contains:
2009-04-06 09:09:54,128 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-04-06 09:09:54,145 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2009-04-06 09:09:54,147 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-04-06 09:09:54,167 INFO indexer.IndexingFilters - Adding org.b2b.nutch.indexer.UrlCategoryIndexFilter
2009-04-06 09:09:54,168 WARN mapred.LocalJobRunner - job_local_0016
java.lang.AbstractMethodError
at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:73)
at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:61)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
I use org.apache.hadoop.io.UTF8, and I have included the plugin in nutch-site.xml.
Thanks in advance,
Agnieszka
[jira] Updated: (NUTCH-386) Plugin to index categories by url rules
Posted by "Beaucarnea (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Beaucarnea updated NUTCH-386:
-----------------------------
Attachment: index-url-category.jar
This plugin uses the deprecated org.apache.hadoop.io.UTF8 which caused an IOException.
I replaced it with org.apache.hadoop.io.Text and now the plugin works fine again.
The jar file contains the update.
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
Posted by "Stefano Tauriello (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666116#action_12666116 ]
Stefano Tauriello commented on NUTCH-386:
-----------------------------------------
I've overwritten the old .jar with yours and everything runs fine, but when I search in Nutch for a word that matches a category, Nutch finds nothing.
During indexing I can see that the URLs are correctly indexed with their category. Where is the problem?