Posted to dev@nutch.apache.org by "Ernesto De Santis (JIRA)" <ji...@apache.org> on 2006/10/14 22:49:35 UTC

[jira] Created: (NUTCH-386) Plugin to index categories by url rules

Plugin to index categories by url rules
---------------------------------------

                 Key: NUTCH-386
                 URL: http://issues.apache.org/jira/browse/NUTCH-386
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, searcher
            Reporter: Ernesto De Santis
            Priority: Minor


The compressed zip contains an install_notes.txt file with instructions.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Beaucarnea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666122#action_12666122 ] 

Beaucarnea commented on NUTCH-386:
----------------------------------

Did you activate the plugin not only on the crawler side but also on the searcher side? I mean, did you include the plugin in the nutch-site.xml of your Nutch web application in Tomcat?
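
For reference, a minimal sketch of what such an override in nutch-site.xml might look like. It assumes the plugin id is index-url-category and reuses the plugin list posted elsewhere in this thread; the list itself is only an example and should be adapted to the plugins actually in use. The same property would need to be present in the configuration used by the crawler and in the one deployed with the search web application.

```xml
<!-- conf/nutch-site.xml (sketch): overrides the default plugin.includes value -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Plugin list copied from this thread; index-url-category is the added plugin id -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-url-category</value>
  </property>
</configuration>
```

Putting the override in nutch-site.xml rather than editing the defaults file keeps local changes separate from the shipped defaults.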



[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "abdessalem dridi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636578#action_12636578 ] 

abdessalem dridi commented on NUTCH-386:
----------------------------------------

How can I use this plugin?



[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "martin lopez (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710044#action_12710044 ] 

martin lopez commented on NUTCH-386:
------------------------------------

Newbie question: why do you update nutch-conf and not nutch-site?
Do you have a large list of sensible rules for rules.properties?

Kind Regards

Martin Lopez



[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Stefano Tauriello (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666576#action_12666576 ] 

Stefano Tauriello commented on NUTCH-386:
-----------------------------------------

Can someone help me?
It's very urgent, please.



[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Stefano Tauriello (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666129#action_12666129 ] 

Stefano Tauriello commented on NUTCH-386:
-----------------------------------------

I've only modified the nutch-default.xml files on the crawler side and the searcher side, in this way:

 <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-url-category</value>

I've not modified nutch-site.xml.

What do you suggest?



[jira] Issue Comment Edited: (NUTCH-386) Plugin to index categories by url rules

Posted by "martin lopez (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710044#action_12710044 ] 

martin lopez edited comment on NUTCH-386 at 5/15/09 6:07 PM:
-------------------------------------------------------------

Newbie question: why do you update nutch-default and not nutch-site?
Do you have a large list of sensible rules for rules.properties?

Kind Regards

Martin Lopez

      was (Author: mlop49):
    Newbie question: why do you update nutch-conf and not nutch-site?
Do you have a large list of sensible rules for rules.properties?

Kind Regards

Martin Lopez
  


[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Andrey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490189 ] 

Andrey commented on NUTCH-386:
------------------------------

Hi, I'm trying to add your plugin to Nutch 0.9. When I execute

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

I get:

2007-04-20 02:32:24,850 INFO  indexer.IndexingFilters - Adding org.b2b.nutch.indexer.UrlCategoryIndexFilter
2007-04-20 02:32:24,857 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2007-04-20 02:32:25,296 WARN  mapred.LocalJobRunner - job_o8ys6
java.lang.AbstractMethodError: org.b2b.nutch.indexer.UrlCategoryIndexFilter.filter(Lorg/apache/lucene/document/Document;Lorg/apache/nutch/parse/Parse;Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;Lorg/apache/nutch/crawl/Inlinks;)Lorg/apache/lucene/document/Document;
	at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:110)
	at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:215)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2007-04-20 02:32:26,109 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
	at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
	at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
	at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
	at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)






[jira] Updated: (NUTCH-386) Plugin to index categories by url rules

Posted by "Ernesto De Santis (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-386?page=all ]

Ernesto De Santis updated NUTCH-386:
------------------------------------

    Attachment: index-url-category-0.1.zip


        

[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Ernesto De Santis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490385 ] 

Ernesto De Santis commented on NUTCH-386:
-----------------------------------------

Hi Andrey

I don't know what could cause that error.
I have never used this plugin with Nutch 0.9.
I'm not currently working with Nutch, sorry.

Ernesto.



[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Agnieszka Zbrzezny (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696051#action_12696051 ] 

Agnieszka Zbrzezny commented on NUTCH-386:
------------------------------------------

Hello,
I'm trying to add your plugin to Nutch 1.0. After running bin/nutch crawl urls/ -dir crawl -depth 3, hadoop.log contains:

2009-04-06 09:09:54,128 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-04-06 09:09:54,145 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2009-04-06 09:09:54,147 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-04-06 09:09:54,167 INFO  indexer.IndexingFilters - Adding org.b2b.nutch.indexer.UrlCategoryIndexFilter
2009-04-06 09:09:54,168 WARN  mapred.LocalJobRunner - job_local_0016
java.lang.AbstractMethodError
        at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:73)
        at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:61)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

I use org.apache.hadoop.io.UTF8, and I have included the plugin in nutch-site.xml.

Thanks in advance,
Agnieszka



[jira] Updated: (NUTCH-386) Plugin to index categories by url rules

Posted by "Beaucarnea (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Beaucarnea updated NUTCH-386:
-----------------------------

    Attachment: index-url-category.jar

This plugin uses the deprecated org.apache.hadoop.io.UTF8, which caused an IOException.
I replaced it with org.apache.hadoop.io.Text, and now the plugin works fine again.
The attached jar file contains the update.



[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

Posted by "Stefano Tauriello (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666116#action_12666116 ] 

Stefano Tauriello commented on NUTCH-386:
-----------------------------------------

I've overwritten the old .jar with yours and everything runs fine, but when I search in Nutch for a word matching a category, Nutch finds nothing.
During indexing I can see that the URLs are correctly indexed with their category. Where is the problem?

