
Posted to user@nutch.apache.org by Hasan Diwan <ha...@gmail.com> on 2006/02/13 22:25:55 UTC

extension point... does not exist

I placed the URLs for a crawl in urls per the tutorial [1]. Then:
% ./bin/nutch crawl urls -dir crawl.test -depth 2
... gives me the following log:
060213 131631 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060213 131631 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060213 131631 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060213 131631 No FS indicated, using default:local
060213 131631 crawl started in: crawl.test
060213 131631 rootUrlFile = urls
060213 131631 threads = 10
060213 131631 depth = 2
060213 131632 Created webdb at LocalFS,/home/hdiwan/nutch-0.7.1/crawl.test/db
060213 131632 Starting URL processing
060213 131632 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http/plugin.xml
060213 131632 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060213 131632 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060213 131632 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic/plugin.xml
060213 131632 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-more
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060213 131632 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060213 131632 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060213 131632 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex/plugin.xml
060213 131632 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060213 131632 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060213 131632 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.protocol.Protocol does not exist.
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
        at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
        at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.protocol.Protocol does not exist.
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
        ... 4 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.protocol.Protocol does not exist.
        at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
        at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
        ... 5 more
... org/apache/nutch/protocol/Protocol.java does exist, as does
org/apache/nutch/protocol/Protocol.class, and jar tvf nutch-0.7.1.jar
shows that the jar holds the class file. I could investigate further, but
would like some pointers on where to look first. Thanks!
--
Cheers,
Hasan Diwan <ha...@gmail.com>
1. http://lucene.apache.org/nutch/tutorial.html

Max pages in crawl cycle

Posted by Bostjan <bg...@siol.net>.

Hi,

I'm using Nutch 0.7.

Is it possible to crawl only a certain number of pages in a single crawl
cycle (depth)? I looked at the FetchListTool class, and I think it would be
nice if the emitFetchList method had a piece of code in its main loop that
would look something like this:

    if (count > MAX_PAGES_IN_CRAWL_CYCLE) {
        break;
    }
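
The idea of capping a loop can be sketched in isolation. The example below is only an illustration, not Nutch code: the class CappedFetchList, the method emitCapped, and the constant MAX_PAGES_IN_CRAWL_CYCLE are all hypothetical names, and plain strings stand in for fetch-list entries. It checks the counter with >= before emitting, so exactly the configured number of entries gets through; whether the > in the snippet above is off by one depends on where count is incremented in the real loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CappedFetchList {
    // Hypothetical limit; this is not an actual Nutch configuration property.
    static final int MAX_PAGES_IN_CRAWL_CYCLE = 3;

    /** Copies at most maxPages entries from candidates, preserving their order. */
    static List<String> emitCapped(Iterator<String> candidates, int maxPages) {
        List<String> emitted = new ArrayList<>();
        int count = 0;
        while (candidates.hasNext()) {
            // Check before emitting so that no more than maxPages entries pass.
            if (count >= maxPages) {
                break;
            }
            emitted.add(candidates.next());
            count++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Placeholder URLs standing in for the entries of one crawl cycle.
        List<String> urls = Arrays.asList(
            "http://example.com/1", "http://example.com/2",
            "http://example.com/3", "http://example.com/4");
        System.out.println(emitCapped(urls.iterator(), MAX_PAGES_IN_CRAWL_CYCLE));
    }
}
```

A cap like this simply drops whatever comes after the limit, so the order in which the loop visits pages decides which ones are kept in that cycle.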

Thanks,
Bostjan