Posted to dev@nutch.apache.org by Mike Reynols <au...@hotmail.com> on 2005/11/08 06:45:22 UTC
Request for info regarding filesystem based index.
Here's the problem:
I need to get the Nutch engine running on a collection of XML documents that
I have (containing news stories). The files are named as follows:
example.xml.52908
example.xml.52909
example.xml.52910
example.xml.52911
...
example.xml.53365
example.xml.53366
Each XML file contains no HTML, just XML elements (tags) and text. All of
these files (500 to start with) are listed in my 'urls' file. I have followed
the steps in the FAQ
(http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6),
but without success. I'm wondering if I'm missing something.
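For reference, a 'urls' file of this kind could be generated along the
following lines. This is only a minimal sketch: it assumes the XML files sit
in a single directory (the fetch log further down shows
/root/Downloads/topix) and that each entry is written as a file:// URL, one
per line. The function names build_url_list and write_url_file are
hypothetical and not part of Nutch.

```python
from pathlib import Path


def build_url_list(directory, pattern="example.xml.*"):
    """Return sorted file:// URLs for every regular file in `directory`
    whose name matches `pattern` (the pattern is an assumption based on
    the file names listed above)."""
    base = Path(directory).resolve()  # as_uri() needs an absolute path
    return [p.as_uri() for p in sorted(base.glob(pattern)) if p.is_file()]


def write_url_file(urls, out_path="urls"):
    """Write one URL per line to `out_path`."""
    Path(out_path).write_text("\n".join(urls) + "\n")


if __name__ == "__main__":
    # Hypothetical usage; adjust the directory to wherever the files live.
    write_url_file(build_url_list("/root/Downloads/topix"))
```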
When I run the crawl after making the three modifications described there, I
get the following error:
[root@abc nutch-0.7]# bin/nutch crawl urls -dir crawl.test -depth 3
051107 234038 parsing file:/root/Downloads/nutch-0.7/conf/nutch-default.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/crawl-tool.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/nutch-site.xml
051107 234039 No FS indicated, using default:local
051107 234039 crawl started in: crawl.test
051107 234039 rootUrlFile = urls
051107 234039 threads = 10
051107 234039 depth = 3
051107 234039 Created webdb at
LocalFS,/root/Downloads/nutch-0.7/crawl.test/db
051107 234039 Starting URL processing
051107 234039 Plugins: looking in: /root/Downloads/nutch-0.7/plugins
051107 234039 not including:
/root/Downloads/nutch-0.7/plugins/clustering-carrot2
051107 234039 not including:
/root/Downloads/nutch-0.7/plugins/creativecommons
051107 234039 parsing:
/root/Downloads/nutch-0.7/plugins/index-basic/plugin.xml
051107 234039 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/index-more
051107 234039 not including:
/root/Downloads/nutch-0.7/plugins/language-identifier
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/ontology
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/parse-ext
051107 234039 parsing:
/root/Downloads/nutch-0.7/plugins/parse-html/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-js
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-msword
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-pdf
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-rss
051107 234040 parsing:
/root/Downloads/nutch-0.7/plugins/parse-text/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
051107 234040 parsing:
/root/Downloads/nutch-0.7/plugins/protocol-file/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.file.File
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-ftp
051107 234040 parsing:
/root/Downloads/nutch-0.7/plugins/protocol-http/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.http.Http
051107 234040 not including:
/root/Downloads/nutch-0.7/plugins/protocol-httpclient
051107 234040 parsing:
/root/Downloads/nutch-0.7/plugins/query-basic/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/query-more
051107 234040 parsing:
/root/Downloads/nutch-0.7/plugins/query-site/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
051107 234040 parsing:
/root/Downloads/nutch-0.7/plugins/query-url/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
051107 234040 not including:
/root/Downloads/nutch-0.7/plugins/urlfilter-prefix
051107 234040 not including:
/root/Downloads/nutch-0.7/plugins/urlfilter-regex
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
        at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
        at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
        ... 4 more
[root@abc nutch-0.7]#
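The plugin lines in the log above can be summarized mechanically, which makes
it easier to see at a glance which plugins were loaded and which were skipped.
The following is a minimal sketch, assuming the console output has been saved
as text and that the "parsing:" and "not including:" line formats are exactly
as shown above (including the mail client's line wrapping). The function name
summarize_plugins is hypothetical. On the run above it would list
urlfilter-prefix and urlfilter-regex among the skipped plugins, which may be
worth comparing against the "URLFilter not found" exception.

```python
import re
from pathlib import PurePosixPath


def summarize_plugins(log_text):
    """Split the plugin directories mentioned in a Nutch 0.7 console log
    into those that were parsed (loaded) and those reported as
    'not including' (skipped)."""
    # \s+ also matches a newline, so wrapped log lines are handled.
    skipped = re.findall(r"not including:\s+(\S+)", log_text)
    # 'parsing:' (with a colon) marks plugin.xml files; the config files
    # are logged as 'parsing file:...' and are deliberately not matched.
    loaded = re.findall(r"parsing:\s+(\S+)/plugin\.xml", log_text)

    def name(path):
        return PurePosixPath(path).name

    return {
        "loaded": [name(p) for p in loaded],
        "skipped": [name(p) for p in skipped],
    }
```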
When I remove the property recommended in the last step of the process
outlined above, I get the following recurring errors, but the crawl finishes
(unlike the run above, which aborted prematurely):
051107 214422 fetching file:///root/Downloads/topix/example.xml.53324
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53324 failed
with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53077
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53077 failed
with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53376
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53376 failed
with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
url=file
And so on, for all 500 files. The crawl does finish, but as you can see,
nothing was ever indexed or processed (call it what you will).
I have looked through the documentation a thousand times, and this is
holding me up. If anyone here has had a similar problem or has a
solution, please enlighten me. Thanks a ton, guys :)
Tyler