Posted to dev@nutch.apache.org by Mike Reynols <au...@hotmail.com> on 2005/11/08 06:45:22 UTC

Request for info regarding filesystem based index.

Here's the problem:

I need to get Nutch running on a collection of XML documents containing 
news stories. The files are named as follows:

example.xml.52908
example.xml.52909
example.xml.52910
example.xml.52911
...
example.xml.53365
example.xml.53366

Each XML file contains no HTML, just XML elements (tags) and text. I have 
all of these files (500 to start with) listed in my 'urls' file. I have 
followed these steps 
(http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6), 
but without success. I'm wondering if I'm missing something.
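
A 'urls' file of this kind can be generated with a short script. The 
following is only a sketch in Python (it is not part of Nutch). The function 
names build_url_list and write_url_file and the default 'example.xml.' prefix 
are made up for illustration; the file:// URL form is the one that shows up 
in the fetch log further down.

```python
from pathlib import Path


def build_url_list(directory, prefix="example.xml."):
    """Return sorted file:// URLs for regular files in `directory`
    whose names start with `prefix` (hypothetical helper)."""
    base = Path(directory).resolve()  # as_uri() needs an absolute path
    names = sorted(
        p.name for p in base.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
    return [(base / name).as_uri() for name in names]


def write_url_file(urls, out_path):
    """Write one URL per line to `out_path` (hypothetical helper)."""
    with open(out_path, "w") as out:
        for url in urls:
            out.write(url + "\n")
```

Calling write_url_file(build_url_list("/root/Downloads/topix"), "urls") would 
then produce a flat list with one file:// URL per line.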

When I run the crawl after these three modifications, I get the following 
error:

[root@abc nutch-0.7]# bin/nutch crawl urls -dir crawl.test -depth 3
051107 234038 parsing file:/root/Downloads/nutch-0.7/conf/nutch-default.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/crawl-tool.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/nutch-site.xml
051107 234039 No FS indicated, using default:local
051107 234039 crawl started in: crawl.test
051107 234039 rootUrlFile = urls
051107 234039 threads = 10
051107 234039 depth = 3
051107 234039 Created webdb at LocalFS,/root/Downloads/nutch-0.7/crawl.test/db
051107 234039 Starting URL processing
051107 234039 Plugins: looking in: /root/Downloads/nutch-0.7/plugins
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/clustering-carrot2
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/creativecommons
051107 234039 parsing: /root/Downloads/nutch-0.7/plugins/index-basic/plugin.xml
051107 234039 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/index-more
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/language-identifier
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/ontology
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/parse-ext
051107 234039 parsing: /root/Downloads/nutch-0.7/plugins/parse-html/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-js
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-msword
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-pdf
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-rss
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/parse-text/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/protocol-file/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.file.File
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-ftp
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/protocol-http/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-httpclient
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-basic/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/query-more
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-site/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-url/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/urlfilter-prefix
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/urlfilter-regex
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
        at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
        at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
        ... 4 more
[root@abc nutch-0.7]#
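
To make the plugin portion of this log easier to read, the entries can be 
grouped into plugin directories reported as "parsing:" and those reported as 
"not including:". The following is only a sketch in Python; the names 
join_wrapped and summarize_plugins are made up for illustration, and the 
timestamp pattern is an assumption based on the lines shown here. On the 
output above, both urlfilter-prefix and urlfilter-regex land in the 
"not including" group, which may be related to the "URLFilter not found" 
error.

```python
import re

# Assumption: every log record starts with a "yymmdd hhmmss " timestamp;
# any other non-empty line is treated as a wrapped continuation.
TIMESTAMP = re.compile(r"^\d{6} \d{6} ")
PLUGIN_LINE = re.compile(r"(not including|parsing): (\S+)")


def join_wrapped(lines):
    """Merge continuation lines into the record that precedes them."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if TIMESTAMP.match(text) or not records:
            records.append(text)
        else:
            records[-1] += " " + text
    return records


def summarize_plugins(lines):
    """Return (parsed, skipped) lists of plugin directory names."""
    parsed, skipped = [], []
    for record in join_wrapped(lines):
        match = PLUGIN_LINE.search(record)
        if not match:
            continue
        kind, path = match.groups()
        if kind == "not including":
            skipped.append(path.rstrip("/").split("/")[-1])
        elif path.endswith("/plugin.xml"):
            # The plugin name is the directory that holds plugin.xml.
            parsed.append(path.split("/")[-2])
    return parsed, skipped
```

Passing the lines of the saved crawl output to summarize_plugins returns the 
two lists, whether or not the long lines were wrapped by the mail client.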

Now, when I remove the property recommended in the last step of the process 
outlined above, I get the following recurring errors, but the crawl 
finishes (unlike the run above, which aborted prematurely):

051107 214422 fetching file:///root/Downloads/topix/example.xml.53324
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53324 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53077
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53077 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53376
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53376 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

And so on, for all 500 files. The crawl actually finishes, but as you can 
see, nothing was ever indexed or processed (call it what you will).

I have looked through the documentation a thousand times, and this is 
holding me up. If anyone here has had a similar problem or has a solution, 
please enlighten me. Thanks a ton, guys :)

Tyler
