Posted to user@nutch.apache.org by Roeland Weve <ro...@weve.nl> on 2006/02/25 21:56:56 UTC
nutch 0.7.1 > where is the tutorial? crawldb not found?
Hi,
I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried to
follow the tutorial at:
http://lucene.apache.org/nutch/tutorial.html
But this tutorial seems to be written for another version of Nutch.
First of all, the DmozParser is not available (I couldn't find it in
the nutch-0.7.1.jar file, not under 'crawl', 'tools' or anywhere
else):
java.lang.NoClassDefFoundError: org/apache/nutch/crawl/DmozParser
java.lang.NoClassDefFoundError: org/apache/nutch/tools/DmozParser
Since I'm not really interested in Dmoz data, I continued by injecting
URLs of my own into the database (they are in the dmoz dir, in a file
called 'urls' with one URL per line). Unfortunately, I got stuck again.
I tried to execute:
bin/nutch inject crawl/crawldb dmoz
The error is:
> 060225 212634 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-default.xml
> 060225 212635 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-site.xml
> Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir>
>   (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>]
>   [-includeAdultMaterial] [-skew skew] [-noDmozDesc]
>   [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]
So I tried to adjust the parameters, with something like:
> bin/nutch inject crawl/crawldb -urlfile dmoz/urls
But this leads to an exception:
Exception in thread "main" java.io.FileNotFoundException: crawl\crawldb\webdb\pagesByURL\data
There are some files in the crawldb dir, but not the webdb dir. Is there
a possibility to create an empty or default database? Or do I need Nutch
0.8? If yes, where can I download it?
Hopefully this can be done with Nutch 0.7.1, because I'm not a hero at
compiling stuff on Cygwin.
The only thing I want is to inject URLs from a plain text file with one
URL per line. The next step is to crawl those URLs. The URLs are all
different, so I am not interested in the intranet option of Nutch.
Hopefully someone can help me out with this problem.
Roeland
Re: nutch 0.7.1 > where is the tutorial? crawldb not found?
Posted by "Håvard W. Kongsgård" <h....@niap.no>.
http://wiki.media-style.com/display/nutchDocu/Home