Posted to user@nutch.apache.org by Roeland Weve <ro...@weve.nl> on 2006/02/25 21:56:56 UTC

nutch 0.7.1 > where is the tutorial? crawldb not found?

Hi,

I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried to 
follow the tutorial at:
http://lucene.apache.org/nutch/tutorial.html
But this tutorial seems to be written for another version of Nutch. 
First of all, the DmozParser is not available (I couldn't find 
it in the nutch-0.7.1.jar file, not under 'crawl', 'tools' or anywhere 
else):
java.lang.NoClassDefFoundError: org/apache/nutch/crawl/DmozParser
java.lang.NoClassDefFoundError: org/apache/nutch/tools/DmozParser
Since I'm not really interested in Dmoz data, I continued by injecting 
URLs of my own into the database (they are in a file called 'urls' in 
the dmoz dir, one URL per line). Unfortunately, I got stuck again. I 
tried to execute:
bin/nutch inject crawl/crawldb dmoz
The error is:
 > 060225 212634 parsing 
file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-default.xml
 > 060225 212635 parsing 
file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-site.xml
 > Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir> 
(-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset 
<subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] 
[-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]

So I tried to adjust the parameters, with something like:
 > bin/nutch inject crawl/crawldb -urlfile dmoz/urls
But this leads to an exception:
Exception in thread "main" java.io.FileNotFoundException: 
crawl\crawldb\webdb\pagesByURL\data

There are some files in the crawldb dir, but no webdb dir. Is there 
a way to create an empty or default database? Or do I need Nutch 
0.8? If so, where can I download it?
Hopefully this can be done with Nutch 0.7.1, because I'm not a 
hero at compiling stuff on Cygwin.

All I want is to inject URLs from a plain text file with one URL per 
line. The next step is to crawl those URLs. The URLs are all different, 
so I am not interested in the intranet option of Nutch.

Hopefully someone can help me out with this problem.

Roeland


Re: nutch 0.7.1 > where is the tutorial? crawldb not found?

Posted by "Håvard W. Kongsgård" <h....@niap.no>.
http://wiki.media-style.com/display/nutchDocu/Home
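
For reference, a rough sketch of what the 0.7-style steps might look like. The FileNotFoundException on webdb/pagesByURL/data suggests the web database was never created before the inject. The `-urlfile` option comes from the usage message quoted in the question; the `admin ... -create` step is recalled from the 0.7 whole-web tutorial and should be treated as an assumption (running bin/nutch without arguments lists the commands your version supports). DRY_RUN, SEED_FILE, DB_DIR and the run helper are hypothetical names made up for this example, and the two seed URLs are placeholders.

```shell
#!/bin/sh
# Sketch only: prepare a seed file, then show the 0.7-style create + inject steps.
# DRY_RUN, SEED_FILE and DB_DIR are hypothetical names chosen for this example.
DRY_RUN="${DRY_RUN:-1}"
SEED_FILE="${SEED_FILE:-urls.txt}"
DB_DIR="${DB_DIR:-db}"

# In dry-run mode print the command instead of executing it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Seed file with one URL per line (placeholder URLs).
printf '%s\n' 'http://lucene.apache.org/nutch/' 'http://www.apache.org/' > "$SEED_FILE"

# Assumption: 0.7 needs the web database created before the first inject.
run mkdir -p "$DB_DIR"
run bin/nutch admin "$DB_DIR" -create

# Inject the seed URLs using the -urlfile option from the usage message.
run bin/nutch inject "$DB_DIR" -urlfile "$SEED_FILE"
```

With DRY_RUN left at 1 the script only writes the seed file and prints the commands. Setting DRY_RUN=0 from the Nutch install directory would actually run them.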

