Posted to user@nutch.apache.org by Philip Brown <ph...@primeradesigns.com> on 2006/08/24 14:45:46 UTC

nutch - start me up - help please

I am having a little trouble getting Nutch running and would appreciate
any help.
I am using Nutch 0.8.

I have altered my conf/crawl-urlfilter.txt to accept URLs on my local server:

# accept hosts in MY.DOMAIN.NAME
+^http://philfedora5:8080/
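
As a rough sanity check, the accept rule above can be tried against the seed URL outside Nutch. This is only a sketch: Nutch evaluates these patterns as Java regular expressions, while grep -E uses POSIX extended syntax, so the result is an approximation that should agree for a simple prefix pattern like this one. The function name url_accepted is made up for illustration and is not part of Nutch.

```shell
# url_accepted URL: succeed if URL matches the accept pattern from crawl-urlfilter.txt.
# The leading "+" on the filter line means "accept" and is not part of the regex.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq '^http://philfedora5:8080/'
}

# Try the seed URL used later in this post.
if url_accepted 'http://philfedora5:8080/tinysite/A.html'; then
  echo "accepted"
else
  echo "rejected"
fi
```

A check like this only shows that the pattern and the seed URL agree with each other. It says nothing about whether the host name resolves to an address the fetcher can reach.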

When it came to the urls file, I was a little uncertain what to create.
When I created only a "urls" file, I received a message that it was invalid. I
managed to get Nutch running by creating a "urls" folder with a "nutch" file
in it, containing the root URL from which to populate the initial fetchlist.

Contents of the "nutch" file:
http://philfedora5:8080/tinysite/A.html
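
The seed layout described above could be created along these lines. This is a minimal sketch that assumes it is run from the Nutch install directory; the folder name "urls" and file name "nutch" are the ones used in this post, and any file name inside the folder would presumably do.

```shell
# Create the seed folder that "bin/nutch crawl" takes as its first argument.
mkdir -p urls

# Write one root URL per line into a plain text file inside that folder.
printf '%s\n' 'http://philfedora5:8080/tinysite/A.html' > urls/nutch

# Show what was written.
cat urls/nutch
```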

I then run:
bin/nutch crawl urls -dir crawl-tinysite -depth 3  -topN 50

A crawl-tinysite folder is created.

I then run:
bin/nutch readdb crawl-tinysite/crawldb/ -stats
After a bit of churning it printed nothing and returned to the prompt.

I then run:
bin/nutch readdb crawl-tinysite/crawldb/ -dump dumpfile

Inside the dumpfile folder I find part-000000; its contents are:

http://philfedora5:8080/tinysite/A.html    Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu Aug 24 14:36:06 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Shouldn't there also be records for B.html, C.html and C-duplicate.html?
Also, this looks suspicious: Status: 1 (DB_unfetched)

I have been using these tutorials:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
http://lucene.apache.org/nutch/tutorial8.html
Any help would be appreciated.

Re: nutch - start me up - help please

Posted by Philip Brown <ph...@primeradesigns.com>.
How I Learned to Stop Worrying and Love the crawl-urlfilter.txt.

For those who happen upon this compendious thread, I traced my problem
to DNS.

philfedora5 was hardcoded to 127.0.0.1 in the /etc/hosts file. I added a
host name of nutch and mapped it to the IP address of my network card.
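
A quick way to spot this kind of mapping is to ask the system resolver what a host name resolves to and see whether the answer is a loopback address. The sketch below assumes a Linux system where getent is available; the helper name is_loopback is made up for illustration.

```shell
# is_loopback ADDRESS: succeed when ADDRESS is in 127.0.0.0/8 or is the IPv6 loopback ::1.
is_loopback() {
  case "$1" in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve the name through the system resolver, which normally consults /etc/hosts.
host=philfedora5
addr=$(getent hosts "$host" 2>/dev/null | awk '{print $1; exit}') || true

if [ -z "$addr" ]; then
  echo "$host does not resolve"
elif is_loopback "$addr"; then
  echo "$host resolves to loopback ($addr)"
else
  echo "$host resolves to $addr"
fi
```

Running the same check with host=nutch after editing /etc/hosts should then report the network card's address rather than 127.0.0.1.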

In nutch-default.xml I also made sure I had a value in the
"http.agent.name" property:
<property>
  <name>http.agent.name</name>
  <value>tomcatNutch</value>
</property>
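
To confirm which value is actually set, something like the following could be used. It is a sketch only: it assumes the <name> and <value> elements sit on separate lines, as in the excerpt above, and that the file lives at conf/nutch-default.xml relative to the current directory. The function name agent_name is made up for illustration.

```shell
# agent_name FILE: print the <value> that follows <name>http.agent.name</name> in FILE.
agent_name() {
  awk '/<name>http\.agent\.name<\/name>/ { found = 1; next }
       found && /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); print; exit }' "$1"
}

# Only query the file if it exists in the assumed location.
if [ -f conf/nutch-default.xml ]; then
  agent_name conf/nutch-default.xml
fi
```

An empty line of output would mean the property is present but has no value.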