Posted to user@nutch.apache.org by blackwater dev <bl...@gmail.com> on 2008/01/29 15:19:31 UTC
nutch won't crawl on windows
I have Nutch 0.8.1 installed on my Windows XP machine.
I created a directory named urls, and in it a file named yooroc containing the line:
http://www.yooroc.com
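For context, the seed setup described above amounts to the following (a sketch, assuming the commands are run from the top of the Nutch installation directory):

```shell
# Create the seed directory that Nutch's injector reads,
# and a one-line file listing the root URL to crawl.
mkdir -p urls
echo "http://www.yooroc.com" > urls/yooroc

# Verify the seed file contents.
cat urls/yooroc
```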
I then edited crawl-urlfilter.txt and added this line:
s+^http://([a-z0-9]*\.)*yooroc.com/
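For reference, each non-comment line in crawl-urlfilter.txt must begin with + (accept) or - (reject) followed by a regular expression; a stray leading character, like the s above if it really is in the file, can keep the filter from loading. A minimal single-site filter typically looks like this:

```
# accept hosts in yooroc.com
+^http://([a-z0-9]*\.)*yooroc.com/

# reject everything else
-.
```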
Then in nutch-site.xml I have this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <!-- Put site-specific property overrides in this file. -->

  <property>
    <name>http.agent.name</name>
    <value>yooroc</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>yooroc crawling</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://www.yooroc.com</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>hello@yooroc.com</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
  </property>

</configuration>
From Cygwin, I then run:
bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
In crawl.log I get:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
My Java_Home is set to:
c:\Program Files\Java\jre_1.6.0_02
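Not something the log above confirms, but worth checking: under Cygwin, a JAVA_HOME containing spaces (c:\Program Files\...) is a frequent source of Hadoop job failures. A hedged workaround is to point it at a space-free equivalent path before rerunning the crawl, for example:

```shell
# Hypothetical workaround: re-point JAVA_HOME at a space-free path.
# PROGRA~1 is the usual 8.3 short name for "Program Files" on Windows;
# verify the short name on your machine with: cmd /c "dir /x C:\"
export JAVA_HOME="/cygdrive/c/PROGRA~1/Java/jre1.6.0_02"
echo "$JAVA_HOME"
# then rerun: bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
```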
Any ideas what I am doing wrong?
Thanks!
Re: nutch won't crawl on windows
Posted by blackwater dev <bl...@gmail.com>.
Any thoughts on this?
I get the same error with Nutch 0.9.
Thanks.