Posted to user@nutch.apache.org by blackwater dev <bl...@gmail.com> on 2008/01/29 15:19:31 UTC
nutch won't crawl on windows
I have Nutch 0.8.1 installed on my Windows XP machine.
I created a directory named urls, and in it a file named yooroc containing the line:
http://www.yooroc.com
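For context, the seed setup described above amounts to the following (a sketch, assuming the commands are run from the top of the Nutch installation directory):

```shell
# Create the seed directory that Nutch's injector reads,
# and a one-line file listing the root URL to crawl.
mkdir -p urls
echo "http://www.yooroc.com" > urls/yooroc

# Verify the seed file contents.
cat urls/yooroc
```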
I then edited crawl-urlfilter.txt and added this line:
s+^http://([a-z0-9]*\.)*yooroc.com/
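For reference, each non-comment line in crawl-urlfilter.txt must begin with + (accept) or - (reject) followed by a regular expression; a stray leading character, like the s above if it really is in the file, can keep the filter from loading. A minimal single-site filter typically looks like this:

```
# accept hosts in yooroc.com
+^http://([a-z0-9]*\.)*yooroc.com/

# reject everything else
-.
```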
Then in nutch-site.xml I have this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <!-- Put site-specific property overrides in this file. -->

  <property>
    <name>http.agent.name</name>
    <value>yooroc</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>yooroc crawling</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://www.yooroc.com</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>hello@yooroc.com</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
  </property>

</configuration>
From Cygwin, I then run:
bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
In crawl.log I get:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
My Java_Home is set to:
c:\Program Files\Java\jre_1.6.0_02
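Not something the log above confirms, but worth checking: under Cygwin, a JAVA_HOME containing spaces (c:\Program Files\...) is a frequent source of Hadoop job failures. A hedged workaround is to point it at a space-free equivalent path before rerunning the crawl, for example:

```shell
# Hypothetical workaround: re-point JAVA_HOME at a space-free path.
# PROGRA~1 is the usual 8.3 short name for "Program Files" on Windows;
# verify the short name on your machine with: cmd /c "dir /x C:\"
export JAVA_HOME="/cygdrive/c/PROGRA~1/Java/jre1.6.0_02"
echo "$JAVA_HOME"
# then rerun: bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
```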
Any ideas what I am doing wrong?
Thanks!
Re: nutch won't crawl on windows
Posted by blackwater dev <bl...@gmail.com>.
Any thoughts on this?
I get the same error with Nutch 0.9.
Thanks.