Posted to user@nutch.apache.org by Fred Tyre <fr...@hlipublishing.com> on 2006/08/08 18:26:48 UTC

Possible bug in nutch crawl

I was getting the following error at the command line...

java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Exception in thread "main"

I looked in the hadoop.log and found...

java.lang.RuntimeException: Invalid first character:
	at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:144)
	at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:153)
	at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:52)
	at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:56)
	at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
	at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
	at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Caused by: java.io.IOException: Invalid first character:
	at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRulesFile(RegexURLFilterBase.java:186)
	at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:140)
	... 8 more

I checked my crawl-urlfilter.txt file and found that it contained lines
consisting only of spaces.

Once I removed the extraneous spaces from those lines, the error went away.
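
In case it helps anyone who hits the same trace, here is a rough sketch of how the offending lines could be found and cleaned up automatically. It assumes, as the log above suggests, that the rules reader rejects a line made up only of spaces or tabs, while a truly empty line is fine. The function names and the command-line handling are my own and are not part of Nutch.

```python
import sys


def find_whitespace_only_lines(lines):
    """Return 1-based numbers of lines that hold only spaces/tabs.

    Truly empty lines (just a line ending) are not reported.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        content = line.rstrip("\r\n")
        if content != "" and content.strip() == "":
            hits.append(number)
    return hits


def blank_out_whitespace_only_lines(lines):
    """Replace whitespace-only lines with empty lines; keep all others as-is."""
    cleaned = []
    for line in lines:
        if line.strip() == "":
            # Keep the line break (if any) so line numbering stays the same.
            cleaned.append("\n" if line.endswith("\n") else "")
        else:
            cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    # Hypothetical usage: pass the path to your own crawl-urlfilter.txt.
    path = sys.argv[1]
    with open(path, encoding="utf-8") as handle:
        original = handle.readlines()
    for number in find_whitespace_only_lines(original):
        print(f"line {number} contains only whitespace")
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(blank_out_whitespace_only_lines(original))
```

Note that the script rewrites the file in place, so keep a backup copy before running it against a real configuration.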

Sincerely,
Fred

><><><><><><><><><><><><><><><><><><
   Fred Tyre
   Information Services
   Heartland Communications, Inc.
   515-574-2147
   Fred.Tyre@hlipublishing.com
><><><><><><><><><><><><><><><><><><