Posted to user@nutch.apache.org by jim shirreffs <jp...@verizon.net> on 2007/04/05 18:51:11 UTC

Run Job Crashing

Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll (Nov 2004 build and the latest release)


Very strange. I ran the crawler once

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and everything worked until this error:


Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)


I tried running the crawler again

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and now I consistently get this error:

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

I have one file, localhost, in my urls dir, and it looks like this:

http://localhost

My crawl-urlfilter.txt looks like this:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/

# skip everything else

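Side note: the filter semantics described in the comments above (the first matching '+' or '-' pattern decides, and a URL matching nothing is dropped) can be sanity-checked with a small standalone program. This is only a rough sketch using plain java.util.regex, assuming the file sits at conf/crawl-urlfilter.txt; it is not Nutch's actual RegexURLFilter plugin:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

// Rough stand-in for the first-match-wins filter semantics:
// the first matching +/- pattern decides, no match means drop.
public class FilterCheck {
    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "http://localhost/";
        BufferedReader in =
            new BufferedReader(new FileReader("conf/crawl-urlfilter.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0 || line.charAt(0) == '#') continue;
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') continue;
            // find(), not matches(): a pattern may hit anywhere in the URL
            if (Pattern.compile(line.substring(1)).matcher(url).find()) {
                System.out.println(url + " -> "
                    + (sign == '+' ? "accepted" : "rejected"));
                in.close();
                return;
            }
        }
        in.close();
        System.out.println(url + " -> rejected (no pattern matched)");
    }
}
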
My nutch-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>RadioCity</value>
  <description></description>
</property>

<property>
  <name>http.agent.description</name>
  <value>nutch web crawler</value>
  <description></description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch</value>
  <description></description>
</property>

<property>
  <name>http.agent.email</name>
  <value>jpsb at flash.net</value>
  <description></description>
</property>
</configuration>
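
For what it's worth, these agent properties can be verified the way Nutch resolves them, with nutch-site.xml overriding nutch-default.xml. A rough sketch, assuming the Nutch 0.8 jars are on the classpath and using the standard NutchConfiguration helper:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Prints the effective agent settings; NutchConfiguration.create()
// loads nutch-default.xml first, then nutch-site.xml, so site values win.
public class AgentCheck {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        System.out.println("http.agent.name  = " + conf.get("http.agent.name"));
        System.out.println("http.agent.email = " + conf.get("http.agent.email"));
        // An unset http.agent.name is a common cause of failed fetch jobs.
        if (conf.get("http.agent.name", "").trim().length() == 0) {
            System.err.println("Warning: http.agent.name is empty");
        }
    }
}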


I am getting the same behavior on two separate hosts.  If anyone can suggest 
what I might be doing wrong, I would greatly appreciate it.

jim s

PS: I tried to mail from a different host but did not see the message on the 
mailing list. I hope only this message makes it onto the list.


Re: Run Job Crashing

Posted by jim shirreffs <jp...@verizon.net>.
Figured this one out; posting it just in case some other newbie has the same problem.

Windows places hidden files (such as desktop.ini) in the urls dir if one 
customizes the folder view. These files must be removed, or Nutch thinks they 
are URL files and processes them. Once the hidden files are removed, all is well.
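
To check for such files before re-running the crawl, something like this will list anything hidden in the seed directory. It is plain java.io, nothing Nutch-specific, and the default directory name urls is just taken from the command line above:

import java.io.File;

// Lists every entry in the seed directory and flags hidden ones;
// the injector reads all files it finds there, hidden or not.
public class HiddenFileCheck {
    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "urls");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        for (File f : files) {
            System.out.println((f.isHidden() ? "HIDDEN   " : "visible  ")
                + f.getName());
        }
    }
}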

jim s


