Posted to user@nutch.apache.org by Jan Philippe Wimmer <in...@jepse.net> on 2012/09/06 15:10:02 UTC

SolrDeleteDuplicates: java.io.IOException: Job failed!

Hi there,

I have been looking for a solution on various mailing lists for a while now.
Nutch simply doesn't index this specific page. A crawl without the solr
parameter works fine. Only when indexing to Solr (and this is a Nutch
component) does it throw the following error (see use case). There might be
something wrong with the SolrDeleteDuplicates routine. Because of this
exception, Nutch doesn't index any of the crawled pages (total crawl
time >30 min.).

Here's the use case:

When I try to crawl "http://www.stimme.de/sport/", the crawl process
works fine. When it comes to the SolrDeleteDuplicates part, it throws the
following error and stops indexing altogether. No entries are in the Solr
index (but there are plenty of crawled pages).
ERROR:
SolrIndexer: starting at 2012-09-06 09:40:37
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2012-09-06 09:41:20
SolrDeleteDuplicates: Solr url: http://192.168.1.144:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:193)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:63)

After reading this thread about a similar problem,
http://lucene.472066.n3.nabble.com/SolrIndex-java-io-IOException-Job-failed-td3585509.html
I rebuilt with solr-solrj-3.5.jar ... still the same error.

Any ideas how to fix that?

I'm using Solr 3.4 as the Solr server and Nutch 1.5.1.

Since this is a regular page to crawl, I thought this was supposed to be a
bug. So I posted it to the bug tracker
(https://issues.apache.org/jira/browse/NUTCH-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449561#comment-13449611).
They told me that this isn't a bug. But what else is it?

Any suggestions?

Thanks,
Philippe