Posted to user@nutch.apache.org by Jan Philippe Wimmer <in...@jepse.net> on 2012/09/06 15:10:02 UTC

SolrDeleteDuplicates: java.io.IOException: Job failed!

Hi there,

I have been searching various mailing lists for a solution for a while now.
Nutch simply doesn't index this specific page. A crawl without the solr
parameter works fine. Only when indexing to Solr (the indexer is a Nutch
component) does it throw the following error (see use case). There might be
something wrong with the SolrDeleteDuplicates routine. Because of this
exception, Nutch doesn't index any of the crawled pages (total crawl
time >30 min.).

Here's the use case:

When I try to crawl "http://www.stimme.de/sport/", the crawl process
works fine. When it reaches the SolrDeleteDuplicates step, it throws the
following error and stops indexing altogether. There are no entries in the
Solr index (although there are plenty of crawled pages).
ERROR:
SolrIndexer: starting at 2012-09-06 09:40:37
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2012-09-06 09:41:20
SolrDeleteDuplicates: Solr url: http://192.168.1.144:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:193)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:63)
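
The "Job failed!" message in the trace above is only the generic wrapper thrown by JobClient.runJob; the underlying cause is usually written to the Hadoop log rather than to the console. As a rough aid for digging it out, here is a minimal Python sketch that lists the exception-related lines of a log file. The path logs/hadoop.log is an assumption (a common location for a local Nutch 1.x run, not confirmed for this setup), and the helper name find_error_lines is made up for illustration.

```python
import re
from pathlib import Path

# Markers that usually accompany the real failure behind a generic "Job failed!".
PATTERN = re.compile(r"(Exception|Caused by:|ERROR|WARN)")


def find_error_lines(log_text, context=2):
    """Return (line_number, line) pairs for matching lines,
    plus up to `context` lines that follow each match."""
    lines = log_text.splitlines()
    wanted = set()
    for i, line in enumerate(lines):
        if PATTERN.search(line):
            # Keep the match itself and a few following lines (stack frames).
            wanted.update(range(i, min(i + context + 1, len(lines))))
    return [(i + 1, lines[i]) for i in sorted(wanted)]


if __name__ == "__main__":
    # Assumed log location; adjust to wherever your Nutch run writes its log.
    log_path = Path("logs/hadoop.log")
    if log_path.exists():
        text = log_path.read_text(errors="replace")
        for number, line in find_error_lines(text):
            print(f"{number}: {line}")
    else:
        print(f"{log_path} not found; adjust the path")
```

Run it from the Nutch runtime directory right after a failed crawl; the lines nearest the end of the output are the most likely to belong to the SolrDeleteDuplicates job.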

After reading this thread about a similar problem
http://lucene.472066.n3.nabble.com/SolrIndex-java-io-IOException-Job-failed-td3585509.html
I rebuilt with solr-solrj-3.5.jar, but I still get the same error.

Any ideas how to fix that?

I'm using Solr 3.4 as the Solr server and Nutch 1.5.1.

Since this is an ordinary page to crawl, I assumed this was a bug, so I
posted it to the bug tracker
(https://issues.apache.org/jira/browse/NUTCH-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449561#comment-13449611).
They told me that this isn't a bug. But what else could it be?

Any suggestions?

Thanks,
Philippe