Posted to user@nutch.apache.org by Sourajit Basak <so...@gmail.com> on 2013/07/10 09:16:54 UTC

Re: A bug in the crawl script in Nutch 1.6

The dedup stage fails with the following error.

SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/collection5
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:390)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:395)
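
If I read the script right, the dedup stage just calls the solrdedup command, so the failure can be reproduced by hand; the root cause behind the generic "Job failed!" message usually ends up in logs/hadoop.log. A rough sketch, assuming the stock bin/nutch launcher in 1.6 and the same Solr URL as in the log above:

 # run SolrDeleteDuplicates by hand against the same collection
 bin/nutch solrdedup http://localhost:8983/solr/collection5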


On Sat, Jun 22, 2013 at 8:03 AM, Tejas Patil <te...@gmail.com> wrote:

> Thanks, Joe, for pointing it out. There was a JIRA issue [0] for this bug and
> the fix is already present in trunk.
>
> [0] : https://issues.apache.org/jira/browse/NUTCH-1500
>
>
> On Fri, Jun 21, 2013 at 7:11 PM, Joe Zhang <sm...@gmail.com> wrote:
>
> > The new crawl script is quite useful. Thanks for the addition.
> >
> > It comes with a bug, though:
> >
> >
> > Line 169:
> >
> >  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $SEGMENT
> >
> > should be:
> >
> >  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
> >
> > instead.
> >
>
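
For context on why the full path matters: in the stock script the segment variable holds only the newest segment's name (a timestamp), not a path, so the indexing call needs $CRAWL_PATH/segments/ prepended. A rough sketch of the relevant lines, assuming the Nutch 1.6 bin/crawl layout (exact line numbers may differ):

 # $SEGMENT is just the newest segment's directory name, not a full path
 SEGMENT=`ls "$CRAWL_PATH"/segments/ | sort -n | tail -n 1`
 # ... so solrindex has to be pointed at the segment under $CRAWL_PATH/segments/
 $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT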