Posted to user@nutch.apache.org by cameron tran <ca...@gmail.com> on 2012/05/18 06:58:39 UTC

ERROR solr.SolrIndexer - java.io.IOException: Job failed!

Hello

I am trying to get Nutch 1.4 (downloaded binary) to run solrindex against
http://127.0.0.1:8983/solr/ but am getting the following error. I am using
Solr 3.6.0; the error is shown in bold below.

Is there some incompatibility issue?

Ran
bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -threads 3 -depth 3
topN 300

Thank you for your help

org.apache.solr.common.SolrException: ERROR: [doc=http://www.website.com/]
unknown field 'site'

*ERROR: [doc=http://www.website.com/] unknown field 'site'*

request: http://127.0.0.1:8983/solr/update?wt=javabin&version=2
    at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
    at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:93)
    at
org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
    at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
*2012-05-18 14:21:46,921 ERROR solr.SolrIndexer - java.io.IOException: Job
failed!*
2012-05-18 14:21:46,921 INFO  solr.SolrDeleteDuplicates -
SolrDeleteDuplicates: starting at 2012-05-18 14:21:46
2012-05-18 14:21:46,921 INFO  solr.SolrDeleteDuplicates -
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr
2012-05-18 14:21:48,640 INFO  solr.SolrDeleteDuplicates -
SolrDeleteDuplicates: finished at 2012-05-18 14:21:48, elapsed: 00:00:01
2012-05-18 14:21:48,640 INFO  crawl.Crawl - crawl finished:
crawl-20120518141951

Re: ERROR solr.SolrIndexer - java.io.IOException: Job failed!

Posted by cameron tran <ca...@gmail.com>.
Hello Jim and Tolga

Thanks for this... I copied Nutch's schema.xml over to Solr and it works now.
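
In case it is useful to others, the copy step was essentially the following
(the paths are placeholders; use wherever your Nutch conf/ directory and
Solr core actually live):

    # placeholder paths: point these at your own Nutch and Solr installs
    cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml
    # restart Solr afterwards so it reloads the new schema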

When running:
bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -threads 3 -depth 5
topN 1000

It only seems to index 8 docs: a query for *:* in Solr's admin UI returns
only 8 documents in the results.
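
The same count can be checked from the command line; this assumes the
default single-core Solr URL, and rows=0 just returns the total without any
documents:

    curl "http://127.0.0.1:8983/solr/select?q=*:*&rows=0&wt=json"
    # numFound in the response is the total number of indexed documents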

I have tried stopping and starting Solr and running Nutch again (with
different depth and topN parameters), and the result is always the same. I
have also tried adding more seeds to urls\seeds.txt, one URL per line (see
the example below), but it makes no difference.
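
For reference, the seed file is plain text with one URL per line; these
entries are only placeholders:

    http://www.website.com/
    http://www.example.org/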

What commands in Nutch can I use to get it to crawl the site again and add
to Solr's index?

I tried bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -threads 3
-depth 5 topN 1000 solrindex

But this gives the error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:c:/nutch14/runtime/local/solrindex
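
(The trailing "topN 1000 solrindex" arguments are probably being read by the
crawl command as the URL directory, since -topN is missing its leading dash,
which would explain the "Input path does not exist ... solrindex" message.)

For reference, the individual steps can also be run by hand and finished
with solrindex, roughly like this. The directory and segment names are only
examples, and the exact solrindex arguments vary between Nutch versions, so
check the usage printed when bin/nutch solrindex is run with no arguments:

    # "crawl" is an example directory; use wherever your crawldb/segments live
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # replace 20120518150000 with the segment directory generate actually created
    bin/nutch fetch crawl/segments/20120518150000
    bin/nutch parse crawl/segments/20120518150000
    bin/nutch updatedb crawl/crawldb crawl/segments/20120518150000
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb crawl/linkdb crawl/segments/20120518150000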

Thank you


On Fri, May 18, 2012 at 9:20 PM, Jim Chandler <ja...@gmail.com> wrote:

> You need to add the site field in your schema.xml - in your solr.
>
> Jim
>
> On Fri, May 18, 2012 at 12:58 AM, cameron tran <cameront168@gmail.com> wrote:
> > [...]

Re: ERROR solr.SolrIndexer - java.io.IOException: Job failed!

Posted by Jim Chandler <ja...@gmail.com>.
You need to add the site field in your schema.xml - in your solr.
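
For example, something along these lines (this is only an illustration; the
exact field type and attributes should be copied from the schema.xml that
ships in Nutch's conf/ directory):

    <!-- illustration only: take the real definition from Nutch's conf/schema.xml -->
    <field name="site" type="string" stored="false" indexed="true"/>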

Jim

On Fri, May 18, 2012 at 12:58 AM, cameron tran <ca...@gmail.com> wrote:

> [...]

Re: ERROR solr.SolrIndexer - java.io.IOException: Job failed!

Posted by Tolga <to...@ozses.net>.
Hi Cameron,

I've been dealing with the same issue, and I take care of it by adding the
missing field, in your case 'site', to Solr's schema.xml and restarting Solr.

On 5/18/12 7:58 AM, cameron tran wrote:
> [...]
>