Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2012/09/20 16:19:39 UTC
problem with big crawl process
Hi all.
I have a problem when I try to do a big crawl, specifically when the topN parameter is larger than 1000.
I'm using Nutch 1.4 and Solr 3.4 on a PC with these specs:
Intel Core i3, 2 GB RAM, 160 GB HDD.
The problem is an exception (java.io.IOException: Job failed!) at the moment documents are added to the Solr index, and I don't know how to fix it. I have already reduced solr.commit.size from 1000 to 250, but the problem still happens. Any idea, recommendation, or possible solution will be appreciated.
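For context, solr.commit.size is the Nutch property I changed; it controls how many documents SolrWriter buffers per update request before sending them to Solr. A sketch of the override in conf/nutch-site.xml (250 is the value I tried):

```xml
<!-- conf/nutch-site.xml: batch size of documents SolrWriter sends
     to Solr in one update request (Nutch default is 1000) -->
<property>
  <name>solr.commit.size</name>
  <value>250</value>
</property>
```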
This is part of my hadoop.log file:
2012-09-20 09:46:57,148 INFO solr.SolrMappingReader - source: language dest: language
2012-09-20 09:46:57,148 INFO solr.SolrMappingReader - source: url dest: url
2012-09-20 09:46:57,704 INFO solr.SolrWriter - Adding 250 documents
2012-09-20 09:46:59,974 INFO solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:01,578 INFO solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:02,137 INFO solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:02,816 INFO solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:03,272 WARN mapred.LocalJobRunner - job_local_0030
org.apache.solr.common.SolrException: Petición incorrecta
Petición incorrecta
request: http://localhost:8080/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:81)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:166)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:51)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2012-09-20 09:47:03,447 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
2012-09-20 09:47:03,448 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-09-20 09:47:03
2012-09-20 09:47:03,448 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8080/solr
2012-09-20 09:47:05,039 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: finished at 2012-09-20 09:47:05, elapsed: 00:00:01
2012-09-20 09:47:05,040 INFO crawl.Crawl - crawl finished: crawl
10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci
RE: problem with big crawl process
Posted by Markus Jelsma <ma...@openindex.io>.
Hi
"Petición incorrecta" is an HTTP 400 Bad Request? Check your Solr log; there must be something in there.
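One way to surface the real error behind the javabin 400 (a sketch, assuming the stock XML update handler at the URL shown in the log, and that an "id" field exists in your schema.xml) is to post a minimal document by hand and read the error message Solr returns in the response body:

```shell
# Post one minimal document to the same update handler the log shows.
# The field name "id" is only an example; use a field from your schema.
curl 'http://localhost:8080/solr/update' \
  -H 'Content-Type: text/xml' \
  --data-binary '<add><doc><field name="id">test-doc-1</field></doc></add>'
```

If this fails too, the XML response (and the Solr/Tomcat log) usually names the offending field, which the SolrJ javabin client hides behind the generic "Petición incorrecta".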
Cheers
-----Original message-----
> From:Eyeris Rodriguez Rueda <er...@uci.cu>
> Sent: Thu 20-Sep-2012 16:23
> To: user@nutch.apache.org
> Subject: problem with big crawl process