Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2012/09/20 16:19:39 UTC

problem with big crawl process

Hi all.
I have a problem when I try to run a big crawl, specifically when the topN parameter is larger than 1000.
I'm using Nutch 1.4 and Solr 3.4 on a PC with these specs:
Intel Core i3, 2 GB RAM, 160 GB HDD.

The problem is an exception (java.io.IOException: Job failed!) at the moment the documents are added to the Solr index. I don't know how to fix it; I have already reduced solr.commit.size from 1000 to 250, but the problem still happens. Any idea, recommendation, or possible solution will be appreciated.
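For reference, solr.commit.size is a standard Nutch property, overridden in conf/nutch-site.xml; a minimal sketch of the override (the value shown is just the one already tried above):

```xml
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<property>
  <name>solr.commit.size</name>
  <value>250</value>
  <description>Number of documents buffered before sending them to Solr.</description>
</property>
```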


this is a part of my hadoop.log file

2012-09-20 09:46:57,148 INFO  solr.SolrMappingReader - source: language dest: language
2012-09-20 09:46:57,148 INFO  solr.SolrMappingReader - source: url dest: url
2012-09-20 09:46:57,704 INFO  solr.SolrWriter - Adding 250 documents
2012-09-20 09:46:59,974 INFO  solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:01,578 INFO  solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:02,137 INFO  solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:02,816 INFO  solr.SolrWriter - Adding 250 documents
2012-09-20 09:47:03,272 WARN  mapred.LocalJobRunner - job_local_0030
org.apache.solr.common.SolrException: Petición incorrecta

Petición incorrecta

request: http://localhost:8080/solr/update?wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
	at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:81)
	at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
	at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
	at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
	at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:166)
	at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:51)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2012-09-20 09:47:03,447 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
2012-09-20 09:47:03,448 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-09-20 09:47:03
2012-09-20 09:47:03,448 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8080/solr
2012-09-20 09:47:05,039 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: finished at 2012-09-20 09:47:05, elapsed: 00:00:01
2012-09-20 09:47:05,040 INFO  crawl.Crawl - crawl finished: crawl

10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

RE: problem with big crawl process

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

"Petición incorrecta" is Spanish for "Bad Request", so this looks like an HTTP 400. Check your Solr log; there must be something there.
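The Solr-side detail behind a 400 lands in the servlet container's log rather than in Nutch's hadoop.log. Assuming Solr runs under Tomcat (the thread's URL on port 8080 suggests it), a grep like the one below surfaces it. The log path and the sample lines are hypothetical, written here only to show the pattern; "unknown field" is one common cause of a 400 on /solr/update, not a confirmed diagnosis of this case.

```shell
# Real location is typically $CATALINA_HOME/logs/catalina.out -- adjust
# for your install. A fabricated sample stands in for it here so the
# grep has something to run against.
cat > /tmp/catalina.out.sample <<'EOF'
20-sep-2012 9:47:03 org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR: unknown field 'tstamp'
EOF

# -A1 prints one line of context after each match, so the SEVERE
# detail line following the SolrException header is included.
grep -A1 'SolrException' /tmp/catalina.out.sample
```

Whatever SEVERE line shows up there names the field or request that Solr rejected, which is exactly the detail hadoop.log never records.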

Cheers

 
 
-----Original message-----
> From:Eyeris Rodriguez Rueda <er...@uci.cu>
> Sent: Thu 20-Sep-2012 16:23
> To: user@nutch.apache.org
> Subject: problem with big crawl process