Posted to user@nutch.apache.org by brad <br...@bcs-mail.net> on 2010/09/24 04:59:03 UTC

Nutch 1.2 solrdedup and OutOfMemoryError

I'm running into an error trying to run solrdedup:
bin/nutch solrdedup http://127.0.0.1:8080/solr-nutch/

2010-09-23 18:37:16,119 INFO  mapred.JobClient - Running job: job_local_0001
2010-09-23 18:37:17,123 INFO  mapred.JobClient -  map 0% reduce 0%
2010-09-23 18:52:17,801 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
	at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:323)
	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:204)
	at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:405)
	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:171)
	at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:339)
	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:206)
	at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:110)
	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:173)
	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:233)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)


I'm running Solr via Tomcat.  Tomcat is being started with the memory
parameters:
-Xms2048m -Xmx2048m

So basically there is 2 GB of memory allocated to heap space.  I have
noticed that changing the parameters can shift where the error occurs,
but the bottom line is I still run out of heap space.

Nutch runs for about 15 minutes and then the error occurs.

I only have one Solr index, and the data/index directory is about 85 GB.
I'm using the solrconfig.xml file as delivered.

Is there something else I need to do?  Is there some change to the Solr or
Tomcat config that I have missed?
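
A minimal sketch of one thing to try: the stack trace points at the local
Hadoop job running inside the Nutch client JVM (LocalJobRunner), not at
Tomcat, so raising Tomcat's -Xmx alone may not be enough. Assuming the stock
Nutch 1.2 bin/nutch script, which honors the NUTCH_HEAPSIZE environment
variable (a value in megabytes), the client-side heap could be raised like
this (the 4000 MB figure is only an example):

  # hypothetical: enlarge the heap of the JVM that runs solrdedup itself
  export NUTCH_HEAPSIZE=4000
  bin/nutch solrdedup http://127.0.0.1:8080/solr-nutch/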


Config:
Nutch Release 1.2 - 08/07/2010
CentOS Linux 5.5 
Linux 2.6.18-194.3.1.el5 on x86_64 
Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
8gb of ram


Thanks
Brad



RE: Nutch 1.2 solrdedup and OutOfMemoryError

Posted by brad <br...@bcs-mail.net>.
Here is some more information:

The Tomcat status before the process starts shows:
Tomcat Version: Apache Tomcat/6.0.26
JVM Version: 1.6.0-b09
JVM Vendor: Sun Microsystems Inc.

Free memory: 1033.68 MB Total memory: 1873.56 MB Max memory: 1873.56 MB

When it stops with the out-of-memory error, it still shows about 500 MB of
memory free, and it is working on this request:
GET /solr-nutch/select?q=id%3A%5B*+TO+*%5D&fl=id%2Cboost%2Ctstamp%2Cdigest&start=0&rows=9634413&wt=javabin&version=1 HTTP/1.1
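
Decoded, that request asks Solr to return the entire index in one response
(rows=9634413), which the SolrJ client on the Nutch side then has to buffer
in memory. Purely as an illustration of the difference, a paged version of
the same query would look something like this (the 10000-row page size is an
arbitrary example; Nutch 1.2's solrdedup does not page this way out of the
box):

  GET /solr-nutch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=10000&wt=javabin&version=1 HTTP/1.1
  GET /solr-nutch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=10000&rows=10000&wt=javabin&version=1 HTTP/1.1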




RE: Nutch 1.2 solrdedup and OutOfMemoryError

Posted by Markus Jelsma <ma...@buyways.nl>.
Although it might be possible to list multiple fields for a deduplication
processor, I doubt the usefulness of it. If multiple fields are concatenated
before hashing, you can only deduplicate documents that have identical bodies
for all fields. I'd rather define a single processor for a single field and
create a Solr field for saving the digest. This way I can deduplicate
documents that have similar bodies (beware of the breadcrumbs in sites) OR
exactly the same title for different URLs.
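
A rough sketch of that idea in solrconfig.xml, assuming Solr's
SignatureUpdateProcessorFactory; the field names (content, content_digest)
are only placeholders, and the digest field would also have to be declared
in schema.xml as an indexed string field:

  <updateRequestProcessorChain name="dedupe-content">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- hash a single field and store the digest in its own field -->
      <str name="fields">content</str>
      <str name="signatureField">content_digest</str>
      <!-- delete documents whose digest already exists -->
      <bool name="overwriteDupes">true</bool>
      <!-- TextProfileSignature is a fuzzy hash that tolerates small
           differences such as breadcrumb trails -->
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

A second chain over the title field could be defined the same way to catch
the "same title, different URL" case.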
 
-----Original message-----
From: Nemani, Raj <Ra...@turner.com>
Sent: Fri 24-09-2010 23:09
To: user@nutch.apache.org; 
Subject: RE: Nutch 1.2 solrdedup and OutOfMemoryError

Well, I think you can specify a list of fields in SolrConfig.xml  during
dedup configuration to control how Solr determines if two documents are
identical.  It should be pretty flexible.  Correct me of course if I
misunderstood your comment.


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



RE: Nutch 1.2 solrdedup and OutOfMemoryError

Posted by "Nemani, Raj" <Ra...@turner.com>.
Well, I think you can specify a list of fields in SolrConfig.xml  during
dedup configuration to control how Solr determines if two documents are
identical.  It should be pretty flexible.  Correct me of course if I
misunderstood your comment.
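
For instance, the processor's fields parameter in solrconfig.xml takes a
comma-separated list (these field names are only placeholders):

  <str name="fields">url,title,content</str>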




RE: Nutch 1.2 solrdedup and OutOfMemoryError

Posted by brad <br...@bcs-mail.net>.
Thanks for the info.  I'll give the Solr deduplication a try.  It looks like
it's not as thorough as the regular dedup process (URL, content, highest
score, shortest URL), but I think it will work.

Brad 




Re: Nutch 1.2 solrdedup and OutOfMemoryError

Posted by Markus Jelsma <ma...@buyways.nl>.
I'm not surprised that your memory is eaten when fetching almost 10 million
documents! It's a bit tough to read the deduplication code, but it looks like
it's hardcoded to fetch all records and split them between maps. If you've got
one map, it'll fetch all records and so eat your memory.

I'm unsure how this can be fixed, but in the meantime you can solve it by
implementing deduplication in your solrconfig.
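
For reference, a minimal sketch of such a setup, based on Solr's
SignatureUpdateProcessorFactory as documented for Solr 1.4; treat the field
names as placeholders and double-check the parameter names against your Solr
version (the handler parameter was update.processor in 1.4 and became
update.chain in later releases):

  <!-- solrconfig.xml: a deduplicating update chain -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- "signature" must also be declared in schema.xml as an indexed field -->
      <str name="signatureField">signature</str>
      <!-- delete documents that hash to an already existing signature -->
      <bool name="overwriteDupes">true</bool>
      <str name="fields">url,content</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <!-- wire the chain into the update handler Nutch posts to -->
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

This removes duplicates at indexing time, so it sidesteps the huge solrdedup
query entirely.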


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350