Posted to user@nutch.apache.org by Ar...@csiro.au on 2011/10/27 05:54:54 UTC

OutOfMemoryError when indexing into Solr

Hi,

I am working with a Nutch 1.4 snapshot and having a very strange problem that makes the system run out of memory when indexing into Solr. This does not look like a trivial lack of memory problem that can be solved by giving more memory to the JVM. I've increased the max memory size from 2Gb to 3Gb, then to 6Gb, but this did not make any difference.

A log extract is included below.

Would anyone have any idea of how to fix this problem?

Thanks,

Arkadi


2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
java.lang.OutOfMemoryError: Java heap space
       at java.util.Arrays.copyOfRange(Arrays.java:3209)
       at java.lang.String.<init>(String.java:215)
       at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
       at java.nio.CharBuffer.toString(CharBuffer.java:1157)
       at org.apache.hadoop.io.Text.decode(Text.java:350)
       at org.apache.hadoop.io.Text.decode(Text.java:322)
       at org.apache.hadoop.io.Text.readString(Text.java:403)
       at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
       at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
       at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
       at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
       at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
       at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
       at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!


Re: OutOfMemoryError when indexing into Solr

Posted by Markus Jelsma <ma...@openindex.io>.
Your problem is not the same, judging from the stack trace on the Solr list:
your Solr runs OOM, not Nutch.

On Thursday 27 October 2011 14:20:10 Fred Zimmerman wrote:
> I'm having the exact same problem. I am trying to isolate whether it is a
> Solr problem or a Nutch+Solr problem.
> 
> On Wed, Oct 26, 2011 at 11:54 PM, <Ar...@csiro.au> wrote:
> > Hi,
> > 
> > I am working with a Nutch 1.4 snapshot and having a very strange problem
> > that makes the system run out of memory when indexing into Solr. This
> > does not look like a trivial lack of memory problem that can be solved
> > by giving more memory to the JVM. I've increased the max memory size
> > from 2Gb to 3Gb, then to 6Gb, but this did not make any difference.
> > 
> > A log extract is included below.
> > 
> > Would anyone have any idea of how to fix this problem?
> > 
> > Thanks,
> > 
> > Arkadi
> > 
> > 
> > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> > java.lang.OutOfMemoryError: Java heap space
> > 
> >       at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >       at java.lang.String.<init>(String.java:215)
> >       at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> >       at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> >       at org.apache.hadoop.io.Text.decode(Text.java:350)
> >       at org.apache.hadoop.io.Text.decode(Text.java:322)
> >       at org.apache.hadoop.io.Text.readString(Text.java:403)
> >       at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> >       at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> >       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >       at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> >       at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> >       at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> >       at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> >       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> >       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> >       at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> >       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> >       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: OutOfMemoryError when indexing into Solr

Posted by Fred Zimmerman <zi...@gmail.com>.
I'm having the exact same problem. I am trying to isolate whether it is a
Solr problem or a Nutch+Solr problem.

On Wed, Oct 26, 2011 at 11:54 PM, <Ar...@csiro.au> wrote:

> Hi,
>
> I am working with a Nutch 1.4 snapshot and having a very strange problem
> that makes the system run out of memory when indexing into Solr. This does
> not look like a trivial lack of memory problem that can be solved by giving
> more memory to the JVM. I've increased the max memory size from 2Gb to 3Gb,
> then to 6Gb, but this did not make any difference.
>
> A log extract is included below.
>
> Would anyone have any idea of how to fix this problem?
>
> Thanks,
>
> Arkadi
>
>
> 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOfRange(Arrays.java:3209)
>       at java.lang.String.<init>(String.java:215)
>       at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
>       at java.nio.CharBuffer.toString(CharBuffer.java:1157)
>       at org.apache.hadoop.io.Text.decode(Text.java:350)
>       at org.apache.hadoop.io.Text.decode(Text.java:322)
>       at org.apache.hadoop.io.Text.readString(Text.java:403)
>       at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
>       at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>       at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
>       at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
>       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
>       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
>       at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
>
>

Re: OutOfMemoryError when indexing into Solr

Posted by Markus Jelsma <ma...@openindex.io>.
Thanks. We should decrease the default setting for commit.size.

> Confirming that this worked. Also, the times look interesting: sending 73K
> documents in 1000-doc batches (the default) took 16 minutes; sending 73K
> documents in 100-doc batches took 15 minutes 24 seconds.
> 
> Regards,
> 
> Arkadi
> 
> > -----Original Message-----
> > From: Arkadi.Kosmynin@csiro.au [mailto:Arkadi.Kosmynin@csiro.au]
> > Sent: Friday, 28 October 2011 12:11 PM
> > To: user@nutch.apache.org; markus.jelsma@openindex.io
> > Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
> > 
> > Hi Markus,
> > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > > Sent: Thursday, 27 October 2011 11:33 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: OutOfMemoryError when indexing into Solr
> > > 
> > > Interesting, how many records and how large are your records?
> > 
> > There are a bit more than 80,000 documents.
> > 
> > <property>
> > 
> >       <name>http.content.limit</name> <value>150000000</value>
> > 
> > </property>
> > 
> > <property>
> > 
> >    <name>indexer.max.tokens</name><value>100000</value>
> > 
> > </property>
> > 
> > > How did you increase JVM heap size?
> > 
> > opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"
> > 
> > > Do you have custom indexing filters?
> > 
> > Yes. They add a few fields to each document. These fields are small, within
> > a hundred bytes or so per document.
> > 
> > > Can you decrease the commit.size?
> > 
> > Yes, thank you, good idea. I did not even consider it because, for whatever
> > reason, this option was not in my nutch-default.xml. I've set it to 100. I
> > hope a Solr commit is not issued after each batch is sent; that would have
> > a very negative impact on performance, because Solr commits are very
> > expensive.
> > 
> > > Do you also index large amounts of anchors (without deduplication) and
> > > pass in a very large linkdb?
> > 
> > I do index anchors, but I don't think there is anything extraordinary about
> > them. Since I only index fewer than 100K pages, my linkdb should not be
> > nearly as large as in cases where people index millions of documents.
> > 
> > > The reducer of IndexerMapReduce is a notorious RAM consumer.
> > 
> > If reducing solr.commit.size helps, it would make sense to decrease the
> > default value. Sending smaller batches of documents to Solr, without
> > committing after each one, is not expensive enough to justify risking
> > memory problems.
> > 
> > Thanks again.
> > 
> > Regards,
> > 
> > Arkadi
> > 
> > > On Thursday 27 October 2011 05:54:54 Arkadi.Kosmynin@csiro.au wrote:
> > > > Hi,
> > > > 
> > > > I am working with a Nutch 1.4 snapshot and having a very strange problem
> > > > that makes the system run out of memory when indexing into Solr. This does
> > > > not look like a trivial lack of memory problem that can be solved by
> > > > giving more memory to the JVM. I've increased the max memory size from 2Gb
> > > > to 3Gb, then to 6Gb, but this did not make any difference.
> > > > 
> > > > A log extract is included below.
> > > > 
> > > > Would anyone have any idea of how to fix this problem?
> > > > 
> > > > Thanks,
> > > > 
> > > > Arkadi
> > > > 
> > > > 
> > > > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> > > > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> > > > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> > > > java.lang.OutOfMemoryError: Java heap space
> > > > 
> > > >        at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > > >        at java.lang.String.<init>(String.java:215)
> > > >        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > > >        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > > >        at org.apache.hadoop.io.Text.decode(Text.java:350)
> > > >        at org.apache.hadoop.io.Text.decode(Text.java:322)
> > > >        at org.apache.hadoop.io.Text.readString(Text.java:403)
> > > >        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > > >        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> > > >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> > > >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> > > >        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> > > >        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > > >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> > > >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> > > >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> > > >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> > > >        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > > >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > > >        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> > > 
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350

RE: OutOfMemoryError when indexing into Solr

Posted by Ar...@csiro.au.
Confirming that this worked. Also, the times look interesting: sending 73K documents in 1000-doc batches (the default) took 16 minutes; sending 73K documents in 100-doc batches took 15 minutes 24 seconds.

Regards,

Arkadi

> -----Original Message-----
> From: Arkadi.Kosmynin@csiro.au [mailto:Arkadi.Kosmynin@csiro.au]
> Sent: Friday, 28 October 2011 12:11 PM
> To: user@nutch.apache.org; markus.jelsma@openindex.io
> Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
> 
> Hi Markus,
> 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > Sent: Thursday, 27 October 2011 11:33 PM
> > To: user@nutch.apache.org
> > Subject: Re: OutOfMemoryError when indexing into Solr
> >
> > Interesting, how many records and how large are your records?
> 
> There are a bit more than 80,000 documents.
> 
> <property>
>       <name>http.content.limit</name> <value>150000000</value>
> </property>
> 
> <property>
>    <name>indexer.max.tokens</name><value>100000</value>
> </property>
> 
> > How did you increase JVM heap size?
> 
> opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"
> 
> > Do you have custom indexing filters?
> 
> Yes. They add a few fields to each document. These fields are small, within
> a hundred bytes or so per document.
> 
> > Can you decrease the commit.size?
> 
> Yes, thank you, good idea. I did not even consider it because, for whatever
> reason, this option was not in my nutch-default.xml. I've set it to 100. I
> hope a Solr commit is not issued after each batch is sent; that would have a
> very negative impact on performance, because Solr commits are very expensive.
> 
> 
> > Do you also index large amounts of anchors (without deduplication) and pass
> > in a very large linkdb?
> 
> I do index anchors, but I don't think there is anything extraordinary about
> them. Since I only index fewer than 100K pages, my linkdb should not be
> nearly as large as in cases where people index millions of documents.
> 
> > The reducer of IndexerMapReduce is a notorious RAM consumer.
> 
> If reducing solr.commit.size helps, it would make sense to decrease the
> default value. Sending smaller batches of documents to Solr, without
> committing after each one, is not expensive enough to justify risking memory
> problems.
> 
> Thanks again.
> 
> Regards,
> 
> Arkadi
> 
> 
> >
> > On Thursday 27 October 2011 05:54:54 Arkadi.Kosmynin@csiro.au wrote:
> > > Hi,
> > >
> > > I am working with a Nutch 1.4 snapshot and having a very strange problem
> > > that makes the system run out of memory when indexing into Solr. This does
> > > not look like a trivial lack of memory problem that can be solved by
> > > giving more memory to the JVM. I've increased the max memory size from 2Gb
> > > to 3Gb, then to 6Gb, but this did not make any difference.
> > >
> > > A log extract is included below.
> > >
> > > Would anyone have any idea of how to fix this problem?
> > >
> > > Thanks,
> > >
> > > Arkadi
> > >
> > >
> > > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> > > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> > > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> > > java.lang.OutOfMemoryError: Java heap space
> > >        at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > >        at java.lang.String.<init>(String.java:215)
> > >        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > >        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > >        at org.apache.hadoop.io.Text.decode(Text.java:350)
> > >        at org.apache.hadoop.io.Text.decode(Text.java:322)
> > >        at org.apache.hadoop.io.Text.readString(Text.java:403)
> > >        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > >        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> > >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> > >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> > >        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> > >        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> > >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> > >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> > >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> > >        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > >        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

RE: OutOfMemoryError when indexing into Solr

Posted by Ar...@csiro.au.
Hi Markus,

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Thursday, 27 October 2011 11:33 PM
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError when indexing into Solr
> 
> Interesting, how many records and how large are your records?

There are a bit more than 80,000 documents.

<property>
      <name>http.content.limit</name> <value>150000000</value>
</property>

<property>
   <name>indexer.max.tokens</name><value>100000</value> 
</property>

> How did you increase JVM heap size?

opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

> Do you have custom indexing filters?

Yes. They add a few fields to each document. These fields are small, within a hundred bytes or so per document.

> Can you decrease the commit.size?

Yes, thank you, good idea. I did not even consider it because, for whatever reason, this option was not in my nutch-default.xml. I've set it to 100. I hope a Solr commit is not issued after each batch is sent; that would have a very negative impact on performance, because Solr commits are very expensive.
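For reference, the nutch-site.xml entry should look something like the snippet below; solr.commit.size is the property discussed in this thread (how many documents SolrWriter buffers per update request), and 100 is simply the value I chose, not a recommendation.

<property>
   <name>solr.commit.size</name>
   <value>100</value>
</property>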
 

> Do you also index large amounts of anchors (without deduplication) and pass in a very large linkdb?

I do index anchors, but I don't think there is anything extraordinary about them. Since I only index fewer than 100K pages, my linkdb should not be nearly as large as in cases where people index millions of documents.
 
> The reducer of IndexerMapReduce is a notorious RAM consumer.

If reducing solr.commit.size helps, it would make sense to decrease the default value. Sending smaller batches of documents to Solr, without committing after each one, is not expensive enough to justify risking memory problems.
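To illustrate the distinction, here is a rough stand-alone SolrJ sketch, not Nutch's actual SolrWriter: the URL, field names and the BATCH_SIZE constant are made up for the example. Documents are sent in small batches without committing, and a single commit is issued at the end.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexSketch {
    // Analogous to solr.commit.size: documents buffered per update request.
    private static final int BATCH_SIZE = 100;

    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (int i = 0; i < 73000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("content", "...");
            batch.add(doc);
            if (batch.size() == BATCH_SIZE) {
                solr.add(batch);   // cheap: just an update request, no commit
                batch.clear();     // frees the buffered documents
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);       // flush the remainder
        }
        solr.commit();             // expensive: done once at the end, not per batch
    }
}

With batches of 100 instead of 1000, the client holds roughly a tenth of the documents in memory at any moment, while the single commit keeps the expensive part constant.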

Thanks again.

Regards,

Arkadi


> 
> On Thursday 27 October 2011 05:54:54 Arkadi.Kosmynin@csiro.au wrote:
> > Hi,
> >
> > I am working with a Nutch 1.4 snapshot and having a very strange problem
> > that makes the system run out of memory when indexing into Solr. This does
> > not look like a trivial lack of memory problem that can be solved by
> > giving more memory to the JVM. I've increased the max memory size from 2Gb
> > to 3Gb, then to 6Gb, but this did not make any difference.
> >
> > A log extract is included below.
> >
> > Would anyone have any idea of how to fix this problem?
> >
> > Thanks,
> >
> > Arkadi
> >
> >
> > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> > java.lang.OutOfMemoryError: Java heap space
> >        at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >        at java.lang.String.<init>(String.java:215)
> >        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> >        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> >        at org.apache.hadoop.io.Text.decode(Text.java:350)
> >        at org.apache.hadoop.io.Text.decode(Text.java:322)
> >        at org.apache.hadoop.io.Text.readString(Text.java:403)
> >        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> >        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> >        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> >        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> >        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> 
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

Re: OutOfMemoryError when indexing into Solr

Posted by Markus Jelsma <ma...@openindex.io>.
Interesting, how many records and how large are your records? How did you 
increase JVM heap size? Do you have custom indexing filters? Can you decrease 
the commit.size? Do you also index large amounts of anchors (without 
deduplication) and pass in a very large linkdb?

The reducer of IndexerMapReduce is a notorious RAM consumer.

On Thursday 27 October 2011 05:54:54 Arkadi.Kosmynin@csiro.au wrote:
> Hi,
> 
> I am working with a Nutch 1.4 snapshot and having a very strange problem
> that makes the system run out of memory when indexing into Solr. This does
> not look like a trivial lack of memory problem that can be solved by
> giving more memory to the JVM. I've increased the max memory size from 2Gb
> to 3Gb, then to 6Gb, but this did not make any difference.
> 
> A log extract is included below.
> 
> Would anyone have any idea of how to fix this problem?
> 
> Thanks,
> 
> Arkadi
> 
> 
> 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> java.lang.OutOfMemoryError: Java heap space
>        at java.util.Arrays.copyOfRange(Arrays.java:3209)
>        at java.lang.String.<init>(String.java:215)
>        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
>        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
>        at org.apache.hadoop.io.Text.decode(Text.java:350)
>        at org.apache.hadoop.io.Text.decode(Text.java:322)
>        at org.apache.hadoop.io.Text.readString(Text.java:403)
>        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
>        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
>        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
>        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
>        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
>        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
>        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
>        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350