Posted to user@nutch.apache.org by Tomislav Poljak <tp...@gmail.com> on 2007/09/10 16:34:01 UTC

OutOfMemoryError while fetching

Hi,
so I have dedicated 1000 Mb (-Xmx1000m) to Nutch java process when
fetching (default settings). When using 10 threads I can fetch 25000
urls, but when using 20 threads fetcher fails with:
java.lang.OutOfMemoryError: Java heap space even when fetching 15000 url
fetchlist. Is 20 threads too much for -Xmx1000m or is something else
wrong? What would be recommended settings (number of threads, how much
RAM is needed) for fetching a list of 100k urls (with best performance)?

Thanks,
	Tomislav


Re: OutOfMemoryError while fetching

Posted by Karsten Dello <ka...@web.de>.
Hi Tomislav,

2007/9/11, Doğacan Güney <do...@gmail.com>:
> Has your fetch been going on for a long time? Nutch can leak some
> plugins and classes in local mode. But it only becomes a problem if
> you have too many maps (because each new map task loads new classes
> without, it seems, unloading older ones.)

One comment:
Increasing the perm size might help. This is _not_ done with the -Xmx
parameter but with the -XX:MaxPermSize parameter.
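As a sketch, assuming jobs are launched through the stock bin/nutch script (which, in this era of Nutch, appends a NUTCH_OPTS environment variable to the java command line), the flag could be passed like this:

```shell
# -XX:MaxPermSize raises the permanent-generation ceiling on Sun JDK 6
# (the flag was removed in Java 8). NUTCH_OPTS and the segment path are
# assumptions about the local setup, not values from this thread.
export NUTCH_OPTS="-XX:MaxPermSize=256m"
bin/nutch fetch crawl/segments/20070909010000 -threads 20
```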

Anyway, when parsing huge segments in local mode this does not help. A
memory leak is a serious bug; there is no fix afaik, but there is a
workaround: reduce the number of urls per segment (use -topN with
generate). If you run into trouble while fetching, you will most
probably not be able to parse the segment afterwards.
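The workaround amounts to capping the fetchlist size at generate time. A sketch, where the paths and the cap are illustrative values rather than recommendations:

```shell
# Generate a segment of at most 10000 top-scoring urls, so that
# fetching and parsing it stays within the configured heap.
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
```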

Good luck!

Karsten

Re: OutOfMemoryError while fetching

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 9/11/07, Tomislav Poljak <tp...@gmail.com> wrote:
> Hi Andrzej,
> I am running fetcher in non-parsing mode, I have this in nutch-site.xml:
>
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content.</description>
> </property>
>
> Maybe I didn't post a question correctly. I get a couple of fetcher
> threads  failing with java.lang.OutOfMemoryError like this (from
> hadoop.log):
>
> 2007-09-09 01:07:24,150 INFO  fetcher.Fetcher - fetching
> http://scholar.google.com/intl/en/scholar/libraries.html
> 2007-09-09 01:07:27,084 INFO  fetcher.Fetcher - fetch of
> http://logging.apache.org/log4j/1.2/faq.html failed with:
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:27,085 INFO  fetcher.Fetcher - fetching
> http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html
> 2007-09-09 01:07:32,151 INFO  fetcher.Fetcher - fetch of
> http://hockey.fantasysports.yahoo.com/hockey/register/createjoin failed
> with: java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:32,817 INFO  fetcher.Fetcher - fetch of
> http://scholar.google.com/intl/en/scholar/libraries.html failed with:
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:32,817 FATAL fetcher.Fetcher -
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher
> caught:java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:33,380 FATAL fetcher.Fetcher -
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher
> caught:java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:37,865 INFO  fetcher.Fetcher - fetch of
> http://cn.yahoo.com/allservice/index.html failed with:
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:38,019 FATAL fetcher.Fetcher -
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:38,020 FATAL fetcher.Fetcher - fetcher
> caught:java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:42,887 INFO  fetcher.Fetcher - fetch of
> http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html failed with:
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:43,045 FATAL fetcher.Fetcher -
> java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - fetcher
> caught:java.lang.OutOfMemoryError: Java heap space
>
> Any ideas why?

Has your fetch been going on for a long time? Nutch can leak some
plugins and classes in local mode. But it only becomes a problem if
you have too many maps (because each new map task loads new classes
without, it seems, unloading older ones.)

Related issue: NUTCH-356

>
> Thanks,
>       Tomislav
>
>
> On Mon, 2007-09-10 at 21:30 +0200, Andrzej Bialecki wrote:
> > Tomislav Poljak wrote:
> > > Hi,
> > > so I have dedicated 1000 Mb (-Xmx1000m) to Nutch java process when
> > > fetching (default settings). When using 10 threads I can fetch 25000
> > > urls, but when using 20 threads fetcher fails with:
> > > java.lang.OutOfMemoryError: Java heap space even when fetching 15000 url
> > > fetchlist. Is 20 threads too much for -Xmx1000m or is something else
> > > wrong? What would be recommended settings (number of threads, how much
> > > RAM is needed) for fetching a list of 100k urls (with best performance)?
> >
> > I routinely run crawls with 100 threads or more. If you're using the
> > fetcher in parsing mode (i.e. it not only fetches but also parses the
> > content) then your problem is likely related to the memory consumption
> > of a parsing plugin (such as PDF or MS Office parsers).
> >
> > I suggest to run the fetcher in non-parsing mode (-noParsing cmd-line
> > option), and then parsing the segment in a separate step (bin/nutch parse).
> >
> >
>
>


-- 
Doğacan Güney

Re: OutOfMemoryError while fetching

Posted by Tomislav Poljak <tp...@gmail.com>.
Hi Andrzej,
when I ps the Nutch java process (while fetching) I get:

25276 pts/0    Sl+    0:07 /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
-Dhadoop.log.dir=/home/nutch/test/trunk/logs
-Dhadoop.log.file=hadoop.log -Djava.library.path=/home/nutch...

so this should mean that I have dedicated 1 GB to this java process
(these are the default settings, I didn't change them), right?

I am running nutch-1.0-dev (from trunk 2007-08-08) with Sun JDK 6 on
Ubuntu Feisty (dual-core AMD64, 2 GB RAM). Should I try with nutch-0.9?

Thanks,
    Tomislav
 


On Tue, 2007-09-11 at 11:32 +0200, Andrzej Bialecki wrote:
> Tomislav Poljak wrote:
> > Hi Andrzej,
> > I am running fetcher in non-parsing mode, I have this in nutch-site.xml:
> > 
> > <property>
> >   <name>fetcher.parse</name>
> >   <value>false</value>
> >   <description>If true, fetcher will parse content.</description>
> > </property>
> > 
> > Maybe I didn't post a question correctly. I get a couple of fetcher
> > threads  failing with java.lang.OutOfMemoryError like this (from
> > hadoop.log): 
> > 
> > 2007-09-09 01:07:24,150 INFO  fetcher.Fetcher - fetching
> > http://scholar.google.com/intl/en/scholar/libraries.html
> > 2007-09-09 01:07:27,084 INFO  fetcher.Fetcher - fetch of
> 
> Sure looks strange to me. And you are sure that your Java heap size is 
> set to the right amount? You didn't lose any 'm or 'g suffix in -Xmx ?
> 
> ;) just checking ...
> 
> 


Re: OutOfMemoryError while fetching

Posted by Andrzej Bialecki <ab...@getopt.org>.
Tomislav Poljak wrote:
> Hi Andrzej,
> I am running fetcher in non-parsing mode, I have this in nutch-site.xml:
> 
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content.</description>
> </property>
> 
> Maybe I didn't post a question correctly. I get a couple of fetcher
> threads  failing with java.lang.OutOfMemoryError like this (from
> hadoop.log): 
> 
> 2007-09-09 01:07:24,150 INFO  fetcher.Fetcher - fetching
> http://scholar.google.com/intl/en/scholar/libraries.html
> 2007-09-09 01:07:27,084 INFO  fetcher.Fetcher - fetch of

Sure looks strange to me. And you are sure that your Java heap size is 
set to the right amount? You didn't lose any 'm or 'g suffix in -Xmx ?

;) just checking ...
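One way to double-check the effective limit of a running JVM, assuming the Sun JDK 6 tools are on the PATH (the class name passed to pgrep is a guess at how bin/nutch names the fetcher process):

```shell
# Locate the fetcher JVM and print its configured maximum heap size.
pid=$(pgrep -f org.apache.nutch.fetcher.Fetcher)
jmap -heap "$pid" | grep MaxHeapSize
```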


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: OutOfMemoryError while fetching

Posted by Tomislav Poljak <tp...@gmail.com>.
Hi Andrzej,
I am running fetcher in non-parsing mode, I have this in nutch-site.xml:

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content.</description>
</property>

Maybe I didn't phrase my question correctly. I get a couple of fetcher
threads failing with java.lang.OutOfMemoryError like this (from
hadoop.log):

2007-09-09 01:07:24,150 INFO  fetcher.Fetcher - fetching
http://scholar.google.com/intl/en/scholar/libraries.html
2007-09-09 01:07:27,084 INFO  fetcher.Fetcher - fetch of
http://logging.apache.org/log4j/1.2/faq.html failed with:
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:27,085 INFO  fetcher.Fetcher - fetching
http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html
2007-09-09 01:07:32,151 INFO  fetcher.Fetcher - fetch of
http://hockey.fantasysports.yahoo.com/hockey/register/createjoin failed
with: java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:32,817 INFO  fetcher.Fetcher - fetch of
http://scholar.google.com/intl/en/scholar/libraries.html failed with:
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:32,817 FATAL fetcher.Fetcher -
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher
caught:java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:33,380 FATAL fetcher.Fetcher -
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher
caught:java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:37,865 INFO  fetcher.Fetcher - fetch of
http://cn.yahoo.com/allservice/index.html failed with:
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:38,019 FATAL fetcher.Fetcher -
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:38,020 FATAL fetcher.Fetcher - fetcher
caught:java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:42,887 INFO  fetcher.Fetcher - fetch of
http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html failed with:
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:43,045 FATAL fetcher.Fetcher -
java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - fetcher
caught:java.lang.OutOfMemoryError: Java heap space

Any ideas why?

Thanks,
      Tomislav


On Mon, 2007-09-10 at 21:30 +0200, Andrzej Bialecki wrote:
> Tomislav Poljak wrote:
> > Hi,
> > so I have dedicated 1000 Mb (-Xmx1000m) to Nutch java process when
> > fetching (default settings). When using 10 threads I can fetch 25000
> > urls, but when using 20 threads fetcher fails with:
> > java.lang.OutOfMemoryError: Java heap space even when fetching 15000 url
> > fetchlist. Is 20 threads too much for -Xmx1000m or is something else
> > wrong? What would be recommended settings (number of threads, how much
> > RAM is needed) for fetching a list of 100k urls (with best performance)?
> 
> I routinely run crawls with 100 threads or more. If you're using the 
> fetcher in parsing mode (i.e. it not only fetches but also parses the 
> content) then your problem is likely related to the memory consumption 
> of a parsing plugin (such as PDF or MS Office parsers).
> 
> I suggest to run the fetcher in non-parsing mode (-noParsing cmd-line 
> option), and then parsing the segment in a separate step (bin/nutch parse).
> 
> 


Re: OutOfMemoryError while fetching

Posted by Andrzej Bialecki <ab...@getopt.org>.
Tomislav Poljak wrote:
> Hi,
> so I have dedicated 1000 Mb (-Xmx1000m) to Nutch java process when
> fetching (default settings). When using 10 threads I can fetch 25000
> urls, but when using 20 threads fetcher fails with:
> java.lang.OutOfMemoryError: Java heap space even when fetching 15000 url
> fetchlist. Is 20 threads too much for -Xmx1000m or is something else
> wrong? What would be recommended settings (number of threads, how much
> RAM is needed) for fetching a list of 100k urls (with best performance)?

I routinely run crawls with 100 threads or more. If you're using the 
fetcher in parsing mode (i.e. it not only fetches but also parses the 
content) then your problem is likely related to the memory consumption 
of a parsing plugin (such as PDF or MS Office parsers).

I suggest to run the fetcher in non-parsing mode (-noParsing cmd-line 
option), and then parsing the segment in a separate step (bin/nutch parse).
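The two-step approach could look like this; the segment path and thread count are placeholders, not values from this thread:

```shell
# Step 1: fetch only; -noParsing overrides fetcher.parse for this run.
bin/nutch fetch crawl/segments/20070909010000 -threads 100 -noParsing
# Step 2: parse the fetched segment separately, so a parser blow-up
# cannot take the fetch down with it.
bin/nutch parse crawl/segments/20070909010000
```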


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com