Posted to user@nutch.apache.org by Niels Boldt <ni...@gmail.com> on 2009/12/24 16:08:55 UTC

Memory Exception

Hi,

We are using Nutch to crawl some sites quite extensively, but we are having
problems with memory consumption. The server it is deployed on has about
6 GB of memory available, but after the crawl job has been running for
approximately 24 hours it exits, complaining that there is no heap space
left, i.e. with an OutOfMemory exception.

We are also quite new to Nutch, so I'm wondering whether our configuration is
simply too small, i.e. whether we should add more memory and this behaviour
is normal when Nutch runs under such memory conditions.

Or is there some way we could configure Nutch to run better? We run with
pretty much the default configuration, except that we always fetch the
entire page instead of only the first 64 KB. Could that cause any problems?
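
In case it matters, fetching the entire page just means overriding Nutch's
default 64 KB content limit, i.e. something like this in nutch-site.xml
(the value shown is only an illustration):

  <property>
    <name>http.content.limit</name>
    <!-- default is 65536 bytes; a negative value disables truncation -->
    <value>-1</value>
  </property>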

Any hints or suggestions would be appreciated.

Best Regards
Niels

-- 
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4528138757
web: http://www.binaryconstructors.dk
mail: nb@binaryconstructors.dk
skype: nielsboldt

Re: Memory Exception

Posted by Niels Boldt <ni...@gmail.com>.
Julien,

Thanks for your answer, and sorry for the slow reply.

I'll check up on the noParse option, thanks for the hint.

I'm running an absolutely simple setup with only one node; everything is
running on the same server.

I'm not sure about "-D mapred.child.java.opts"; I will check up on this as
well.

The problem seemed to go away when we decreased db.max.inlinks significantly,
from 1000 to 10, and we haven't experienced it since. We have of course lost
the actual stack trace, but as I remember it, we got a heap space error while
modifying a concurrent map.
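
For the record, the change amounts to this property in nutch-site.xml (just a
sketch, but the name and values are the ones we used):

  <property>
    <name>db.max.inlinks</name>
    <!-- was 1000 before; caps the inlinks kept per URL in the linkdb -->
    <value>10</value>
  </property>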

Thanks
Niels




-- 
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4528138757
web: http://www.binaryconstructors.dk
mail: nb@binaryconstructors.dk
skype: nielsboldt

Re: Memory Exception

Posted by Julien Nioche <li...@gmail.com>.
Hi Niels,

Do you parse at the same time as you fetch, or do you specify noParse while
fetching? Try separating these two steps: if the parsing fails, at least you
won't have to refetch every time.
Since you don't limit the size of the documents, I suspect that this causes
the parser to run out of memory. Could you give us more details on your
config (number of nodes, pseudo- or fully-distributed mode, etc.) and check
that you specified the memory to be used by the tasks properly, i.e. via
"-D mapred.child.java.opts"?
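
Roughly like this, with a placeholder segment path and heap size, and assuming
your Nutch version accepts Hadoop's generic -D options on the command line:

  # fetch without parsing, then parse the same segment as a separate job
  bin/nutch fetch crawl/segments/20091224000000 -noParsing

  # give the child JVMs more heap for the parse job; alternatively set
  # mapred.child.java.opts in your hadoop-site.xml / mapred-site.xml
  bin/nutch parse -Dmapred.child.java.opts=-Xmx2000m crawl/segments/20091224000000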

The best way to avoid the problem would be to set a limit on the document
size. Alternatively, you can also use Hadoop's skipRecordsOptions (e.g. "-D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1") while parsing to simply skip the
documents which are causing problems.
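
For example (again with a placeholder segment path):

  # after 2 failed task attempts, start skipping; skip at most 1 record
  # around each bad record
  bin/nutch parse \
    -Dmapred.skip.attempts.to.start.skipping=2 \
    -Dmapred.skip.map.max.skip.records=1 \
    crawl/segments/20091224000000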

Best,

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

