Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/07/16 22:53:55 UTC

java heap space problem when using the language identifier

I now have more details on my error.
What can I do about it? I have 4 GB of memory, but it is not fully used (I
think).
I use Cygwin/Windows with the local filesystem.

java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:498)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
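
A note on where this particular OutOfMemoryError comes from: the MapOutputBuffer
constructor is where Hadoop allocates the in-memory map-side sort buffer, whose
size is taken from io.sort.mb, and with LocalJobRunner the map task runs inside
the client JVM, so mapred.child.java.opts is not applied and only the heap of the
JVM launched by bin/nutch counts (if memory serves, bin/nutch honours a
NUTCH_HEAPSIZE environment variable for that). The snippet below is only a sketch
with a made-up class name, showing the two knobs through the Hadoop API; in
practice they would go into conf/hadoop-site.xml.

import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.util.NutchConfiguration;

public class HeapTuningSketch {
  public static void main(String[] args) {
    // Load the usual Nutch configuration (nutch-default.xml + nutch-site.xml).
    JobConf job = new JobConf(NutchConfiguration.create());

    // Size of the map-side sort buffer allocated in MapOutputBuffer.<init>;
    // it has to fit comfortably inside the heap of the JVM running the map task.
    job.setInt("io.sort.mb", 50);                  // Hadoop's default is 100 (MB)

    // Heap for spawned task JVMs. Note: LocalJobRunner runs tasks inside the
    // client JVM, so this setting is ignored in local mode.
    job.set("mapred.child.java.opts", "-Xmx512m");

    System.out.println("io.sort.mb = " + job.getInt("io.sort.mb", 100));
  }
}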

---------- Forwarded message ----------
From: MilleBii <mi...@gmail.com>
Date: 2009/7/15
Subject: Error when using the language-identifier plugin?
To: nutch-user@lucene.apache.org


I decided to add the language-identifier plugin... but I get the following
error when I start indexing my crawldb. It is not very informative.
If I remove the plugin, indexing works just fine. I also tried on a smaller
crawl database that I use for testing, and it works fine too.
Any idea where to look?


2009-07-15 16:19:54,875 WARN  mapred.LocalJobRunner - job_local_0001
2009-07-15 16:19:54,891 FATAL indexer.Indexer - Indexer:
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
    at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
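
For readers wondering what "adding the language-identifier plugin" amounts to:
it is normally enabled by adding it to the plugin.includes regular expression in
conf/nutch-site.xml. The snippet below is a hypothetical sketch that sets the
same property through the API just to make the mechanism explicit; the exact
regex depends on the rest of the configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class EnableLanguageIdentifierSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();

    // Equivalent of editing plugin.includes in conf/nutch-site.xml; only the
    // "language-identifier" alternative matters here, the rest is illustrative.
    conf.set("plugin.includes",
        "protocol-http|urlfilter-regex|parse-(text|html)|index-basic"
            + "|query-(basic|site|url)|language-identifier");

    System.out.println(conf.get("plugin.includes"));
  }
}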



-- 
-MilleBii-



-- 
-MilleBii-

Re: java heap space problem when using the language identifier

Posted by Doğacan Güney <do...@gmail.com>.
On Sat, Jul 18, 2009 at 00:02, MilleBii<mi...@gmail.com> wrote:
> Looks great: my indexing is now working, and I observe constant memory usage
> instead of the ever-growing slope. Thanks a lot. Why is this patch not in the
> standard build?
>

Because I never tested it very thoroughly, I never got around to committing
the patch. I will try to review it before 1.1 and hopefully include it in the
next release.

Anyway, I am glad it solves your problem.

> I just get a weird message in Ant/Eclipse:
>      [jar] Warning: skipping jar archive
> C:\xxx\workspace\nutch\build\nutch-extensionpoints\nutch-extensionpoints.jar
> because no files were included.
>      [jar] Building MANIFEST-only jar:
> C:\xxx\workspace\nutch\build\nutch-extensionpoints\nutch-extensionpoints.jar
> Not sure what that means.



-- 
Doğacan Güney

Re: java heap space problem when using the language identifier

Posted by MilleBii <mi...@gmail.com>.
Looks great: my indexing is now working, and I observe constant memory usage
instead of the ever-growing slope. Thanks a lot. Why is this patch not in the
standard build?

I just get a weird message in Ant/Eclipse:
      [jar] Warning: skipping jar archive
C:\xxx\workspace\nutch\build\nutch-extensionpoints\nutch-extensionpoints.jar
because no files were included.
      [jar] Building MANIFEST-only jar:
C:\xxx\workspace\nutch\build\nutch-extensionpoints\nutch-extensionpoints.jar
Not sure what that means.


2009/7/17 Doğacan Güney <do...@gmail.com>

> Can you try the patch at
>
> https://issues.apache.org/jira/browse/NUTCH-356
>
> (try cache_classes.patch)



-- 
-MilleBii-

Re: java heap space problem when using the language identifier

Posted by MilleBii <mi...@gmail.com>.
Actually, a question I had when looking at the logs: why are there so many
plugin loads? I don't quite follow the logic.
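
On the repeated plugin loading: as far as I can tell, every fresh Configuration
ends up with its own plugin repository and plugin classloaders, so when many
Configuration objects are created during a job the same plugin classes get
loaded again and again; caching what has already been loaded is, roughly, the
idea behind the cache_classes.patch on NUTCH-356. The sketch below only
illustrates that caching pattern with made-up names; it is not the actual patch.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingPluginLoaderSketch {

  // One entry per extension class name, shared across all configurations.
  private static final Map<String, Class<?>> CLASS_CACHE =
      new ConcurrentHashMap<String, Class<?>>();

  // Load a plugin class once through the plugin's classloader and reuse it
  // for every subsequent request instead of reloading it.
  public static Class<?> loadExtensionClass(ClassLoader pluginLoader,
                                            String className)
      throws ClassNotFoundException {
    Class<?> cached = CLASS_CACHE.get(className);
    if (cached == null) {
      cached = Class.forName(className, true, pluginLoader);
      CLASS_CACHE.put(className, cached);
    }
    return cached;
  }
}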



2009/7/17 MilleBii <mi...@gmail.com>

> I have never applied a patch so far... so I will do my best.



-- 
-MilleBii-

Re: java heap space problem when using the language identifier

Posted by MilleBii <mi...@gmail.com>.
I have never applied a patch so far... so I will do my best.



2009/7/17 Doğacan Güney <do...@gmail.com>

> Can you try the patch at
>
> https://issues.apache.org/jira/browse/NUTCH-356
>
> (try cache_classes.patch)



-- 
-MilleBii-

Re: java heap space problem when using the language identifier

Posted by Doğacan Güney <do...@gmail.com>.
On Fri, Jul 17, 2009 at 00:30, MilleBii<mi...@gmail.com> wrote:
> I am just trying to index a smaller segment of 300k URLs ... and the memory
> just keeps going up and up... but it does NOT hit the physical memory limit.
> Sounds like a "memory leak"?
> How come? I thought Java was doing garbage collection automatically.
>

Can you try the patch at

https://issues.apache.org/jira/browse/NUTCH-356

(try cache_classes.patch)




-- 
Doğacan Güney

Re: java heap space problem when using the language identifier

Posted by MilleBii <mi...@gmail.com>.
I am just trying to index a smaller segment of 300k URLs ... and the memory
just keeps going up and up... but it does NOT hit the physical memory limit.
Sounds like a "memory leak"?
How come? I thought Java was doing garbage collection automatically.
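
On the garbage collection point: the collector only reclaims objects that are no
longer reachable, so if something keeps references alive (a static cache, a
registry of classloaders, and so on) the heap climbs exactly like this even
though GC runs all the time. A tiny self-contained illustration, unrelated to
Nutch itself:

import java.util.ArrayList;
import java.util.List;

public class ReachableLeakDemo {

  // The static list keeps every byte[] reachable, so the collector can never
  // free them and heap usage only grows until an OutOfMemoryError is thrown.
  private static final List<byte[]> RETAINED = new ArrayList<byte[]>();

  public static void main(String[] args) {
    while (true) {
      RETAINED.add(new byte[1024 * 1024]);  // one more megabyte, never released
      System.out.println("retained MB: " + RETAINED.size());
    }
  }
}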





-- 
-MilleBii-