Posted to user@nutch.apache.org by axierr <ax...@gmail.com> on 2011/02/02 18:51:36 UTC

Nutch 1.2 performance and memory issues

Hi to all,
I'm testing Nutch 1.2 in pseudo-distributed and local mode. I have a
database with around 126M URLs. They are all injected and ready to fetch.
When generating segments, there is always first a phase of low and stable
memory usage, and near the end of the operation memory usage grows sharply.
I have doubts about what is normal here: how much memory does segment
generation for 126M URLs require? I have seen 7 GB of memory filled, and
then the JVM crashes with a GC overhead limit error, among others.
When I use topN 10000000 it works, but the memory consumption is very high
too.

I don't know if this is normal or not. I've been reading NUTCH-844 and
other memory problem reports, but I don't know if they apply to segment
generation. Maybe it is a problem with running in pseudo-distributed or
local mode, maybe it is a memory leak, or maybe it is normal.

By the way, how do you scale the generation of segments, database
updates, etc.?
Using crawldb updates and generating small segments?

Thanks in advance,

Re: Nutch 1.2 performance and memory issues

Posted by axierr <ax...@gmail.com>.
I finally solved my issue by mapping hosts into a SQLite database index and
doing SELECTs; it is a bit slower but uses far less memory.
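For readers hitting the same wall, here is a minimal sketch of that kind of
workaround: keeping the per-host counts in a SQLite table on disk instead of
an in-memory map. This is not the poster's actual code; the class, table,
and column names are invented, and it assumes the xerial sqlite-jdbc driver
is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch: a disk-backed per-host counter using SQLite.
public class HostCountStore implements AutoCloseable {
  private final Connection conn;
  private final PreparedStatement select;
  private final PreparedStatement upsert;

  public HostCountStore(String dbPath) throws SQLException {
    // Class.forName("org.sqlite.JDBC") may be needed with older drivers.
    conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
    try (Statement s = conn.createStatement()) {
      // The PRIMARY KEY provides the index the SELECTs run against.
      s.execute("CREATE TABLE IF NOT EXISTS host_count ("
          + "host TEXT PRIMARY KEY, cnt INTEGER NOT NULL)");
    }
    select = conn.prepareStatement("SELECT cnt FROM host_count WHERE host = ?");
    upsert = conn.prepareStatement(
        "INSERT OR REPLACE INTO host_count (host, cnt) VALUES (?, ?)");
  }

  // Increments and returns the count for a host. Memory use stays flat
  // because the counts live on disk, at the cost of a query per URL.
  public int increment(String host) throws SQLException {
    int cnt = 0;
    select.setString(1, host);
    try (ResultSet rs = select.executeQuery()) {
      if (rs.next()) cnt = rs.getInt(1);
    }
    upsert.setString(1, host);
    upsert.setInt(2, cnt + 1);
    upsert.executeUpdate();
    return cnt + 1;
  }

  public void close() throws SQLException {
    select.close();
    upsert.close();
    conn.close();
  }
}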

Thanks, 

Re: Nutch 1.2 performance and memory issues

Posted by Julien Nioche <li...@gmail.com>.
That's indeed the map I was initially referring to. Since you have pretty
much 126M unique hosts, it is no wonder it takes a substantial amount of
memory. This is an extreme case, especially given that you do it on a
single machine. The best solution would be to NOT count per host (since you
know the hosts are unique) or, even better, to start using more than one
machine.
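(For a rough sense of scale, assuming on the order of 100 bytes per entry
for a short hostname String plus HashMap entry and boxed-counter overhead,
which is a back-of-envelope figure rather than a measurement: 126,000,000
entries x ~100 bytes is roughly 12-13 GB, so a heap blowing past 7 GB
before hitting the GC overhead limit is about what one would expect.)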

Julien

On 3 February 2011 18:38, axierr <ax...@gmail.com> wrote:

>
> Well, I think I found the problem by running simple tests.
> With 1.3M distinct domains and generate.max.count enabled, memory grows
> to 1.6-1.7 GB.
> With 1.3M domains and generate.max.count disabled, it stays around
> 300-400 MB.
> 126M different hosts are simply too much for the hash of hosts...
> I'm going to review that code
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Nutch 1.2 performance and memory issues

Posted by axierr <ax...@gmail.com>.
Well, I think I found the problem by running simple tests.
With 1.3M distinct domains and generate.max.count enabled, memory grows to
1.6-1.7 GB.
With 1.3M domains and generate.max.count disabled, it stays around
300-400 MB.
126M different hosts are simply too much for the hash of hosts...
I'm going to review that code

Re: Nutch 1.2 performance and memory issues

Posted by Julien Nioche <li...@gmail.com>.
Is that roughly when the memory goes out of control? It could be a dodgy
URL putting the URL normalisation into a spin: one gets all sorts of
horrors after a while.

Maybe try using '-noNorm' on the generation and see if that has any impact.
It would also be good to know in which job and which map/reduce stage the
issue is happening; you can use the Hadoop JobTracker GUI in
pseudo-distributed mode to see that.
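For instance, something like this (paths illustrative, using the standard
generate arguments):

bin/nutch generate crawl/crawldb crawl/segments -noNorm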

Thanks

Julien



On 3 February 2011 00:28, axierr <ax...@gmail.com> wrote:

>
> [generator output and full jstack snipped; see the full message below in
> the thread. The relevant reduce-thread frames were:]
>
> "Thread-13" prio=10 tid=0xb3b27800 nid=0x6fd9 runnable [0xb3cfe000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.ArrayList.size(ArrayList.java:177)
>         at java.util.AbstractList$Itr.hasNext(AbstractList.java:339)
>         at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:168)
>         at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:179)
>         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
>         at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:244)
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Nutch 1.2 performance and memory issues

Posted by axierr <ax...@gmail.com>.

Here are the results. Next I'm going to run it without URL partitioning:

nutch generator output - 
Generator: starting at 2011-02-02 20:11:00
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.

jstack output -
Full thread dump Java HotSpot(TM) Client VM (17.1-b03 mixed mode, sharing):

"communication thread" daemon prio=10 tid=0x0a104800 nid=0x637 runnable
[0xb3cad000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Object.getClass(Native Method)
        at java.util.ArrayList.<init>(ArrayList.java:134)
        at
org.apache.hadoop.fs.FileSystem.getAllStatistics(FileSystem.java:1567)
        - locked <0x8ef584c8> (a java.lang.Class for
org.apache.hadoop.fs.FileSystem)
        at org.apache.hadoop.mapred.Task.updateCounters(Task.java:652)
        - locked <0x66a3d020> (a org.apache.hadoop.mapred.ReduceTask)
        at org.apache.hadoop.mapred.Task.access$600(Task.java:56)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:539)
        at java.lang.Thread.run(Thread.java:662)

"Attach Listener" daemon prio=10 tid=0x0a0a9800 nid=0x7e31 waiting on
condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Thread-13" prio=10 tid=0xb3b27800 nid=0x6fd9 runnable [0xb3cfe000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList.size(ArrayList.java:177)
        at java.util.AbstractList$Itr.hasNext(AbstractList.java:339)
        at
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:168)
        - locked <0x66a48778> (a
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer)
        at
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:179)
        - locked <0x66a48778> (a
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer)
        at
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
        at
org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:244)
        at
org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
        at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

"Low Memory Detector" daemon prio=10 tid=0x09ec8800 nid=0x6fb3 runnable
[0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x09ec6800 nid=0x6fb2 waiting on
condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x09ec4c00 nid=0x6fb1 runnable
[0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x09ec0800 nid=0x6fb0 in Object.wait()
[0xb46cc000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x65410258> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked <0x65410258> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x09ebbc00 nid=0x6faf in
Object.wait() [0xb471d000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x654102e8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x654102e8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x09e97000 nid=0x6fad runnable [0xb6c97000]
   java.lang.Thread.State: RUNNABLE
        at
java.text.DecimalFormatSymbols.initialize(DecimalFormatSymbols.java:509)
        at
java.text.DecimalFormatSymbols.<init>(DecimalFormatSymbols.java:77)
        at java.text.DecimalFormat.<init>(DecimalFormat.java:416)
        at
org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:113)
        at
org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1283)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:526)
        at org.apache.nutch.crawl.Generator.run(Generator.java:692)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Generator.main(Generator.java:648)

"VM Thread" prio=10 tid=0x09eba400 nid=0x6fae runnable

"VM Periodic Task Thread" prio=10 tid=0x09ecac00 nid=0x6fb4 waiting on
condition

JNI global references: 1419

Re: Nutch 1.2 performance and memory issues

Posted by axierr <ax...@gmail.com>.
I have taken multiple jstacks, and here is the one from when memory grows a lot:

"Thread-13" prio=6 tid=0x0000000009a8f000 nid=0x102c runnable
[0x000000000ba3f000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:260)
	at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:190)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
	- locked <0x0000000550c51320> (a java.io.BufferedOutputStream)
	at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	- locked <0x0000000550c512e0> (a org.apache.hadoop.fs.FSDataOutputStream)
	at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:354)
	at
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
	- locked <0x0000000550c51080> (a
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
	- locked <0x0000000550c51080> (a
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer)
	at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
	at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
	- locked <0x0000000550c51080> (a
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer)
	at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	- locked <0x0000000550c51040> (a org.apache.hadoop.fs.FSDataOutputStream)
	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1013)
	- locked <0x0000000550c50ff0> (a org.apache.hadoop.io.SequenceFile$Writer)
	at
org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
	at
org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:102)
	at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
	at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:290)
	at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

Re: Nutch 1.2 performance and memory issues

Posted by axierr <ax...@gmail.com>.
Well, I didn't express it correctly; here is a summary:

126M URLs with unique hosts (they all have different host names), and
using:
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property> 
and byHost (the default). I know this parameter will not take effect in
this generate round because all hosts are different, but it is the value I
want to use in the future.

My tests so far:
With the default values of the Nutch XML files it worked OK. That rules out
the regex normalizer and mangled URLs, right? (I didn't touch the regex XML
file in my previous configuration, only nutch-site.xml values.) Does it
still make sense to test -noNorm now?

Now I'm running the same test with only generate.max.count enabled on top
of the default values, to see if that fails.

Sorry for the late response, but I'm running tests in local mode and it's
slow; I don't want to introduce a new variable with the pseudo-distributed
setup. (Sometimes I get weird FS errors in that mode.)

Thank you for everything, Julien; this is a bit frustrating.



Re: Nutch 1.2 performance and memory issues

Posted by Julien Nioche <li...@gmail.com>.
> For the record, there are 126M URLs and they all come from unique hosts;
> no host is repeated here. I'm partitioning by host (in the first round it
> will have no effect, but it will in later rounds), partitioning 10 URLs
> per host.
> Could that be the problem?
>

No, limiting the number of URLs per host is what I was asking about, which
is not the same as partitioning by host. The latter is definitely not the
source of the problem.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Nutch 1.2 performance and memory issues

Posted by axierr <ax...@gmail.com>.
Thanks for your reply, Julien.
For the record, there are 126M URLs and they all come from unique hosts;
no host is repeated here. I'm partitioning by host (in the first round it
will have no effect, but it will in later rounds), partitioning 10 URLs per
host.
Could that be the problem?
In a moment I will run some tests again, try jstack, and find out whether
it is the partitioning or the selection.

Thanks for everything.



Re: Nutch 1.2 performance and memory issues

Posted by Julien Nioche <li...@gmail.com>.
Hi


I'm testing Nutch 1.2 in pseudo-distributed and local mode. I have a
> database with around 126M URLs. They are all injected and ready to
> fetch.
> When generating segments, there is always first a phase of low and stable
> memory usage, and near the end of the operation memory usage grows
> sharply.
>

The generation consists of two separate jobs (selection and partitioning).
Do you know which one is causing the issue? Is it during the map or the
reduce stage?

The only thing I can think of is the map holding the count of URLs per
host. Do you limit the number of URLs per host?
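For illustration only, this is roughly the shape of the per-host counting
that generate.max.count implies. It is a simplified sketch, not the actual
Nutch Generator source, and the names are invented:

import java.util.HashMap;
import java.util.Map;

// Simplified illustration of per-host quota counting (not Nutch code).
public class PerHostLimit {
  private final Map<String, Integer> hostCounts = new HashMap<String, Integer>();
  private final int maxCount;  // e.g. generate.max.count = 10

  public PerHostLimit(int maxCount) {
    this.maxCount = maxCount;
  }

  // Returns true if this URL's host is still under its quota.
  public boolean accept(String host) {
    Integer n = hostCounts.get(host);
    if (n == null) n = 0;
    if (n >= maxCount) return false;  // host already at its quota
    hostCounts.put(host, n + 1);      // one live map entry per distinct host
    return true;
  }
}

When every URL has a distinct host, accept() adds a new entry on every call
and nothing is ever evicted, so the reducer's heap grows linearly with the
number of URLs.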



> I have doubts about what is normal here: how much memory does segment
> generation for 126M URLs require? I have seen 7 GB of memory filled, and
> then the JVM crashes with a GC overhead limit error, among others.
> When I use topN 10000000 it works, but the memory consumption is very
> high too.


> I don't know if this is normal or not. I've been reading NUTCH-844 and
> other memory problem reports, but I don't know if they apply to segment
> generation. Maybe it is a problem with running in pseudo-distributed or
> local mode, maybe it is a memory leak, or maybe it is normal.
>

It is worth investigating. Could you call jstack on the process when the
memory starts to climb? That could give us an indication.
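(For example: jstack <pid> > gen-stack.txt, where <pid> is the process id
of the JVM running the generate job; the output file name is illustrative.)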


>
> By the way, how do you scale the generation of segments, database
> updates, etc.?
> Using crawldb updates and generating small segments?
>

The trouble with generating small segments is that when the crawldb gets
large, you spend most of the time generating and updating and
proportionally little time fetching and parsing. It is more efficient to
generate multiple segments at once (using -maxNumSegments), fetch and parse
each segment, then update the whole lot against the crawldb.
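As a hedged sketch of that cycle (not an official recipe: the paths, topN,
segment count, and thread count are illustrative, and it assumes the Nutch
1.2 tool classes accept their usual command-line arguments):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: one generate pass producing several segments, fetch and parse
// each segment, then a single updatedb over all of them.
public class MultiSegmentCycle {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String crawldb = "crawl/crawldb";
    String segmentsDir = "crawl/segments";  // assumed empty before the run

    // 1) generate up to 5 segments in a single pass
    ToolRunner.run(conf, new Generator(), new String[] {
        crawldb, segmentsDir, "-topN", "5000000", "-maxNumSegments", "5" });

    // 2) fetch and parse every generated segment
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus stat : fs.listStatus(new Path(segmentsDir))) {
      String segment = stat.getPath().toString();
      ToolRunner.run(conf, new Fetcher(),
          new String[] { segment, "-threads", "10", "-noParsing" });
      ToolRunner.run(conf, new ParseSegment(), new String[] { segment });
    }

    // 3) one crawldb update covering the whole lot
    ToolRunner.run(conf, new CrawlDb(), new String[] {
        crawldb, "-dir", segmentsDir });
  }
}

The same cycle can of course be driven from the bin/nutch script; the point
is simply that the expensive crawldb update runs once per batch of segments
rather than once per segment.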

The obvious way of scaling is, of course, to use more than one machine in
your cluster.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com