Posted to user@nutch.apache.org by Ken van Mulder <ke...@wavefire.com> on 2005/10/28 00:44:00 UTC

fetch questions - freezing

Hey folks,

I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 300k 
URL list.

Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher 
gets progressively slower though, dropping down to about ~15 pages/s 
after about 2-3 hours or so and continues to slow down. I've seen a few 
references on these lists to the issue, but I'm not clear on whether it's 
expected behaviour or if there's a solution to it. I've also noticed 
that the process takes up more and more memory as it runs; is this 
expected as well?

Also, I seem to have a problem with the fetcher hanging at a certain 
point. At about halfway through the list it will continue to run (chewing 
up CPU cycles) but with no output, stack traces or anything. The CPU 
usage will be near 100%, memory usage will have gotten pretty close to 
the box's limit, and it will sit there for hours. I'm trying to run it 
again with a profiler to see if I can figure out what it's doing.

Has anyone run into a similar problem?

-- 
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)

Re: fetch questions - freezing

Posted by Doug Cutting <cu...@nutch.org>.
Ken van Mulder wrote:
> As a side note, does anyone have any recommendations for profiling 
> software? I've used the standard hprof, which slows down the process to 
> much for my needs and jmp which seems pretty unstable.

I recommend 'kill -QUIT' as a poor man's profiler.  With a few stack 
dumps you can usually get a decent idea of where the time is going.  If 
you want to get fancy you can 'kill -QUIT' every minute or so, then use 
'sort | uniq -c | sort -rn' to see where you're spending a lot of time.
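Doug's pipeline can be sketched end to end. The frames.txt file and its contents below are invented for illustration; in practice the "at ..." lines come from the thread dumps that kill -QUIT writes to the fetcher's stdout/log:

```shell
#!/bin/sh
# Sketch of the poor man's profiler: collect the "at ..." frames from a
# series of kill -QUIT thread dumps, then count duplicates so the hottest
# frames float to the top. frames.txt is fabricated here; real dumps land
# in the JVM's stdout/log file.
cat > frames.txt <<'EOF'
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead0(Native Method)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:150)
EOF

# Identical frames across dumps mean threads keep sitting in that call.
sort frames.txt | uniq -c | sort -rn
```

With this fabricated input, the aggregation ranks socketRead0 first, i.e. most sampled threads were blocked on network reads.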

Doug

Re: fetch questions - freezing

Posted by Ken van Mulder <ke...@wavefire.com>.
P4 2.4/HTT, 512MB RAM, 10Mb pipe, running FreeBSD 7.0 is my current 
machine of choice. Without our custom plugins it's ~32 pages/s.

I've tried a number of different machines/configurations so far as well:

Celeron 800, 256MB with FreeBSD 6.0 - This one seemed to max out at 
about 4.5 - 5 pages/s, far less than I expected. From the profiling 
information it seemed to be CPU bound.

Dual 600, 1G with FreeBSD 6.0 - ~20 pages/s

P4 2.4/HTT, 1GB with FreeBSD 4.1 - This was the biggest surprise as it 
maxed out at about 4 - 5 pages/s.

Celeron 2.4, 256MB with FreeBSD 5.3 - ~30 pages/s.

All of them have been using the same pipe, I've tried a few DNS servers 
for each, and all have been using JDK 1.4.2.

Configurations ranged anywhere from 50 to 200 threads, with the results 
here being the optimal. The rest of the settings were default, or close 
to it. I may have tweaked max threads/host and max delays by a couple on 
one or two of the boxes. This resulted in fewer errors, but didn't affect 
the overall speed significantly.

 From the profiling information, the celeron 800 seemed to be CPU bound 
and that was its limiting factor. The box with FreeBSD 4.1, as far as I 
can tell, was slowed down by OS issues which seemed to have been fixed 
in 5. I believe that 5 has much improved threading capability and 
networking performance which seemed to make the difference.

As a side note, does anyone have any recommendations for profiling 
software? I've used the standard hprof, which slows down the process too 
much for my needs, and JMP, which seems pretty unstable.

-Ken

AJ Chen wrote:
> I noticed the same problem. My temporary solution is to fetch a smaller
> number of pages, say 200k, per cycle so that the slow-down in each cycle
> won't have too much impact.
> But I also run into another problem: the start-up download speed varies
> from run to run. Most of the time, it's running at a speed (1 page/s) that
> is much slower than my bandwidth (1.5 Mbps) allows. What bandwidth and
> hardware do you use?
> 
> AJ
> 
> On 10/27/05, Ken van Mulder <ke...@wavefire.com> wrote:
> 
>>Hey folks,
>>
>>I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 300k
>>URL list.
>>
>>Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher
>>gets progressively slower though, dropping down to about ~15 pages/s
>>after about 2-3 hours or so and continues to slow down. I've seen a few
>>references on these lists to the issue, but I'm not clear on whether it's
>>expected behaviour or if there's a solution to it. I've also noticed
>>that the process takes up more and more memory as it runs; is this
>>expected as well?
>>
>>Also, I seem to have a problem with the fetcher hanging at a certain
>>point. At about halfway through the list it will continue to run (chewing
>>up CPU cycles) but with no output, stack traces or anything. The CPU
>>usage will be near 100%, memory usage will have gotten pretty close to
>>the box's limit, and it will sit there for hours. I'm trying to run it
>>again with a profiler to see if I can figure out what it's doing.
>>
>>Has anyone run into a similar problem?
>>
>>--
>>Ken van Mulder
>>Wavefire Technologies Corporation
>>
>>http://www.wavefire.com
>>250.717.0200 (ext 113)
>>
> 
> 

-- 
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)

Re: fetch questions - freezing

Posted by AJ Chen <ca...@gmail.com>.
I noticed the same problem. My temporary solution is to fetch a smaller
number of pages, say 200k, per cycle so that the slow-down in each cycle
won't have too much impact.
But I also run into another problem: the start-up download speed varies
from run to run. Most of the time, it's running at a speed (1 page/s) that
is much slower than my bandwidth (1.5 Mbps) allows. What bandwidth and
hardware do you use?

AJ

On 10/27/05, Ken van Mulder <ke...@wavefire.com> wrote:
>
> Hey folks,
>
> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 300k
> URL list.
>
> Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher
> gets progressively slower though, dropping down to about ~15 pages/s
> after about 2-3 hours or so and continues to slow down. I've seen a few
> references on these lists to the issue, but I'm not clear on whether it's
> expected behaviour or if there's a solution to it. I've also noticed
> that the process takes up more and more memory as it runs; is this
> expected as well?
>
> Also, I seem to have a problem with the fetcher hanging at a certain
> point. At about halfway through the list it will continue to run (chewing
> up CPU cycles) but with no output, stack traces or anything. The CPU
> usage will be near 100%, memory usage will have gotten pretty close to
> the box's limit, and it will sit there for hours. I'm trying to run it
> again with a profiler to see if I can figure out what it's doing.
>
> Has anyone run into a similar problem?
>
> --
> Ken van Mulder
> Wavefire Technologies Corporation
>
> http://www.wavefire.com
> 250.717.0200 (ext 113)
>

Re: fetch questions - freezing

Posted by Ken Krugler <kk...@krugle.net>.
>Ken Krugler wrote:
>>We're only using the html & text parsers, so I don't think that's 
>>the problem. Plus we're dumping the thread stack when it hangs, and 
>>it's always in ChunkedInputStream.exhaustInputStream() 
>>(see trace below).
>
>The trace did not make it.

Oops - see at the end of this email.

>Have you tried protocol-http instead of protocol-httpclient?

No, not yet. Andrzej also suggested this.

>Is it any better?

I'll give it a try & report back.

>What JVM are you running?

Java version "1.4.2_09"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_09-b05)
Java HotSpot(TM) Client VM (build 1.4.2_09-b05, mixed mode)

>I get fewer socket hangs in 1.5 than 1.4.

I'll see if we can update our server to 1.5, thanks!

>Also, the mapred fetcher has been changed to succeed even when 
>threads hang.  Perhaps we should change the 0.7 fetcher similarly? 
>I think we should probably go even farther, and kill threads which 
>take longer than a timeout to process a url.  Thread.stop() is 
>theoretically unsafe, but I've used it in the past for this sort of 
>thing and never traced subsequent problems back to it...

I thought the issue with Thread.stop() is that it won't interrupt a 
hung java.io read, and that's why java.nio (which is interruptible) 
is preferred.

But from what I'm seeing, Thread.stop() should work, since there is a 
trickle of data coming in from the remote host, and thus the read 
calls should be returning.

I'll give this a try as well.
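For context on the hang itself: the blocked frame in the trace below is a plain java.io socket read, which has no deadline unless one is set on the socket. A self-contained sketch (not Nutch code; the class and method names are invented) of how Socket.setSoTimeout turns an indefinite read into a SocketTimeoutException:

```java
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    // Connects to a local server that accepts but never sends anything,
    // then tries to read. Without setSoTimeout the read would block
    // indefinitely (the socketRead0 hang); with it, the read fails after
    // 'millis' milliseconds.
    static boolean readTimesOut(int millis) throws Exception {
        try (ServerSocket silent = new ServerSocket(0)) {   // never writes
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress("127.0.0.1",
                        silent.getLocalPort()), millis);
                s.setSoTimeout(millis);                     // bound each read
                InputStream in = s.getInputStream();
                try {
                    in.read();                              // would hang without the timeout
                    return false;
                } catch (SocketTimeoutException expected) {
                    return true;                            // read gave up as configured
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (!readTimesOut(200)) throw new AssertionError("read did not time out");
        System.out.println("read timed out as expected");
    }
}
```

A per-read timeout like this bounds a dead-silent host, though it does not by itself cover the trickling-data case described here, where each read returns a little data and so keeps resetting the clock.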

Thanks,

-- Ken

=====================================================================
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
at java.io.BufferedInputStream.read(BufferedInputStream.java:277)
- locked <0x27252050> (a java.io.BufferedInputStream)
at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:169)
at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:183)
at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(ChunkedInputStream.java:368)
at org.apache.commons.httpclient.ContentLengthInputStream.close(ContentLengthInputStream.java:117)
at java.io.FilterInputStream.close(FilterInputStream.java:159)
at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(AutoCloseInputStream.java:176)
at org.apache.commons.httpclient.AutoCloseInputStream.close(AutoCloseInputStream.java:140)
at org.apache.nutch.protocol.httpclient.HttpResponse.runResponse(HttpResponse.java:159)
at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:97)
at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:222)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:150)

-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: fetch questions - freezing

Posted by Doug Cutting <cu...@nutch.org>.
Ken Krugler wrote:
> We're only using the html & text parsers, so I don't think that's the 
> problem. Plus we're dumping the thread stack when it hangs, and it's always 
> in ChunkedInputStream.exhaustInputStream() (see trace below).

The trace did not make it.

Have you tried protocol-http instead of protocol-httpclient?  Is it any 
better?  What JVM are you running?  I get fewer socket hangs in 1.5 than 
1.4.

Also, the mapred fetcher has been changed to succeed even when threads 
hang.  Perhaps we should change the 0.7 fetcher similarly?  I think we 
should probably go even farther, and kill threads which take longer than 
a timeout to process a url.  Thread.stop() is theoretically unsafe, but 
I've used it in the past for this sort of thing and never traced 
subsequent problems back to it...
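Doug's per-URL timeout idea can be sketched as a watchdog that records when each thread starts a URL and reaps any thread still on it past a deadline. This is illustrative only, not the Nutch fetcher's code; interrupt() is used below where Doug's Thread.stop() would go, with the caveat that interrupt() alone will not unblock a java.io read (which is exactly why stop() comes up at all):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical watchdog: each fetcher thread registers when it starts a
// URL; a monitor thread periodically reaps any thread that has been on
// the same URL longer than the timeout.
public class FetchWatchdog {
    private final Map<Thread, Long> started = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public FetchWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by a fetcher thread just before it begins a URL.
    public void beginFetch() {
        started.put(Thread.currentThread(), System.currentTimeMillis());
    }

    // Called by a fetcher thread when a URL finished normally.
    public void endFetch() {
        started.remove(Thread.currentThread());
    }

    // Called periodically from a monitor thread; returns how many overdue
    // threads were reaped. Thread.stop() would replace interrupt() here
    // for threads truly wedged in blocking io, accepting its risks.
    public int reapOverdue() {
        int reaped = 0;
        long now = System.currentTimeMillis();
        for (Map.Entry<Thread, Long> e : started.entrySet()) {
            if (now - e.getValue() > timeoutMillis) {
                e.getKey().interrupt();
                started.remove(e.getKey());
                reaped++;
            }
        }
        return reaped;
    }
}
```

The map-and-monitor split keeps the hot fetch path to two cheap map operations; only the monitor pays the cost of scanning for stragglers.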

Doug

Re: fetch questions - freezing

Posted by Ken Krugler <kk...@transpac.com>.
>>>I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 
>>>300k URL list.
>>>
>>>Initially, it's able to reach ~25 pages/s with 150 threads. The 
>>>fetcher gets progressively slower though, dropping down to about 
>>>~15 pages/s after about 2-3 hours or so and continues to slow 
>>>down. I've seen a few references on these lists to the issue, but 
>>>I'm not clear on whether it's expected behaviour or if there's a 
>>>solution to it. I've also noticed that the process takes up more 
>>>and more memory as it runs; is this expected as well?
>>
>>We've run into a similar situation, though we're using Nutch 0.7. 
>>What seems to be happening is that a host is slowly trickling data 
>>back to us. This happens when we're trying to release the 
>>connection, and we get stuck in the commons-httpclient code at 
>>ChunkedInputStream.exhaustInputStream().
>>
>>I have a theory that this happens when our http protocol max size 
>>limit is hit. The protocol-httpclient plugin reads up to the limit 
>>(in our case, 1MB) and then tries to release the connection, but 
>>for some reason the host keeps sending us data, albeit at some very 
>>slow rate. I was seeing 30Kbits/second or so.
>>
>>Anyway, I've added the commons-httpclient code to my project and am 
>>plugging in some additional logging to help track down the issue.
>
>I would appreciate any feedback. Please also note that you need to 
>eliminate other factors, like the limit of threads per host, but 
>most notably the overhead of parsing - please use the -noParse flag 
>to fetcher for all those experiments. In the past it was common for 
>the fetcher to be stuck in a buggy parser plugin, so you will need 
>to eliminate this factor.

We're only using the html & text parsers, so I don't think that's the 
problem. Plus we're dumping the thread stack when it hangs, and it's 
always in ChunkedInputStream.exhaustInputStream() (see 
trace below).

We've left the # of threads per host set to 1, and varied the total 
number of threads from 50 up to 400. Increasing from 50 to 200 
definitely improved performance, but going from 200 to 400 seemed to 
have minimal impact, other than boosting the CPU usage to 80%.

More research results to come...

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

RE: fetch questions - freezing

Posted by Steve Betts <sb...@minethurn.com>.
There is an issue with the PDFBox library shipped with Nutch 0.7. It will
hang parsing certain PDF files. PDFBox 0.7.2 fixes this issue.  If you are
parsing PDF files, then this could also be a problem.

Thanks,

Steve Betts
sbetts@minethurn.com
937-477-1797

-----Original Message-----
From: Byron Miller [mailto:byronmhome@yahoo.com]
Sent: Friday, October 28, 2005 8:10 AM
To: nutch-user@lucene.apache.org
Subject: Re: fetch questions - freezing

For what it's worth, I fetch my segments of 1 million
URLs with 80 threads at a time and no slowdowns.

I'll grab some of my stats and publish them, but I
haven't had problems with the fetcher slowing down like
this in a long time.

(Linux/CentOS 4.2 platform)

-byron

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Ken Krugler wrote:
>
> >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> >> 300k URL list.
> >>
> >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> >> fetcher gets progressively slower though, dropping down to about ~15
> >> pages/s after about 2-3 hours or so and continues to slow down. I've
> >> seen a few references on these lists to the issue, but I'm not clear
> >> on whether it's expected behaviour or if there's a solution to it. I've
> >> also noticed that the process takes up more and more memory as it
> >> runs; is this expected as well?
> >
> > We've run into a similar situation, though we're using Nutch 0.7. What
> > seems to be happening is that a host is slowly trickling data back to
> > us. This happens when we're trying to release the connection, and we
> > get stuck in the commons-httpclient code at
> > ChunkedInputStream.exhaustInputStream().
> >
> > I have a theory that this happens when our http protocol max size
> > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > our case, 1MB) and then tries to release the connection, but for some
> > reason the host keeps sending us data, albeit at some very slow rate.
> > I was seeing 30Kbits/second or so.
> >
> > Anyway, I've added the commons-httpclient code to my project and am
> > plugging in some additional logging to help track down the issue.
>
> I would appreciate any feedback. Please also note that you need to
> eliminate other factors, like the limit of threads per host, but most
> notably the overhead of parsing - please use the -noParse flag to
> fetcher for all those experiments. In the past it was common for the
> fetcher to be stuck in a buggy parser plugin, so you will need to
> eliminate this factor.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com



Re: fetch questions - freezing

Posted by Earl Cahill <ca...@yahoo.com>.
Trunk?  Map reduce?  Could you describe your box
setup, job division, and maybe post your
conf/nutch-site.xml file?

Just trying to get things going and not having much luck
with the mapreduce branch. I also tried trunk; the
crawl stops around 30,000 pages (out of maybe a million),
and once it's done I can't get results to show up
via Tomcat.

Thanks,
Earl


--- Byron Miller <by...@yahoo.com> wrote:

> For what it's worth, I fetch my segments of 1 million URLs with 80
> threads at a time and no slowdowns.
>
> I'll grab some of my stats and publish them, but I haven't had
> problems with the fetcher slowing down like this in a long time.
>
> (Linux/CentOS 4.2 platform)
>
> -byron
>
> --- Andrzej Bialecki <ab...@getopt.org> wrote:
>
> > Ken Krugler wrote:
> >
> > >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> > >> 300k URL list.
> > >>
> > >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> > >> fetcher gets progressively slower though, dropping down to about ~15
> > >> pages/s after about 2-3 hours or so and continues to slow down. I've
> > >> seen a few references on these lists to the issue, but I'm not clear
> > >> on whether it's expected behaviour or if there's a solution to it. I've
> > >> also noticed that the process takes up more and more memory as it
> > >> runs; is this expected as well?
> > >
> > > We've run into a similar situation, though we're using Nutch 0.7. What
> > > seems to be happening is that a host is slowly trickling data back to
> > > us. This happens when we're trying to release the connection, and we
> > > get stuck in the commons-httpclient code at
> > > ChunkedInputStream.exhaustInputStream().
> > >
> > > I have a theory that this happens when our http protocol max size
> > > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > > our case, 1MB) and then tries to release the connection, but for some
> > > reason the host keeps sending us data, albeit at some very slow rate.
> > > I was seeing 30Kbits/second or so.
> > >
> > > Anyway, I've added the commons-httpclient code to my project and am
> > > plugging in some additional logging to help track down the issue.
> >
> > I would appreciate any feedback. Please also note that you need to
> > eliminate other factors, like the limit of threads per host, but most
> > notably the overhead of parsing - please use the -noParse flag to
> > fetcher for all those experiments. In the past it was common for the
> > fetcher to be stuck in a buggy parser plugin, so you will need to
> > eliminate this factor.
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com



	
		

Re: fetch questions - freezing

Posted by Byron Miller <by...@yahoo.com>.
For what it's worth, I fetch my segments of 1 million
URLs with 80 threads at a time and no slowdowns.

I'll grab some of my stats and publish them, but I
haven't had problems with the fetcher slowing down like
this in a long time.

(Linux/CentOS 4.2 platform)

-byron

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Ken Krugler wrote:
>
> >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> >> 300k URL list.
> >>
> >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> >> fetcher gets progressively slower though, dropping down to about ~15
> >> pages/s after about 2-3 hours or so and continues to slow down. I've
> >> seen a few references on these lists to the issue, but I'm not clear
> >> on whether it's expected behaviour or if there's a solution to it. I've
> >> also noticed that the process takes up more and more memory as it
> >> runs; is this expected as well?
> >
> > We've run into a similar situation, though we're using Nutch 0.7. What
> > seems to be happening is that a host is slowly trickling data back to
> > us. This happens when we're trying to release the connection, and we
> > get stuck in the commons-httpclient code at
> > ChunkedInputStream.exhaustInputStream().
> >
> > I have a theory that this happens when our http protocol max size
> > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > our case, 1MB) and then tries to release the connection, but for some
> > reason the host keeps sending us data, albeit at some very slow rate.
> > I was seeing 30Kbits/second or so.
> >
> > Anyway, I've added the commons-httpclient code to my project and am
> > plugging in some additional logging to help track down the issue.
>
> I would appreciate any feedback. Please also note that you need to
> eliminate other factors, like the limit of threads per host, but most
> notably the overhead of parsing - please use the -noParse flag to
> fetcher for all those experiments. In the past it was common for the
> fetcher to be stuck in a buggy parser plugin, so you will need to
> eliminate this factor.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com


Re: fetch questions - freezing

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ken Krugler wrote:

>> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 
>> 300k URL list.
>>
>> Initially, it's able to reach ~25 pages/s with 150 threads. The 
>> fetcher gets progressively slower though, dropping down to about ~15 
>> pages/s after about 2-3 hours or so and continues to slow down. I've 
>> seen a few references on these lists to the issue, but I'm not clear 
>> on whether it's expected behaviour or if there's a solution to it. I've 
>> also noticed that the process takes up more and more memory as it 
>> runs; is this expected as well?
>
>
> We've run into a similar situation, though we're using Nutch 0.7. What 
> seems to be happening is that a host is slowly trickling data back to 
> us. This happens when we're trying to release the connection, and we 
> get stuck in the commons-httpclient code at 
> ChunkedInputStream.exhaustInputStream().
>
> I have a theory that this happens when our http protocol max size 
> limit is hit. The protocol-httpclient plugin reads up to the limit (in 
> our case, 1MB) and then tries to release the connection, but for some 
> reason the host keeps sending us data, albeit at some very slow rate. 
> I was seeing 30Kbits/second or so.
>
> Anyway, I've added the commons-httpclient code to my project and am 
> plugging in some additional logging to help track down the issue.


I would appreciate any feedback. Please also note that you need to 
eliminate other factors, like the limit of threads per host, but most 
notably the overhead of parsing - please use the -noParse flag to 
fetcher for all those experiments. In the past it was common for the 
fetcher to be stuck in a buggy parser plugin, so you will need to 
eliminate this factor.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: fetch questions - freezing

Posted by Ken Krugler <kk...@transpac.com>.
>I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 
>300k URL list.
>
>Initially, it's able to reach ~25 pages/s with 150 threads. The 
>fetcher gets progressively slower though, dropping down to about ~15 
>pages/s after about 2-3 hours or so and continues to slow down. I've 
>seen a few references on these lists to the issue, but I'm not clear 
>on whether it's expected behaviour or if there's a solution to it. I've 
>also noticed that the process takes up more and more memory as it 
>runs; is this expected as well?

We've run into a similar situation, though we're using Nutch 0.7. 
What seems to be happening is that a host is slowly trickling data 
back to us. This happens when we're trying to release the 
connection, and we get stuck in the commons-httpclient code at 
ChunkedInputStream.exhaustInputStream().

I have a theory that this happens when our http protocol max size 
limit is hit. The protocol-httpclient plugin reads up to the limit 
(in our case, 1MB) and then tries to release the connection, but for 
some reason the host keeps sending us data, albeit at some very slow 
rate. I was seeing 30Kbits/second or so.

Anyway, I've added the commons-httpclient code to my project and am 
plugging in some additional logging to help track down the issue.
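The two knobs involved in this theory live in conf/nutch-site.xml. A sketch, assuming the standard http.content.limit and http.timeout properties from nutch-default.xml; the values are only examples (1MB matching the limit mentioned above):

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides; values are illustrative. -->
<nutch-conf>
  <property>
    <name>http.content.limit</name>
    <!-- Max bytes fetched per page before the content is truncated
         (the 1MB limit mentioned above). -->
    <value>1048576</value>
  </property>
  <property>
    <name>http.timeout</name>
    <!-- Network timeout in milliseconds for each request. -->
    <value>10000</value>
  </property>
</nutch-conf>
```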

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: fetch questions - freezing

Posted by Ken van Mulder <ke...@wavefire.com>.
I've got the default plugins enabled:

nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)

And kill -QUIT works great, right up until the process stalls; it 
works initially, but not after the stall. kill seems to have some 
issues with the Java process in general: the only way to kill a 
running fetch with numerous threads is kill -9.

I'll be trying with -noparse and a few different profilers in the next 
bit to see what happens.

Doug Cutting wrote:
> Ken van Mulder wrote:
> 
>> Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher 
>> gets progressively slower though, dropping down to about ~15 pages/s 
>> after about 2-3 hours or so and continues to slow down. I've seen a 
>> few references on these lists to the issue, but I'm not clear on whether 
>> it's expected behaviour or if there's a solution to it. I've also 
>> noticed that the process takes up more and more memory as it runs; is 
>> this expected as well?
> 
> 
> What parse plugins do you have enabled?
> 
> The best way to diagnose these problems is to 'kill -QUIT' an offending 
> fetcher process.  This will dump the stack of every fetcher thread. This 
> will likely look quite different at the start of your run than later in 
> the run, and that difference should point to the problem.
> 
> In the past I have seen these symptoms primarily with parser plugins.  I 
> have also seen threads hang infinitely in a socket read, but that is 
> much rarer.
> 
> Doug
> 
> 

-- 
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)

Re: fetch questions - freezing

Posted by Doug Cutting <cu...@nutch.org>.
Ken van Mulder wrote:
> Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher 
> gets progressively slower though, dropping down to about ~15 pages/s 
> after about 2-3 hours or so and continues to slow down. I've seen a few 
> references on these lists to the issue, but I'm not clear on whether it's 
> expected behaviour or if there's a solution to it. I've also noticed 
> that the process takes up more and more memory as it runs; is this 
> expected as well?

What parse plugins do you have enabled?

The best way to diagnose these problems is to 'kill -QUIT' an offending 
fetcher process.  This will dump the stack of every fetcher thread. 
This will likely look quite different at the start of your run than 
later in the run, and that difference should point to the problem.

In the past I have seen these symptoms primarily with parser plugins.  I 
have also seen threads hang infinitely in a socket read, but that is 
much rarer.

Doug