Posted to user@nutch.apache.org by og...@yahoo.com on 2006/08/12 07:17:51 UTC

On fetcher slowness

Hello,

Several people have reported issues with a slow fetcher in 0.8...

I run Nutch on a dual-CPU (+HT) box and have noticed that the fetch speed didn't increase when I went from 100 threads to 200 threads.  Has anyone else observed the same?

I was using 2 map tasks (mapred.map.tasks property) in both cases, and the aggregate fetch speed was between 20 and 40 pages/sec.  This was a fetch of 50K+ URLs from a diverse set of servers.

While crawling, strace -p <PID> and strace -ff -p <PID> show a LOT of gettimeofday calls.  Running strace several times in a row kept showing that gettimeofday is the most frequent system call.
Has anyone tried tracing the fetcher process?  Where do these calls come from?  Any call to new Date() or Calendar.getInstance(), as must be done for every single logging call, perhaps?
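
A quick, throwaway way to sanity-check whether timestamping alone could account for that strace noise is to run something like the following under strace -c -f and compare the gettimeofday count with the loop count (TimeCallCheck is just a made-up name for this sketch):

public class TimeCallCheck {
  public static void main(String[] args) {
    long sink = 0;
    for (int i = 0; i < 1000000; i++) {
      // On Linux JVMs of this era System.currentTimeMillis() is typically
      // backed by gettimeofday(2), which is what strace keeps reporting.
      sink += System.currentTimeMillis();
    }
    // Print the sum so the JIT cannot drop the loop as dead code.
    System.out.println(sink);
  }
}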

I can certainly be impolite and lower fetcher.server.delay to 1 second or even 0, but I'd like to be polite.

I saw Ken Krugler's email suggesting increasing the number of fetcher threads to 2000+ and setting the maximum Java thread stack size to 512k with -Xss.  Has anyone other than Ken tried this with success?  Wouldn't the JVM go crazy context switching between this many threads?
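
For anyone tuning the same knobs: both are ordinary nutch-site.xml properties, and a sketch with purely illustrative values (not recommendations) looks like this:

<property>
 <name>fetcher.threads.fetch</name>
 <value>200</value>
 <description>Number of threads the fetcher runs in each fetch task.</description>
</property>

<property>
 <name>fetcher.server.delay</name>
 <value>1.0</value>
 <description>Seconds to wait between successive requests to the same server.
 A Crawl-Delay from robots.txt, when present, takes precedence.</description>
</property>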

Thanks,
Otis




Re: On fetcher slowness

Posted by Murat Ali Bayir <mu...@agmlab.com>.
Hi everybody, I want to know the error percentage others see when using 2000+ 
threads. When I used 2000 threads I got 80-90% errors; when we used 1000 
threads we got 20% errors.  Is there any relation between the number of 
threads and the error rate? Can anyone share a configuration with 2000+ 
threads that gets a lower error rate? We have a random set of urls from a 
diverse set of hosts.

Dennis Kubes wrote:

> Sorry, yeah it was 344 not 334.
>
> Dennis
>
> Ken Krugler wrote:
>
>>> Here is a lightly tested patch for the crawl delay that allows urls 
>>> with crawl delay set greater than x number of seconds to be ignored. 
>>> I have currently run this on over 2 million urls and it is working 
>>> good.  With this patch I also added this to my nutch-site.xml file 
>>> for ignoring sites with crawl delay > 30 seconds.  The value can be 
>>> changed to suit.  I have seen crawl delays as high at 259200 seconds 
>>> in our crawls.
>>>
>>> <property>
>>> <name>http.max.crawl.delay</name>
>>> <value>30</value>
>>> <description>
>>> If the crawl delay in robots.txt is set to greater than this value 
>>> then the
>>> fetcher will ignore this page.  If set to -1 the fetcher will never 
>>> ignore
>>> pages and will wait the amount of time retrieved from robots.txt 
>>> crawl delay.
>>> This can cause hung threads if the delay is >= task timeout value.  
>>> If all
>>> threads get hung it can cause the fetcher task to about prematurely.
>>> </description>
>>> </property>
>>
>>
>> Thanks, this is useful. We did a survey of a bunch of sites, and 
>> found crawl delay values up to 99999 seconds.
>>
>>> The most recent patches to fetcher (not fetcher2) with NUTCH-334 
>>> seems to have speed up our fetching dramatically.  We are only using 
>>> about 50 fetchers but are consistently fetcher 1M + urls per day. 
>>> The patch attatched and the 334 patches will help if staying on 0.8. 
>>> If moving forward I think the new Fetcher2 codebase is a better 
>>> solution though still a new one.
>>
>>
>> NUTCH-334 or NUTCH-344? I'm assuming the latter 
>> (http://issues.apache.org/jira/browse/NUTCH-344).
>>
>> Thanks,
>>
>> -- Ken
>>
>>
>>> Ken Krugler wrote:
>>>
>>>>> On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Several people reported issues with slow fetcher in 0.8...
>>>>>>
>>>>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>>>>> fetch speed didn't increase when I went from using 100 threads, 
>>>>>> to 200 threads.  Has anyone else observed the same?
>>>>>>
>>>>>> I was using 2 map tasks (mapred.map.tasks property) in both 
>>>>>> cases, and the aggregate fetch speed was between 20 and 40 
>>>>>> pages/sec. This was a fetch of 50K+ URLs from a diverse set of 
>>>>>> servers.
>>>>>>
>>>>>> While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of 
>>>>>> gettimeofday calls.  Running strace several times in a row kept 
>>>>>> showing that gettimeofday is the most frequent system call.
>>>>>> Has anyone tried tracing the fetcher process?  Where do these 
>>>>>> calls come from?  Any call to new Date() or 
>>>>>> Calendar.getInstance(), as must be done for every single logging 
>>>>>> call, perhaps?
>>>>>>
>>>>>> I can certainly be impolite and lower fetcher.server.delay to 1 
>>>>>> second or even 0, but I'd like to be polite.
>>>>>>
>>>>>> I saw Ken Krugle's email suggesting to increast the number of 
>>>>>> fetcher threads to 2000+ and set the maximal java thread stack 
>>>>>> size to 512k with -Xss.  Has anyone other than Ken tried this 
>>>>>> with success?  Wouldn't the JVM go crazy context switching 
>>>>>> between this many threads?
>>>>>
>>>>
>>>> Note that most of the time these fetcher threads are all blocked, 
>>>> waiting for other threads that are already fetching from the same 
>>>> IP address. So there's not a lot of thrashing.
>>>>
>>>>> I been working with 512k -Xss (Ken Krugle's suggestion) and it works
>>>>> well. However number of fetcher for my part is 2500+.. I had to play
>>>>> around with this number to match my bandwidth limitation, but now I
>>>>> maximize my full bandwidth. But the problem that I run into are the
>>>>> fetcher threads hangs, and for crawl delay/robots.txt file (Please 
>>>>> see
>>>>> Dennis Kubes posting on this).
>>>>
>>>>
>>>> Yes, these are definitely problems.
>>>>
>>>> Stefan has been working on a queue-based fetcher that uses NIO. 
>>>> Seems very promising, but not yet ready for prime time.
>>>>
>>>> -- Ken
>>>
>>>
>>>
>>> Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>>> ===================================================================
>>> --- src/java/org/apache/nutch/protocol/ProtocolStatus.java     
>>> (revision 428457)
>>> +++ src/java/org/apache/nutch/protocol/ProtocolStatus.java    
>>> (working copy)
>>> @@ -60,6 +60,10 @@
>>>    public static final int NOTFETCHING          = 20;
>>>    /** Unchanged since the last fetch. */
>>>    public static final int NOTMODIFIED          = 21;
>>> +  /** Request was refused by protocol plugins, because it would block.
>>> +   * The expected number of milliseconds to wait before retry may 
>>> be provided
>>> +   * in args. */
>>> +  public static final int WOULDBLOCK          = 22;
>>>      // Useful static instances for status codes that don't usually 
>>> require any
>>>    // additional arguments.
>>> Index: 
>>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>>>
>>> ===================================================================
>>> --- 
>>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>>>     (revision 430392)
>>> +++ 
>>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>>>     (working copy)
>>> @@ -125,6 +125,9 @@
>>>
>>>    /** Do we use HTTP/1.1? */
>>>    protected boolean useHttp11 = false;
>>> +
>>> +  /** Ignore page if crawl delay over this number of seconds */
>>> +  private int maxCrawlDelay = -1;
>>>
>>>    /** Creates a new instance of HttpBase */
>>>    public HttpBase() {
>>> @@ -152,6 +155,7 @@
>>>          this.userAgent = 
>>> getAgentString(conf.get("http.agent.name"), 
>>> conf.get("http.agent.version"), conf
>>>                  .get("http.agent.description"), 
>>> conf.get("http.agent.url"), conf.get("http.agent.email"));
>>>          this.serverDelay = (long) 
>>> (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>>> +        this.maxCrawlDelay = (int) 
>>> conf.getInt("http.max.crawl.delay", -1);
>>>          // backward-compatible default setting
>>>          this.byIP = 
>>> conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>>>          this.useHttp11 = conf.getBoolean("http.http11", false);
>>> @@ -185,6 +189,17 @@
>>>              long crawlDelay = robots.getCrawlDelay(this, u);
>>>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>>> +
>>> +      int crawlDelaySeconds = (int)(delay / 1000);
>>> +      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>>> +        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: 
>>> max=" +
>>> +          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>>> +          " seconds");
>>> +        Content c = new Content(u.toString(), u.toString(), 
>>> EMPTY_CONTENT,
>>> +          null, null, this.conf);
>>> +        return new ProtocolOutput(c, new 
>>> ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>>> +      }
>>> +
>>>        String host = blockAddr(u, delay);
>>>        Response response;
>>>        try {
>>
>>
>>
>
>
>


Re: On fetcher slowness

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Sorry, yeah it was 344 not 334.

Dennis

Ken Krugler wrote:
>> Here is a lightly tested patch for the crawl delay that allows urls 
>> with crawl delay set greater than x number of seconds to be ignored. 
>> I have currently run this on over 2 million urls and it is working 
>> good.  With this patch I also added this to my nutch-site.xml file 
>> for ignoring sites with crawl delay > 30 seconds.  The value can be 
>> changed to suit.  I have seen crawl delays as high at 259200 seconds 
>> in our crawls.
>>
>> <property>
>> <name>http.max.crawl.delay</name>
>> <value>30</value>
>> <description>
>> If the crawl delay in robots.txt is set to greater than this value 
>> then the
>> fetcher will ignore this page.  If set to -1 the fetcher will never 
>> ignore
>> pages and will wait the amount of time retrieved from robots.txt 
>> crawl delay.
>> This can cause hung threads if the delay is >= task timeout value.  
>> If all
>> threads get hung it can cause the fetcher task to about prematurely.
>> </description>
>> </property>
>
> Thanks, this is useful. We did a survey of a bunch of sites, and found 
> crawl delay values up to 99999 seconds.
>
>> The most recent patches to fetcher (not fetcher2) with NUTCH-334 
>> seems to have speed up our fetching dramatically.  We are only using 
>> about 50 fetchers but are consistently fetcher 1M + urls per day. The 
>> patch attatched and the 334 patches will help if staying on 0.8. If 
>> moving forward I think the new Fetcher2 codebase is a better solution 
>> though still a new one.
>
> NUTCH-334 or NUTCH-344? I'm assuming the latter 
> (http://issues.apache.org/jira/browse/NUTCH-344).
>
> Thanks,
>
> -- Ken
>
>
>> Ken Krugler wrote:
>>>> On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>>>>> Hello,
>>>>>
>>>>> Several people reported issues with slow fetcher in 0.8...
>>>>>
>>>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>>>> fetch speed didn't increase when I went from using 100 threads, to 
>>>>> 200 threads.  Has anyone else observed the same?
>>>>>
>>>>> I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>>>> and the aggregate fetch speed was between 20 and 40 pages/sec. 
>>>>> This was a fetch of 50K+ URLs from a diverse set of servers.
>>>>>
>>>>> While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of 
>>>>> gettimeofday calls.  Running strace several times in a row kept 
>>>>> showing that gettimeofday is the most frequent system call.
>>>>> Has anyone tried tracing the fetcher process?  Where do these 
>>>>> calls come from?  Any call to new Date() or 
>>>>> Calendar.getInstance(), as must be done for every single logging 
>>>>> call, perhaps?
>>>>>
>>>>> I can certainly be impolite and lower fetcher.server.delay to 1 
>>>>> second or even 0, but I'd like to be polite.
>>>>>
>>>>> I saw Ken Krugle's email suggesting to increast the number of 
>>>>> fetcher threads to 2000+ and set the maximal java thread stack 
>>>>> size to 512k with -Xss.  Has anyone other than Ken tried this with 
>>>>> success?  Wouldn't the JVM go crazy context switching between this 
>>>>> many threads?
>>>
>>> Note that most of the time these fetcher threads are all blocked, 
>>> waiting for other threads that are already fetching from the same IP 
>>> address. So there's not a lot of thrashing.
>>>
>>>> I been working with 512k -Xss (Ken Krugle's suggestion) and it works
>>>> well. However number of fetcher for my part is 2500+.. I had to play
>>>> around with this number to match my bandwidth limitation, but now I
>>>> maximize my full bandwidth. But the problem that I run into are the
>>>> fetcher threads hangs, and for crawl delay/robots.txt file (Please see
>>>> Dennis Kubes posting on this).
>>>
>>> Yes, these are definitely problems.
>>>
>>> Stefan has been working on a queue-based fetcher that uses NIO. 
>>> Seems very promising, but not yet ready for prime time.
>>>
>>> -- Ken
>>
>>
>> Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>> ===================================================================
>> --- src/java/org/apache/nutch/protocol/ProtocolStatus.java 
>>     (revision 428457)
>> +++ src/java/org/apache/nutch/protocol/ProtocolStatus.java    
>> (working copy)
>> @@ -60,6 +60,10 @@
>>    public static final int NOTFETCHING          = 20;
>>    /** Unchanged since the last fetch. */
>>    public static final int NOTMODIFIED          = 21;
>> +  /** Request was refused by protocol plugins, because it would block.
>> +   * The expected number of milliseconds to wait before retry may be 
>> provided
>> +   * in args. */
>> +  public static final int WOULDBLOCK          = 22;
>>      // Useful static instances for status codes that don't usually 
>> require any
>>    // additional arguments.
>> Index: 
>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>>
>> ===================================================================
>> --- 
>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>>     (revision 430392)
>> +++ 
>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>>     (working copy)
>> @@ -125,6 +125,9 @@
>>
>>    /** Do we use HTTP/1.1? */
>>    protected boolean useHttp11 = false;
>> +
>> +  /** Ignore page if crawl delay over this number of seconds */
>> +  private int maxCrawlDelay = -1;
>>
>>    /** Creates a new instance of HttpBase */
>>    public HttpBase() {
>> @@ -152,6 +155,7 @@
>>          this.userAgent = getAgentString(conf.get("http.agent.name"), 
>> conf.get("http.agent.version"), conf
>>                  .get("http.agent.description"), 
>> conf.get("http.agent.url"), conf.get("http.agent.email"));
>>          this.serverDelay = (long) 
>> (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>> +        this.maxCrawlDelay = (int) 
>> conf.getInt("http.max.crawl.delay", -1);
>>          // backward-compatible default setting
>>          this.byIP = 
>> conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>>          this.useHttp11 = conf.getBoolean("http.http11", false);
>> @@ -185,6 +189,17 @@
>>              long crawlDelay = robots.getCrawlDelay(this, u);
>>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>> +
>> +      int crawlDelaySeconds = (int)(delay / 1000);
>> +      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>> +        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: 
>> max=" +
>> +          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>> +          " seconds");
>> +        Content c = new Content(u.toString(), u.toString(), 
>> EMPTY_CONTENT,
>> +          null, null, this.conf);
>> +        return new ProtocolOutput(c, new 
>> ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>> +      }
>> +
>>        String host = blockAddr(u, delay);
>>        Response response;
>>        try {
>
>

Re: On fetcher slowness

Posted by Ken Krugler <kk...@transpac.com>.
>Here is a lightly tested patch for the crawl delay that allows urls 
>with crawl delay set greater than x number of seconds to be ignored. 
>I have currently run this on over 2 million urls and it is working 
>good.  With this patch I also added this to my nutch-site.xml file 
>for ignoring sites with crawl delay > 30 seconds.  The value can be 
>changed to suit.  I have seen crawl delays as high at 259200 seconds 
>in our crawls.
>
><property>
><name>http.max.crawl.delay</name>
><value>30</value>
><description>
>If the crawl delay in robots.txt is set to greater than this value then the
>fetcher will ignore this page.  If set to -1 the fetcher will never ignore
>pages and will wait the amount of time retrieved from robots.txt crawl delay.
>This can cause hung threads if the delay is >= task timeout value.  If all
>threads get hung it can cause the fetcher task to about prematurely.
></description>
></property>

Thanks, this is useful. We did a survey of a bunch of sites, and 
found crawl delay values up to 99999 seconds.

>The most recent patches to fetcher (not fetcher2) with NUTCH-334 
>seems to have speed up our fetching dramatically.  We are only using 
>about 50 fetchers but are consistently fetcher 1M + urls per day. 
>The patch attatched and the 334 patches will help if staying on 0.8. 
>If moving forward I think the new Fetcher2 codebase is a better 
>solution though still a new one.

NUTCH-334 or NUTCH-344? I'm assuming the latter 
(http://issues.apache.org/jira/browse/NUTCH-344).

Thanks,

-- Ken


>Ken Krugler wrote:
>>>On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>>>>Hello,
>>>>
>>>>Several people reported issues with slow fetcher in 0.8...
>>>>
>>>>I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>>>fetch speed didn't increase when I went from using 100 threads, 
>>>>to 200 threads.  Has anyone else observed the same?
>>>>
>>>>I was using 2 map tasks (mapred.map.tasks property) in both 
>>>>cases, and the aggregate fetch speed was between 20 and 40 
>>>>pages/sec. This was a fetch of 50K+ URLs from a diverse set of 
>>>>servers.
>>>>
>>>>While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of 
>>>>gettimeofday calls.  Running strace several times in a row kept 
>>>>showing that gettimeofday is the most frequent system call.
>>>>Has anyone tried tracing the fetcher process?  Where do these 
>>>>calls come from?  Any call to new Date() or 
>>>>Calendar.getInstance(), as must be done for every single logging 
>>>>call, perhaps?
>>>>
>>>>I can certainly be impolite and lower fetcher.server.delay to 1 
>>>>second or even 0, but I'd like to be polite.
>>>>
>>>>I saw Ken Krugle's email suggesting to increast the number of 
>>>>fetcher threads to 2000+ and set the maximal java thread stack 
>>>>size to 512k with -Xss.  Has anyone other than Ken tried this 
>>>>with success?  Wouldn't the JVM go crazy context switching 
>>>>between this many threads?
>>
>>Note that most of the time these fetcher threads are all blocked, 
>>waiting for other threads that are already fetching from the same 
>>IP address. So there's not a lot of thrashing.
>>
>>>I been working with 512k -Xss (Ken Krugle's suggestion) and it works
>>>well. However number of fetcher for my part is 2500+.. I had to play
>>>around with this number to match my bandwidth limitation, but now I
>>>maximize my full bandwidth. But the problem that I run into are the
>>>fetcher threads hangs, and for crawl delay/robots.txt file (Please see
>>>Dennis Kubes posting on this).
>>
>>Yes, these are definitely problems.
>>
>>Stefan has been working on a queue-based fetcher that uses NIO. 
>>Seems very promising, but not yet ready for prime time.
>>
>>-- Ken
>
>
>Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>===================================================================
>--- src/java/org/apache/nutch/protocol/ProtocolStatus.java 
>	(revision 428457)
>+++ src/java/org/apache/nutch/protocol/ProtocolStatus.java	(working copy)
>@@ -60,6 +60,10 @@
>    public static final int NOTFETCHING          = 20;
>    /** Unchanged since the last fetch. */
>    public static final int NOTMODIFIED          = 21;
>+  /** Request was refused by protocol plugins, because it would block.
>+   * The expected number of milliseconds to wait before retry may be provided
>+   * in args. */
>+  public static final int WOULDBLOCK          = 22;
>   
>    // Useful static instances for status codes that don't usually require any
>    // additional arguments.
>Index: 
>src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>===================================================================
>--- 
>src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>	(revision 430392)
>+++ 
>src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
>	(working copy)
>@@ -125,6 +125,9 @@
>
>    /** Do we use HTTP/1.1? */
>    protected boolean useHttp11 = false;
>+ 
>+  /** Ignore page if crawl delay over this number of seconds */
>+  private int maxCrawlDelay = -1;
>
>    /** Creates a new instance of HttpBase */
>    public HttpBase() {
>@@ -152,6 +155,7 @@
>          this.userAgent = 
>getAgentString(conf.get("http.agent.name"), 
>conf.get("http.agent.version"), conf
>                  .get("http.agent.description"), 
>conf.get("http.agent.url"), conf.get("http.agent.email"));
>          this.serverDelay = (long) 
>(conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>+        this.maxCrawlDelay = (int) conf.getInt("http.max.crawl.delay", -1);
>          // backward-compatible default setting
>          this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>          this.useHttp11 = conf.getBoolean("http.http11", false);
>@@ -185,6 +189,17 @@
>       
>        long crawlDelay = robots.getCrawlDelay(this, u);
>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>+     
>+      int crawlDelaySeconds = (int)(delay / 1000);
>+      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>+        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
>+          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>+          " seconds");
>+        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
>+          null, null, this.conf);
>+        return new ProtocolOutput(c, new 
>ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>+      }
>+     
>        String host = blockAddr(u, delay);
>        Response response;
>        try {


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: On fetcher slowness

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Here is a lightly tested patch for the crawl delay that allows urls with a 
crawl delay greater than x seconds to be ignored.  I have 
now run this on over 2 million urls and it is working well.  With 
this patch I also added the following to my nutch-site.xml file to ignore 
sites with a crawl delay > 30 seconds.  The value can be changed to suit.  
I have seen crawl delays as high as 259200 seconds in our crawls.

<property>
 <name>http.max.crawl.delay</name>
 <value>30</value>
 <description>
 If the crawl delay in robots.txt is set greater than this value then the
 fetcher will ignore this page.  If set to -1 the fetcher will never ignore
 pages and will wait the amount of time retrieved from robots.txt crawl delay.
 This can cause hung threads if the delay is >= task timeout value.  If all
 threads get hung it can cause the fetcher task to abort prematurely.
 </description>
</property>
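
A related setting to keep in mind (my assumption, not part of the patch): whatever value you pick for http.max.crawl.delay should stay well below the MapReduce task timeout, since it is the long robots.txt delays that produce the hung-task behaviour described above. The stock Hadoop property, shown here with its usual default, is:

<property>
 <name>mapred.task.timeout</name>
 <value>600000</value>
 <description>Milliseconds before a task that neither reads input, writes
 output, nor reports status is killed.  600000 is ten minutes.</description>
</property>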

The most recent patches to the fetcher (not Fetcher2) with NUTCH-334 seem 
to have sped up our fetching dramatically.  We are only using about 50 
fetchers but are consistently fetching 1M+ urls per day.  The patch 
attached and the 334 patches will help if staying on 0.8.  If moving 
forward I think the new Fetcher2 codebase is a better solution, though 
still a new one.

Dennis

Ken Krugler wrote:
>> On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>>> Hello,
>>>
>>> Several people reported issues with slow fetcher in 0.8...
>>>
>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the fetch 
>>> speed didn't increase when I went from using 100 threads, to 200 
>>> threads.  Has anyone else observed the same?
>>>
>>> I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>> and the aggregate fetch speed was between 20 and 40 pages/sec. This 
>>> was a fetch of 50K+ URLs from a diverse set of servers.
>>>
>>> While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of 
>>> gettimeofday calls.  Running strace several times in a row kept 
>>> showing that gettimeofday is the most frequent system call.
>>> Has anyone tried tracing the fetcher process?  Where do these calls 
>>> come from?  Any call to new Date() or Calendar.getInstance(), as 
>>> must be done for every single logging call, perhaps?
>>>
>>> I can certainly be impolite and lower fetcher.server.delay to 1 
>>> second or even 0, but I'd like to be polite.
>>>
>>> I saw Ken Krugle's email suggesting to increast the number of 
>>> fetcher threads to 2000+ and set the maximal java thread stack size 
>>> to 512k with -Xss.  Has anyone other than Ken tried this with 
>>> success?  Wouldn't the JVM go crazy context switching between this 
>>> many threads?
>
> Note that most of the time these fetcher threads are all blocked, 
> waiting for other threads that are already fetching from the same IP 
> address. So there's not a lot of thrashing.
>
>> I been working with 512k -Xss (Ken Krugle's suggestion) and it works
>> well. However number of fetcher for my part is 2500+.. I had to play
>> around with this number to match my bandwidth limitation, but now I
>> maximize my full bandwidth. But the problem that I run into are the
>> fetcher threads hangs, and for crawl delay/robots.txt file (Please see
>> Dennis Kubes posting on this).
>
> Yes, these are definitely problems.
>
> Stefan has been working on a queue-based fetcher that uses NIO. Seems 
> very promising, but not yet ready for prime time.
>
> -- Ken

Re: [Nutch-general] On fetcher slowness

Posted by Andrzej Bialecki <ab...@getopt.org>.
ogjunk-nutch@yahoo.com wrote:
>
> Stefan has been working on a queue-based fetcher that uses NIO. Seems 
> very promising, but not yet ready for prime time.
>
> OG: yeah, I saw his email.  Kelvin worked on the same thing many months ago, pre-0.8, but it never made it into the trunk.  I'm looking forward to Stefan's code now.
>   

Please note that the patches in NUTCH-339 also implement a queue-based 
fetcher - and, unlike Stefan's version, it is protocol-independent. 
Whether that's good or bad, I'm not sure yet - perhaps you can only 
implement proper queueing in a protocol-dependent manner ... I'm waiting 
for Stefan's input on these patches, too.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Nutch-general] On fetcher slowness

Posted by og...@yahoo.com.
Hi,

----- Original Message ----
From: Ken Krugler <kk...@transpac.com>

>On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>>Hello,
>>
>>Several people reported issues with slow fetcher in 0.8...
>>
>>I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>fetch speed didn't increase when I went from using 100 threads, to 
>>200 threads.  Has anyone else observed the same?
>>
>>I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>and the aggregate fetch speed was between 20 and 40 pages/sec. 
>>This was a fetch of 50K+ URLs from a diverse set of servers.

<snip>

>>I saw Ken Krugle's email suggesting to increast the number of 
>>fetcher threads to 2000+ and set the maximal java thread stack size 
>>to 512k with -Xss.  Has anyone other than Ken tried this with 
>>success?  Wouldn't the JVM go crazy context switching between this 
>>many threads?

Note that most of the time these fetcher threads are all blocked, 
waiting for other threads that are already fetching from the same IP 
address. So there's not a lot of thrashing.

OG: I see.  But wouldn't that be true only in the case of more vertical crawls, i.e. crawls that don't have a very large and diverse set of hosts?
In other words, if you are doing a web-wide crawl, each of those 2000 fetcher threads is very likely to be assigned to a host/IP that is not currently being crawled by any other fetcher thread, no?
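
To make the blocking concrete, here is a simplified, hypothetical sketch of the kind of per-host gate the http plugin applies (this is not the actual HttpBase code; PolitenessGate is an invented name): a thread must wait until the host it drew is past its earliest-next-fetch time, so threads that drew the same host serialize while threads on distinct hosts proceed in parallel.

import java.util.HashMap;
import java.util.Map;

public class PolitenessGate {
  // host (or IP) -> earliest time, in millis, at which the next fetch may start
  private final Map<String, Long> nextFetchTime = new HashMap<String, Long>();

  /** Blocks the calling thread until it may fetch from the given host. */
  public synchronized void acquire(String host, long delayMillis)
      throws InterruptedException {
    long now = System.currentTimeMillis();
    Long next = nextFetchTime.get(host);
    while (next != null && now < next.longValue()) {
      // Threads that drew the same host park here.  wait() releases the
      // monitor, so threads working on other hosts are not held up.
      wait(next.longValue() - now);
      now = System.currentTimeMillis();
      next = nextFetchTime.get(host);
    }
    nextFetchTime.put(host, Long.valueOf(now + delayMillis));
  }
}

Under a web-wide crawl with many distinct hosts per fetch list, most acquire() calls would indeed return immediately (the scenario raised above); with a vertical crawl the map has few keys and most threads sit parked in wait(), which is the low-thrashing case Ken describes.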

>I been working with 512k -Xss (Ken Krugle's suggestion) and it works
>well. However number of fetcher for my part is 2500+.. I had to play
>around with this number to match my bandwidth limitation, but now I
>maximize my full bandwidth. But the problem that I run into are the
>fetcher threads hangs, and for crawl delay/robots.txt file (Please see
>Dennis Kubes posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems 
very promising, but not yet ready for prime time.

OG: yeah, I saw his email.  Kelvin worked on the same thing many months ago, pre-0.8, but it never made it into the trunk.  I'm looking forward to Stefan's code now.

Otis





Re: On fetcher slowness

Posted by Ken Krugler <kk...@transpac.com>.
>On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
>>Hello,
>>
>>Several people reported issues with slow fetcher in 0.8...
>>
>>I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>fetch speed didn't increase when I went from using 100 threads, to 
>>200 threads.  Has anyone else observed the same?
>>
>>I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>and the aggregate fetch speed was between 20 and 40 pages/sec. 
>>This was a fetch of 50K+ URLs from a diverse set of servers.
>>
>>While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of 
>>gettimeofday calls.  Running strace several times in a row kept 
>>showing that gettimeofday is the most frequent system call.
>>Has anyone tried tracing the fetcher process?  Where do these calls 
>>come from?  Any call to new Date() or Calendar.getInstance(), as 
>>must be done for every single logging call, perhaps?
>>
>>I can certainly be impolite and lower fetcher.server.delay to 1 
>>second or even 0, but I'd like to be polite.
>>
>>I saw Ken Krugle's email suggesting to increast the number of 
>>fetcher threads to 2000+ and set the maximal java thread stack size 
>>to 512k with -Xss.  Has anyone other than Ken tried this with 
>>success?  Wouldn't the JVM go crazy context switching between this 
>>many threads?

Note that most of the time these fetcher threads are all blocked, 
waiting for other threads that are already fetching from the same IP 
address. So there's not a lot of thrashing.

>I been working with 512k -Xss (Ken Krugle's suggestion) and it works
>well. However number of fetcher for my part is 2500+.. I had to play
>around with this number to match my bandwidth limitation, but now I
>maximize my full bandwidth. But the problem that I run into are the
>fetcher threads hangs, and for crawl delay/robots.txt file (Please see
>Dennis Kubes posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems 
very promising, but not yet ready for prime time.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: On fetcher slowness

Posted by Zaheed Haque <za...@gmail.com>.
On 8/12/06, ogjunk-nutch@yahoo.com <og...@yahoo.com> wrote:
> Hello,
>
> Several people reported issues with slow fetcher in 0.8...
>
> I run Nutch on a dual CPU (+HT) box, and have noticed that the fetch speed didn't increase when I went from using 100 threads, to 200 threads.  Has anyone else observed the same?
>
> I was using 2 map tasks (mapred.map.tasks property) in both cases, and the aggregate fetch speed was between 20 and 40 pages/sec.  This was a fetch of 50K+ URLs from a diverse set of servers.
>
> While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of gettimeofday calls.  Running strace several times in a row kept showing that gettimeofday is the most frequent system call.
> Has anyone tried tracing the fetcher process?  Where do these calls come from?  Any call to new Date() or Calendar.getInstance(), as must be done for every single logging call, perhaps?
>
> I can certainly be impolite and lower fetcher.server.delay to 1 second or even 0, but I'd like to be polite.
>
> I saw Ken Krugle's email suggesting to increast the number of fetcher threads to 2000+ and set the maximal java thread stack size to 512k with -Xss.  Has anyone other than Ken tried this with success?  Wouldn't the JVM go crazy context switching between this many threads?

I've been working with a 512k -Xss (Ken Krugler's suggestion) and it works
well. However, the number of fetcher threads in my case is 2500+. I had to play
around with this number to match my bandwidth limit, but now I
max out my full bandwidth. The problems that I run into are
fetcher threads hanging, and the crawl delay/robots.txt handling (please see
Dennis Kubes' posting on this).

 I've been testing Fetcher2 (NUTCH-339) with about 2M+ URLs and got
good results. I had some trouble in the beginning but it now works well.
Note this solves the crawl delay problem, but I still need to apply
the stack size changes.
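
For the stack size change, one way to apply it (assuming your Hadoop version supports mapred.child.java.opts for passing JVM flags to the child task processes; the values here are only illustrative) is via hadoop-site.xml:

<property>
 <name>mapred.child.java.opts</name>
 <value>-Xmx512m -Xss512k</value>
 <description>JVM options for the map/reduce child processes.  -Xss512k caps
 each fetcher thread's stack, so a few thousand threads fit in memory.</description>
</property>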

Cheers
Zaheed