Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/22 16:32:57 UTC

[jira] [Created] (NUTCH-1067) Configure minimum throughput for fetcher

Configure minimum throughput for fetcher
----------------------------------------

                 Key: NUTCH-1067
                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
             Project: Nutch
          Issue Type: New Feature
          Components: generator
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.4, 2.0
         Attachments: NUTCH-1067-1.4-1.patch

Large fetches can contain a lot of URLs for the same domain. These can be very slow to crawl due to politeness from robots.txt, e.g. 10s per URL. If all other URLs have been fetched, these queues can stall the entire fetcher; 60 URLs can then take 10 minutes or even more. This can usually be dealt with using the time bomb, but the time bomb value is hard to determine.

This patch adds a fetcher.throughput.threshold setting: the minimum number of pages per second below which the fetcher gives up. It doesn't use the global number of pages / running time but records the actual pages processed in the previous second. This value is compared with the configured threshold.

Besides the check, the fetcher's status is also updated with the actual number of pages per second and bytes per second.
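
The check described above could be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the class name ThroughputCheck, its methods, and the convention that a negative threshold disables the check are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch, not the NUTCH-1067 patch itself: counts the pages
 * processed since the last check and compares that count with a threshold.
 */
final class ThroughputCheck {
    // Minimum pages per second; a negative value disables the check
    // (an assumed convention for this sketch).
    private final int thresholdPagesPerSecond;
    private final AtomicInteger pagesSinceLastCheck = new AtomicInteger();

    ThroughputCheck(int thresholdPagesPerSecond) {
        this.thresholdPagesPerSecond = thresholdPagesPerSecond;
    }

    /** Called by a fetcher thread each time a page has been processed. */
    void pageFetched() {
        pagesSinceLastCheck.incrementAndGet();
    }

    /**
     * Meant to be called once per second by a reporting loop. Returns true
     * when the pages processed since the previous call fall below the
     * threshold. The counter is reset on every call, so the global average
     * (total pages / running time) is never used.
     */
    boolean belowThreshold() {
        int pagesLastSecond = pagesSinceLastCheck.getAndSet(0);
        return thresholdPagesPerSecond >= 0
            && pagesLastSecond < thresholdPagesPerSecond;
    }
}
```

Resetting the counter on each call is what makes this a measurement of the previous second only, so a fast start cannot mask a slow tail.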

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1067.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
                
> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070076#comment-13070076 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

There's a problem with the current patch: it usually reports 0 p/s at the start of the thread. At this stage numThreads downloads are in progress simultaneously. It is also possible to report 0 p/s during the fetch. The patch must be modified so as not to quit under these conditions.

It needs to:
* interact with the feeder
* have an additional threshold for the number of times that 0 p/s is reported

...and possibly more.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085773#comment-13085773 ] 

Julien Nioche commented on NUTCH-1067:
--------------------------------------

Markus - please assign this issue to me: that will serve as a reminder that I need to review it.

Thanks

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch

[jira] [Updated] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1067:
---------------------------------

    Attachment: NUTCH-1067-1.4-1.patch

Patch for 1.4. It has not been thoroughly tested yet. 

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097950#comment-13097950 ] 

Julien Nioche commented on NUTCH-1067:
--------------------------------------

see comments on NUTCH-1102
Patch for 1.4 looks fine
Thanks

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104442#comment-13104442 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

* Crawl and Benchmark both read the value and pass it to the fetcher. It's safe there to remove the argument.

* There's also a problem with TestFetcher.testFetch(). This, it seems, relies on the parse to work, as it passes TRUE to the fetcher but doesn't set the directive. I'll override the configuration directive to TRUE there.

* TestFetcher.testAgentNameCheck() for some reason sets the conf directive to FALSE but passes TRUE as argument.

All source code and tests now compile again. The fetcher tests also pass without errors.  I'll attach a patch now.
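
The change to the callers can be pictured with a small stand-in. This is a sketch under assumptions, not the committed code: FetcherSketch, the use of a plain Map for configuration, and the key name "fetcher.parse" are hypothetical; only the two-argument shape of fetch mirrors the fetch(Path,int) signature named in the compiler output quoted elsewhere in this thread.

```java
import java.util.Map;

/** Stand-in for the real Fetcher; only the shape of the call matters here. */
final class FetcherSketch {
    private final Map<String, String> conf;

    FetcherSketch(Map<String, String> conf) {
        this.conf = conf;
    }

    /** Reads the parsing flag from configuration (key name is an assumption). */
    boolean isParsing() {
        return Boolean.parseBoolean(conf.getOrDefault("fetcher.parse", "false"));
    }

    /**
     * Two-argument form: the caller no longer passes the parsing flag,
     * because the fetcher looks it up in its own configuration.
     */
    String fetch(String segment, int threads) {
        return segment + " threads=" + threads + " parsing=" + isParsing();
    }
}
```

With this shape a caller such as Crawl or Benchmark would simply call fetch(segment, threads), and a test that needs parsing would set the directive in the configuration rather than pass TRUE as an argument.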

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104454#comment-13104454 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

Committed fixes for NUTCH-1102 (originating issue) for 1.4 in rev. 1170557. Everything works again with a clean check out. My apologies for letting myself be fooled by not doing an ant clean more regularly.

Thanks Julien for being so prompt!

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Resolved] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1067.
----------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.0)
         Assignee: Markus Jelsma  (was: Julien Nioche)

Committed for 1.4 in rev. 1170526.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088677#comment-13088677 ] 

Julien Nioche commented on NUTCH-1067:
--------------------------------------

{quote}
    * this is going to be difficult because I measure the actual #pages/sec, that's always an integer, thoughts?
{quote}

OK, so this can't work when the number of pages per sec is < 1, which is an acceptable limitation as long as it is clearly stated in the comments for the parameter.
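
To make the limitation concrete, here is a small worked example using the numbers from the issue description (60 seconds, one page every 10 seconds from a single polite queue). The class and method names are hypothetical and exist only for this illustration.

```java
/** Hypothetical illustration of why integer per-second counts cannot express rates below 1. */
final class IntegerRateDemo {
    /** Per-second page counts for one queue that yields a page every delaySeconds. */
    static int[] perSecondCounts(int seconds, int delaySeconds) {
        int[] counts = new int[seconds];
        for (int t = 0; t < seconds; t++) {
            counts[t] = (t % delaySeconds == 0) ? 1 : 0;
        }
        return counts;
    }

    /** Number of one-second windows whose count is below the threshold. */
    static int windowsBelow(int[] counts, int threshold) {
        int below = 0;
        for (int c : counts) {
            if (c < threshold) {
                below++;
            }
        }
        return below;
    }
}
```

For perSecondCounts(60, 10) the average rate is 0.1 pages per second, but each one-second reading is either 0 or 1. A threshold of 1 is therefore missed in 54 of the 60 windows, while a threshold of 0 is never missed, so no integer setting can say "stop below 0.1 pages per second".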

{quote}
    * hasMore() method was added because I need to check outside the class if the feeder has more items
{quote}

I can't see in the patch where this call is made. Is it in some custom code of yours?

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch

[jira] [Updated] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1067:
---------------------------------

    Attachment: NUTCH-1067-1.4-4.patch

Agreed, here is a patch with the required modifications. I also moved the hasMore stuff back to its original state; it was added before I used isAlive. The conf has also been updated to reflect the changes.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085750#comment-13085750 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

The impact of this patch is too great to be committed without review, but I'd like to get it in some day :)

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch

[jira] [Updated] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1067:
---------------------------------

    Attachment: NUTCH-1067-1.4-2.patch

New patch that enables the check only when the feeder has finished and allows the threshold to be exceeded a configurable number of times.

There can be a significant number of exceptions due to the return statement used. It is probably clearer to clear the queues first.
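
The two conditions mentioned here (feeder finished, and a tolerated number of low readings) might be combined along these lines. This is a sketch under assumptions rather than the patch itself: the class ThroughputGuard and its parameters are hypothetical, and the choice to count low readings cumulatively without resetting is a guess, not something stated in this thread.

```java
/**
 * Hypothetical sketch: ignore low throughput while the feeder is still
 * running, and tolerate a configurable number of low readings before
 * telling the fetcher to stop.
 */
final class ThroughputGuard {
    private final int threshold;      // minimum pages per second
    private final int maxLowReadings; // tolerated number of low readings
    private int lowReadings = 0;

    ThroughputGuard(int threshold, int maxLowReadings) {
        this.threshold = threshold;
        this.maxLowReadings = maxLowReadings;
    }

    /**
     * Meant to be called once per second with the feeder state and the
     * number of pages processed in the previous second. Returns true when
     * the fetcher should give up.
     */
    boolean shouldStop(boolean feederAlive, int pagesLastSecond) {
        if (feederAlive || pagesLastSecond >= threshold) {
            return false; // still feeding, or throughput is acceptable
        }
        lowReadings++;    // cumulative count; never reset in this sketch
        return lowReadings > maxLowReadings;
    }
}
```

Skipping the check while the feeder is alive avoids quitting on the 0 p/s readings seen at start-up, and the tolerance avoids quitting on an isolated slow second later in the fetch.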

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108359#comment-13108359 ] 

Hudson commented on NUTCH-1067:
-------------------------------

Integrated in Nutch-branch-1.4 #11 (See [https://builds.apache.org/job/Nutch-branch-1.4/11/])
    NUTCH-1067 Nutch-default configuration directives missing

markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1172585
Files : 
* /nutch/branches/branch-1.4/conf/nutch-default.xml


> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch

[jira] [Updated] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1067:
---------------------------------

    Component/s:     (was: generator)
                 fetcher

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch

[jira] [Assigned] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1067:
------------------------------------

    Assignee: Julien Nioche  (was: Markus Jelsma)

Assigned to Julien for review. Cheers!

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch
>
>


[jira] [Reopened] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reopened NUTCH-1067:
----------------------------------


At revision 1170548.

ant clean then ant =>

compile-core:
    [javac] /data/nutch-1.4/build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 172 source files to /data/nutch-1.4/build/classes
    [javac] /data/nutch-1.4/src/java/org/apache/nutch/crawl/Crawl.java:136: fetch(org.apache.hadoop.fs.Path,int) in org.apache.nutch.fetcher.Fetcher cannot be applied to (org.apache.hadoop.fs.Path,int,boolean)
    [javac]       fetcher.fetch(segs[0], threads, org.apache.nutch.fetcher.Fetcher.isParsing(getConf()));  // fetch it
    [javac]              ^
    [javac] /data/nutch-1.4/src/java/org/apache/nutch/tools/Benchmark.java:234: fetch(org.apache.hadoop.fs.Path,int) in org.apache.nutch.fetcher.Fetcher cannot be applied to (org.apache.hadoop.fs.Path,int,boolean)
    [javac]       fetcher.fetch(segs[0], threads, org.apache.nutch.fetcher.Fetcher.isParsing(getConf()));  // fetch it
    [javac]              ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 2 errors

BUILD FAILED
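
The two errors have the same shape: the call sites in Crawl.java and Benchmark.java pass three arguments, while javac only finds fetch(Path,int). A minimal sketch of that mismatch follows, using a hypothetical stand-in class (FetcherSketch, with String in place of Path) rather than the real Fetcher. Supplying a three-argument overload is only one possible way out; updating the call sites is the other, and the log above does not say which applies.

```java
/** Hypothetical stand-in for the real Fetcher, reduced to the method shape in the error. */
class FetcherSketch {
  // Remembers the flag of the most recent call so the behaviour can be observed.
  boolean lastParsing = false;

  // The only signature javac reported: fetch(Path,int). String replaces Path here.
  void fetch(String segment, int threads) {
    fetch(segment, threads, false);
  }

  // With only the method above, a call such as fetch(seg, threads, true) fails with
  // "cannot be applied to (..., int, boolean)". This overload makes that call compile.
  void fetch(String segment, int threads, boolean parsing) {
    lastParsing = parsing;
  }
}
```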


> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>


[jira] [Updated] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1067:
---------------------------------

    Attachment: NUTCH-1067-1.4-3.patch

Another patch. It cleans the queue the same way the time bomb does and reports in a similar fashion if it kicks in. The cleaning code has moved to a new method that is shared by the time bomb and this check.

It has two configuration options:
- fetcher.throughput.threshold to enable/disable and set the minimum #pages/second
- fetcher.throughput.threshold.retries to set the number of times allowed to drop below the threshold to prevent a few accidental pauses from immediately killing the queue
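
How the two options could interact is sketched below. This is not the code from the patch: the class ThroughputCheck and its method names are hypothetical, -1 is assumed to mean "disabled", and the counter is assumed never to reset once throughput recovers.

```java
/** Hypothetical helper, not the actual Nutch code. */
final class ThroughputCheck {
  private final int thresholdPages; // fetcher.throughput.threshold; -1 assumed to disable
  private final int maxRetries;     // fetcher.throughput.threshold.retries
  private int timesBelow = 0;       // how often throughput dropped below the threshold

  ThroughputCheck(int thresholdPages, int maxRetries) {
    this.thresholdPages = thresholdPages;
    this.maxRetries = maxRetries;
  }

  /**
   * Call once per second with the number of pages processed in the previous second.
   * Returns true once throughput has dropped below the threshold more often than allowed.
   */
  boolean shouldGiveUp(int pagesLastSecond) {
    if (thresholdPages < 0) {
      return false; // check disabled
    }
    if (pagesLastSecond < thresholdPages) {
      timesBelow++;
    }
    return timesBelow > maxRetries;
  }
}
```

With a threshold of 5 and 2 retries, the first two slow seconds are tolerated and the third makes shouldGiveUp return true, which is the point where the remaining queues would be cleaned.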

It's tested in a production cluster and seems to work nicely: no more long, dreadful delays when finalizing a fetch.

Please comment on usefulness and implementation.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch
>
>


[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097925#comment-13097925 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

Julien, 

If there are no objections, I'd like to commit this issue together with NUTCH-1102 soon.

Cheers

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>


[jira] [Updated] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1067:
---------------------------------

    Attachment: NUTCH-1045-1.4-v2.patch

Patch to fix the issues reported by Julien plus the issues found in TestFetcher.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>


[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088661#comment-13088661 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

Thanks for your comments.

* modified the naming to use pages in conf and code as per your comment;
* this is going to be difficult because I measure the actual #pages/sec, which is always an integer. Thoughts?
* the hasMore() method was added because I need to check outside the class whether the feeder has more items. It was internal to the method QueueFeeder.run(); I could have made it a public attribute but chose a getter instead.
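
One possible way to reconcile an integer per-second measurement with a fractional threshold is to average the last few samples. The sketch below is only an illustration of that idea, not something the patch does; the class WindowedRate and the window size are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: average pages/second over the last N one-second samples. */
final class WindowedRate {
  private final Deque<Integer> samples = new ArrayDeque<>();
  private final int windowSeconds;
  private int sum = 0;

  WindowedRate(int windowSeconds) {
    if (windowSeconds <= 0) {
      throw new IllegalArgumentException("window must be positive");
    }
    this.windowSeconds = windowSeconds;
  }

  /** Records the integer page count of the previous second and returns the current average. */
  float record(int pagesLastSecond) {
    samples.addLast(pagesLastSecond);
    sum += pagesLastSecond;
    if (samples.size() > windowSeconds) {
      sum -= samples.removeFirst(); // drop the oldest sample from the running sum
    }
    return (float) sum / samples.size();
  }
}
```

With a four-second window, one page followed by three empty seconds averages 0.25 pages/second, a value that can be compared against a float threshold such as 0.5.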

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch
>
>


[jira] [Resolved] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1067.
----------------------------------

    Resolution: Fixed

Fixed again.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>


[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088602#comment-13088602 ] 

Julien Nioche commented on NUTCH-1067:
--------------------------------------

Looks good, but two comments:
- fetcher.throughput.threshold -> rename to 'fetcher.throughput.threshold.pages'? This way we could also introduce a threshold based on the bytes later.
- the threshold should not be an integer but a float -> for small crawls we could have less than one page per second but still want to use the threshold to prevent things from getting worse

Out of curiosity, why did you put hasMore() in a separate method?

Thanks

Ju

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch
>
>


[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "behnam nikbakht (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223157#comment-13223157 ] 

behnam nikbakht commented on NUTCH-1067:
----------------------------------------

I cannot understand why the threshold checker is disabled:
throughputThresholdPages = -1;
That causes this factor to be enforced only once.
                
> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>


[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104435#comment-13104435 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

*^#$*%@ i'm on it!

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>


[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097956#comment-13097956 ] 

Markus Jelsma commented on NUTCH-1067:
--------------------------------------

Thanks Julien. Depending on your new answer in NUTCH-1102 i'll put these in today.

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>
