Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2009/11/23 12:05:39 UTC

[jira] Created: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Fetcher to skip queues for URLS getting repeated exceptions  
-------------------------------------------------------------

                 Key: NUTCH-769
                 URL: https://issues.apache.org/jira/browse/NUTCH-769
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Julien Nioche
            Priority: Minor


As discussed on the mailing list (see http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15360.html), this patch lets the Fetcher clear a URL queue once more than a set number of exceptions have been encountered in a row. This can speed up fetching substantially when target hosts are unresponsive (each attempt would throw a TimeoutException), and it limits cases where a whole fetch step is slowed down by a few queues.

By default, the parameter fetcher.max.exceptions.per.queue has a value of -1, which deactivates the check.
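
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class HostQueue and its methods are hypothetical names, the counter is reset on a successful fetch to model "in a row", and treating "reached the limit" (rather than "exceeded it") as the trigger is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class QueueExceptionThresholdSketch {

    /** Hypothetical stand-in for one per-host fetch queue. */
    public static final class HostQueue {
        private final Deque<String> pending = new ArrayDeque<>();
        private final int maxExceptions;       // -1 disables the check
        private int consecutiveExceptions = 0; // exceptions seen in a row

        public HostQueue(int maxExceptions) {
            this.maxExceptions = maxExceptions;
        }

        public void add(String url) {
            pending.addLast(url);
        }

        public int size() {
            return pending.size();
        }

        /** A successful fetch breaks the run of exceptions (assumption). */
        public void recordSuccess() {
            consecutiveExceptions = 0;
        }

        /**
         * Records one exception. Once the limit is reached, drops every
         * pending URL and returns how many were dropped; otherwise returns 0.
         */
        public int recordException() {
            consecutiveExceptions++;
            if (maxExceptions == -1 || consecutiveExceptions < maxExceptions) {
                return 0;
            }
            int dropped = pending.size();
            pending.clear();
            return dropped;
        }
    }

    public static void main(String[] args) {
        HostQueue queue = new HostQueue(2); // stands in for fetcher.max.exceptions.per.queue = 2
        queue.add("http://example.com/a");
        queue.add("http://example.com/b");
        queue.add("http://example.com/c");
        System.out.println("dropped after 1st exception: " + queue.recordException());
        System.out.println("dropped after 2nd exception: " + queue.recordException());
        System.out.println("URLs still pending: " + queue.size());
    }
}
```

With a limit of -1 the method never drops anything, which mirrors the deactivated default.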

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783245#action_12783245 ] 

Andrzej Bialecki  commented on NUTCH-769:
-----------------------------------------

The patch contains a new method, checkExceptionThreshold, which seems to do the right thing, but this method is never used in Fetcher. I think the idea was to call it in FetchItemQueues.finishItem()?




[jira] Closed: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-769.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Andrzej Bialecki 




[jira] Updated: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-769:
--------------------------------

    Attachment: NUTCH-769-2.patch




[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784260#action_12784260 ] 

Andrzej Bialecki  commented on NUTCH-769:
-----------------------------------------

I had to apply this patch by hand, due to NUTCH-770. I also added conf/nutch-default.xml documentation. This was committed in rev. 885785 - thanks!




[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783247#action_12783247 ] 

Julien Nioche commented on NUTCH-769:
-------------------------------------

Indeed, I missed a couple of lines when I was untangling this functionality from my (heavily modified) local copy.
checkExceptionThreshold is called after line 664:

              case ProtocolStatus.EXCEPTION:
                logError(fit.url, status.getMessage());
                int killedURLs = fetchQueues.checkExceptionThreshold(fit.getQueueID());
                reporter.incrCounter("FetcherStatus", "Exceptions", killedURLs);

I'll attach a modified version of the patch.
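
For readers without the Fetcher source at hand, the call site above can be approximated in a self-contained form. This is only a sketch under assumptions: Queues, Status, handle and the counter map are hypothetical stand-ins for the real Fetcher classes and the Hadoop reporter, and the per-queue bookkeeping shown here is not taken from the patch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class ExceptionCallSiteSketch {

    public enum Status { SUCCESS, EXCEPTION }

    /** Hypothetical stand-in for the fetcher's set of per-host queues. */
    public static final class Queues {
        private final Map<String, Deque<String>> pending = new HashMap<>();
        private final Map<String, Integer> exceptionCounts = new HashMap<>();
        private final int maxExceptionsPerQueue; // -1 disables the check

        public Queues(int maxExceptionsPerQueue) {
            this.maxExceptionsPerQueue = maxExceptionsPerQueue;
        }

        public void add(String queueId, String url) {
            pending.computeIfAbsent(queueId, k -> new ArrayDeque<>()).addLast(url);
        }

        public int pendingCount(String queueId) {
            Deque<String> queue = pending.get(queueId);
            return queue == null ? 0 : queue.size();
        }

        /** Forgets earlier exceptions for this queue (assumed reset rule). */
        public void recordSuccess(String queueId) {
            exceptionCounts.remove(queueId);
        }

        /** Counts one exception; returns the number of URLs dropped, if any. */
        public int checkExceptionThreshold(String queueId) {
            Deque<String> queue = pending.get(queueId);
            if (queue == null || queue.isEmpty()) {
                return 0;
            }
            int count = exceptionCounts.merge(queueId, 1, Integer::sum);
            if (maxExceptionsPerQueue == -1 || count < maxExceptionsPerQueue) {
                return 0;
            }
            int killed = queue.size();
            queue.clear();
            return killed;
        }
    }

    /** Mirrors the shape of the switch branch quoted above. */
    public static int handle(Status status, String queueId, Queues queues,
                             Map<String, Integer> counters) {
        switch (status) {
            case EXCEPTION:
                int killedURLs = queues.checkExceptionThreshold(queueId);
                counters.merge("Exceptions", killedURLs, Integer::sum);
                return killedURLs;
            case SUCCESS:
                queues.recordSuccess(queueId);
                return 0;
            default:
                return 0;
        }
    }

    public static void main(String[] args) {
        Queues queues = new Queues(2);
        Map<String, Integer> counters = new HashMap<>();
        queues.add("host-a", "http://host-a.example/1");
        queues.add("host-a", "http://host-a.example/2");
        handle(Status.EXCEPTION, "host-a", queues, counters);
        handle(Status.EXCEPTION, "host-a", queues, counters);
        System.out.println("counter: " + counters.get("Exceptions"));
        System.out.println("pending for host-a: " + queues.pendingCount("host-a"));
    }
}
```

As in the quoted lines, the counter is incremented by the number of URLs that were dropped, so it stays unchanged until a queue is actually cleared.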

Thanks

J.




[jira] Updated: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-769:
--------------------------------

    Attachment: NUTCH-769.patch

