Posted to dev@nutch.apache.org by "behnam nikbakht (Created) (JIRA)" <ji...@apache.org> on 2012/02/21 08:59:32 UTC

[jira] [Created] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages

Generator should not generate filter and not found and denied and gone and permanently moved pages
--------------------------------------------------------------------------------------------------

                 Key: NUTCH-1288
                 URL: https://issues.apache.org/jira/browse/NUTCH-1288
             Project: Nutch
          Issue Type: Bug
          Components: fetcher, generator
    Affects Versions: 1.4
            Reporter: behnam nikbakht


The Generator should not generate filtered, not-found, denied, gone, or permanently moved pages.
In the shouldFetch method of AbstractFetchSchedule, the CrawlDatum must be checked against special fetch states such as not found, so that such pages are not generated again.
To do this, we can add a status to CrawlDatum that marks invalid URLs, and set this status during fetch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1288:
-----------------------------------

    Attachment: NUTCH-1288.patch
    


        

[jira] [Resolved] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages

Posted by "Julien Nioche (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1288.
----------------------------------

    Resolution: Invalid

This is not the right way to do it. If you don't want to retry such pages, implement a custom fetch schedule - don't hack AbstractFetchSchedule as your patch does.
Hardcoding the schedule policy forces people to use Nutch the way you want to use it, which is not a good idea. Moreover, your patch removes useful information about the status of a page and replaces it with a more generic value of dubious use.
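
To illustrate the design point above, here is a minimal, self-contained sketch of keeping the "never retry dead pages" rule in an opt-in policy class rather than in the shared base behaviour. Every name in it (PageStatus, PageRecord, FetchPolicy, DueTimePolicy, SkipDeadPagesPolicy) is a hypothetical stand-in invented for this example. None of them is a Nutch class, and the sketch does not reproduce the attached patch or the actual AbstractFetchSchedule code.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for a crawl status; these are NOT Nutch constants.
enum PageStatus { UNFETCHED, FETCHED, NOT_FOUND, DENIED, GONE, MOVED_PERMANENTLY }

// Hypothetical stand-in for a crawl database entry. The specific status is
// kept as-is, so no information about the page is collapsed into a generic value.
final class PageRecord {
    final PageStatus status;
    final long nextFetchTimeMs;

    PageRecord(PageStatus status, long nextFetchTimeMs) {
        this.status = status;
        this.nextFetchTimeMs = nextFetchTimeMs;
    }
}

// The policy hook: decides whether a page is eligible for the next fetch list.
interface FetchPolicy {
    boolean shouldFetch(PageRecord page, long nowMs);
}

// Shared default: any page is eligible once its next fetch time has passed.
// It stays policy-neutral, so other users keep the retry behaviour they expect.
class DueTimePolicy implements FetchPolicy {
    @Override
    public boolean shouldFetch(PageRecord page, long nowMs) {
        return page.nextFetchTimeMs <= nowMs;
    }
}

// Opt-in custom policy: never re-generate pages in a "dead" state.
// Only deployments that choose this class get the stricter behaviour.
class SkipDeadPagesPolicy extends DueTimePolicy {
    private static final Set<PageStatus> DEAD = EnumSet.of(
            PageStatus.NOT_FOUND,
            PageStatus.DENIED,
            PageStatus.GONE,
            PageStatus.MOVED_PERMANENTLY);

    @Override
    public boolean shouldFetch(PageRecord page, long nowMs) {
        return !DEAD.contains(page.status) && super.shouldFetch(page, nowMs);
    }
}
```

With this shape, a page marked GONE is still offered for refetch by DueTimePolicy once it is due, while SkipDeadPagesPolicy excludes it. In Nutch itself the equivalent step would presumably be a subclass of an existing fetch schedule, selected through configuration (the db.fetch.schedule.class property, if your version provides it), so check the release you are running for the exact class and property names.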
                
