Posted to dev@nutch.apache.org by "behnam nikbakht (Created) (JIRA)" <ji...@apache.org> on 2012/04/11 09:38:52 UTC

[jira] [Created] (NUTCH-1331) limit crawler to defined depth

limit crawler to defined depth
------------------------------

                 Key: NUTCH-1331
                 URL: https://issues.apache.org/jira/browse/NUTCH-1331
             Project: Nutch
          Issue Type: New Feature
          Components: generator, parser, storage
    Affects Versions: 1.4
            Reporter: behnam nikbakht


There is a need to limit the crawler to a defined depth. The importance of this option is that it avoids crawling infinite loops of dynamically generated URLs, which occur on some sites, and helps the crawler select the important URLs first.
One option is to put an iteration limit on the generate/fetch/parse/updatedb cycle, but that works only if every cycle fetches all of the unfetched URLs (without recrawling them, and with some other considerations).
Instead, we can define a new parameter in CrawlDatum, named depth, compute the depth of each link after parsing (much as the OPIC scoring algorithm computes scores), and have the generate step select only URLs within the valid depth.
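As a rough sketch of that proposal (not code from the attached patches): the depth could be kept in the CrawlDatum metadata and checked before a URL is selected for generation. The key name "_depth_" and the helper class below are hypothetical, for illustration only.

    // Hypothetical illustration of the proposal above: keep a depth counter in
    // the CrawlDatum metadata and test it before selecting a URL for generation.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class DepthCheck {
      // "_depth_" is an illustrative key name, not one defined by Nutch 1.4
      public static final Text DEPTH_KEY = new Text("_depth_");

      /** Depth recorded for this URL; seeds without metadata count as depth 0. */
      public static int getDepth(CrawlDatum datum) {
        IntWritable d = (IntWritable) datum.getMetaData().get(DEPTH_KEY);
        return d == null ? 0 : d.get();
      }

      /** True if the URL may still be selected under the configured limit. */
      public static boolean withinLimit(CrawlDatum datum, int maxDepth) {
        return getDepth(datum) <= maxDepth;
      }
    }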


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1331) limit crawler to defined depth

Posted by "Julien Nioche (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1331:
---------------------------------

    Attachment: NUTCH-1331-v2.patch

Attached is an implementation of what I described earlier. It has been generously donated by www.ant.com.

It allows the depth of a URL to be tracked and its outlinks removed based on a global setting or a per-seed limit.
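For readers without the patch handy, here is a minimal sketch of the general approach under stated assumptions: a scoring filter passes an incremented depth to each outlink and removes outlinks beyond the limit, where the limit comes either from a global property or from per-seed metadata. The names used below ("_depth_", "_maxdepth_", "scoring.depth.max") are illustrative and may not match NUTCH-1331-v2.patch.

    // A sketch of the depth-limiting approach, not the patch itself. A real
    // plugin would implement the full org.apache.nutch.scoring.ScoringFilter
    // interface; only the method that prunes outlinks is shown.
    import java.util.Collection;
    import java.util.Iterator;
    import java.util.Map.Entry;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;

    public class DepthLimitFilter extends Configured {
      private static final Text DEPTH_KEY = new Text("_depth_");        // assumed key
      private static final Text MAX_DEPTH_KEY = new Text("_maxdepth_"); // assumed key
      private int defaultMaxDepth = 3;

      @Override
      public void setConf(Configuration conf) {
        super.setConf(conf);
        if (conf != null) {
          // global default; a per-seed value in the metadata overrides it
          defaultMaxDepth = conf.getInt("scoring.depth.max", 3); // assumed name
        }
      }

      /** Called during parse/updatedb: pass an incremented depth to every
       *  outlink and drop outlinks that would exceed the limit. Assumes the
       *  parent's depth was copied into the parse metadata earlier in the
       *  cycle (by the filter's passScoreBeforeParsing/AfterParsing methods,
       *  omitted here). */
      public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
          Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount) {
        String d = parseData.getMeta(DEPTH_KEY.toString());
        int parentDepth = d == null ? 0 : Integer.parseInt(d);
        String m = parseData.getMeta(MAX_DEPTH_KEY.toString());
        int max = m == null ? defaultMaxDepth : Integer.parseInt(m);

        for (Iterator<Entry<Text, CrawlDatum>> it = targets.iterator(); it.hasNext();) {
          CrawlDatum target = it.next().getValue();
          if (parentDepth + 1 > max) {
            it.remove(); // too deep: this outlink never enters the CrawlDb
          } else {
            target.getMetaData().put(DEPTH_KEY, new IntWritable(parentDepth + 1));
            target.getMetaData().put(MAX_DEPTH_KEY, new IntWritable(max));
          }
        }
        return adjust;
      }
    }

With something like this, a seed could carry its own limit via injected metadata, while all other URLs fall back to the global property.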

 
                

[jira] [Commented] (NUTCH-1331) limit crawler to defined depth

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251684#comment-13251684 ] 

Julien Nioche commented on NUTCH-1331:
--------------------------------------

This can be done with the ScoringFilters alone, without modifying the Nutch Generator or CrawlDatum classes. The idea would also be to allow setting the max depth on a per-seed basis or via a global default value (in nutch-site.xml). The URLs could also be prioritised during generation based on their depth. I'd do that in a separate ScoringFilter instead of modifying the default one, though.
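As a hedged sketch of the prioritisation idea: a separate ScoringFilter could scale the generator sort value by depth so that shallow URLs are generated first. The "_depth_" key matches the sketches above and is an assumed name, not one defined by Nutch 1.4.

    // Illustrative only: favour shallow URLs at generate time. A real filter
    // would implement org.apache.nutch.scoring.ScoringFilter; only the
    // relevant method is sketched here.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class DepthPriorityFilter {
      private static final Text DEPTH_KEY = new Text("_depth_"); // assumed key

      /** Divide the sort value by (depth + 1): seeds keep their full score,
       *  deeper URLs sink towards the end of the generated segment. */
      public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
        IntWritable d = (IntWritable) datum.getMetaData().get(DEPTH_KEY);
        int depth = d == null ? 0 : d.get();
        return initSort / (depth + 1);
      }
    }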
                

[jira] [Updated] (NUTCH-1331) limit crawler to defined depth

Posted by "behnam nikbakht (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1331:
-----------------------------------

    Attachment: NUTCH-1331.patch
    