Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2012/12/21 12:37:13 UTC

[jira] [Resolved] (NUTCH-1331) limit crawler to defined depth

     [ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1331.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7

Thanks Markus

Committed in revision 1424875 for trunk; a separate issue has been opened for porting to 2.x.

The new property is documented in nutch-default.xml:

{quote}
<property>
  <name>scoring.depth.max</name>
  <value>1000</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as seed metadata. This plugin adds a "_depth_" metadatum to each page
  to track its distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>
{quote}
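
For example, the per-seed override can be supplied as metadata in the seed list at inject time (the Nutch 1.x injector reads tab-separated name=value pairs after the URL). The URL and the value below are invented for illustration:

{code}
http://example.com/	_maxdepth_=3
{code}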
                
> limit crawler to defined depth
> ------------------------------
>
>                 Key: NUTCH-1331
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1331
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator, parser, storage
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>             Fix For: 1.7
>
>         Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch
>
>
> There is a need to limit the crawler to a defined depth. This option is important for avoiding the infinite crawl loops caused by dynamically generated URLs on some sites, and for helping the crawler select important URLs.
> One option is to define an iteration limit on the generate/fetch/parse/updatedb cycle, but this works only if, in each cycle, all unfetched URLs become fetched (without recrawling them, and with some other considerations).
> Instead, we can define a new parameter in CrawlDatum, named depth, compute the depth of each link after parsing (as the score-opic algorithm does for scores), and, in the generate step, select only URLs with a valid depth.
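
The core mechanics of that proposal can be sketched in a few lines. This is an illustration only, not the committed plugin: Nutch's CrawlDatum metadata is stubbed with plain Java maps so the example compiles standalone, and the class and method names (DepthLimitSketch, outlinkMeta, eligibleForGeneration) are hypothetical.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class DepthLimitSketch {
  static final String DEPTH_KEY = "_depth_";        // distance from the seed
  static final String MAX_DEPTH_KEY = "_maxdepth_"; // per-seed override
  static final int DEFAULT_MAX_DEPTH = 1000;        // mirrors scoring.depth.max

  // Inject: seeds start at depth 1; any per-seed "_maxdepth_" stays as supplied.
  static void injectedScore(Map<String, String> seedMeta) {
    seedMeta.putIfAbsent(DEPTH_KEY, "1");
  }

  // After parse: each outlink gets the parent's depth + 1 and inherits the override.
  static Map<String, String> outlinkMeta(Map<String, String> parentMeta) {
    Map<String, String> child = new HashMap<>();
    int parentDepth = Integer.parseInt(parentMeta.getOrDefault(DEPTH_KEY, "1"));
    child.put(DEPTH_KEY, String.valueOf(parentDepth + 1));
    if (parentMeta.containsKey(MAX_DEPTH_KEY)) {
      child.put(MAX_DEPTH_KEY, parentMeta.get(MAX_DEPTH_KEY));
    }
    return child;
  }

  // Generate: URLs past their limit are skipped; the rest can be sorted
  // shallowest-first so pages close to the seed are fetched earlier.
  static boolean eligibleForGeneration(Map<String, String> meta) {
    int depth = Integer.parseInt(meta.getOrDefault(DEPTH_KEY, "1"));
    int max = meta.containsKey(MAX_DEPTH_KEY)
        ? Integer.parseInt(meta.get(MAX_DEPTH_KEY))
        : DEFAULT_MAX_DEPTH;
    return depth <= max;
  }
}
{code}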

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira