Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/05/21 00:49:45 UTC

[Nutch Wiki] Trivial Update of "GoogleSummerOfCode/SitemapCrawler" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=1&rev2=2

+ <<TableOfContents(4)>>
+ 
  == Abstract ==
  
  In the Nutch crawler, URLs can be obtained only from pages that were fetched previously. This method is expensive. In addition, the importance and “change frequency” of these URLs are not known, only guessed. However, it is possible to find all of a site's URLs in an up-to-date sitemap file. For this reason, the sitemap files of a website should be crawled. With this development, Nutch will gain support for crawling sitemaps.