Posted to dev@nutch.apache.org by "Moreno Feltscher (JIRA)" <ji...@apache.org> on 2018/01/03 14:16:00 UTC

[jira] [Created] (NUTCH-2491) Integrate sitemap processing and HostDB into crawl script

Moreno Feltscher created NUTCH-2491:
---------------------------------------

             Summary: Integrate sitemap processing and HostDB into crawl script
                 Key: NUTCH-2491
                 URL: https://issues.apache.org/jira/browse/NUTCH-2491
             Project: Nutch
          Issue Type: Improvement
            Reporter: Moreno Feltscher
            Assignee: Moreno Feltscher
            Priority: Minor


Add three new steps to the crawl bash script:
1. Generate HostDB from CrawlDB
2. Inject URLs from the sitemaps found for hosts in the HostDB
3. If given, inject the sitemap URLs specified in one or more configuration files
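
A rough sketch of how these three steps might look as commands, not the actual patch for this issue. The function name sitemap_steps, the hostdb/crawldb directory layout, and the optional sitemap directory argument are assumptions made for illustration. The updatehostdb and sitemap subcommands and their flags should be checked against the usage output of the installed Nutch version. The function only prints the commands, so nothing is run against a real crawl.

```shell
#!/usr/bin/env bash
# Sketch only: names and paths are assumptions, not taken from the NUTCH-2491 patch.

# Print one command per proposed step.
#   $1 = crawl directory (assumed to hold crawldb/ and hostdb/)
#   $2 = optional directory containing files that list sitemap URLs
sitemap_steps() {
  local crawl_dir="$1"
  local sitemap_dir="${2:-}"

  # Step 1: generate (or update) the HostDB from the CrawlDB
  echo "bin/nutch updatehostdb -hostdb $crawl_dir/hostdb -crawldb $crawl_dir/crawldb"

  # Step 2: inject URLs from sitemaps found for hosts in the HostDB
  echo "bin/nutch sitemap $crawl_dir/crawldb -hostdb $crawl_dir/hostdb"

  # Step 3: only if a sitemap directory was given, inject those sitemap URLs
  if [ -n "$sitemap_dir" ]; then
    echo "bin/nutch sitemap $crawl_dir/crawldb -sitemapUrls $sitemap_dir"
  fi
}
```

Calling sitemap_steps crawl prints the first two commands; calling sitemap_steps crawl sitemaps adds the third. Printing rather than executing keeps the sketch safe to try outside a Nutch installation.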


