Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2014/01/21 20:21:22 UTC

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

     [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1465:
-------------------------------

    Attachment: NUTCH-1465-trunk.v2.patch

Attaching NUTCH-1465-trunk.v2.patch, which implements *option (B)*: _Have separate job for the sitemap stuff and merge its output into the crawldb_

+I have covered both use cases in this patch:+
1. Users running a targeted crawl who want sitemaps injected from a list of sitemap URLs - the use case that [~wastl-nagel] pointed out.
2. Large open web crawls where users cannot afford to generate sitemap seeds for all hosts and want Nutch to inject sitemaps automatically.

+To try out this patch:+
1. Apply the patch for the HostDb feature (https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch)
2. Apply this patch (NUTCH-1465-trunk.v2.patch)
3. (optional) Add this to conf/log4j.properties at line 11:
{noformat}
log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
{noformat}
4. Run using:
{noformat}
bin/nutch org.apache.nutch.util.SitemapProcessor
{noformat}
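
The steps above can be sketched as one small shell script. This is only a sketch under assumptions that are not in the original message: both patch files are already downloaded into the current directory, they apply with strip level {{-p0}}, and {{bin/nutch}} is reachable from that same directory. The {{DRY_RUN}} switch and the {{run}} and {{try_sitemap_patch}} helpers are hypothetical names used for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. Assumptions (not from the original message): the patch files
# are in the current directory, they apply with -p0, and bin/nutch is
# reachable from here. A rebuild between patching and running is likely
# needed and is not shown.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

try_sitemap_patch() {
  run patch -p0 -i NUTCH-1325-trunk-v4.patch             # HostDb feature patch
  run patch -p0 -i NUTCH-1465-trunk.v2.patch             # sitemap patch
  run bin/nutch org.apache.nutch.util.SitemapProcessor   # run the sitemap job
}

try_sitemap_patch
```

Setting {{DRY_RUN=0}} in the environment makes the script execute the commands instead of printing them.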

I have started working on a *wiki page* describing this feature: https://wiki.apache.org/nutch/SitemapFeature 

Any suggestions and comments are welcome.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)