Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2017/06/30 15:00:02 UTC
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1465:
---------------------------------
Attachment: NUTCH-1465.patch
Updated patch for trunk:
* added curly braces to some if statements; leaving them out always bites me at some point;
* added support for redirects: in hostdb mode a URL is built for URL filtering, but the actual protocol can be https instead, so the redirect has to be followed;
* added support for defaulting to /sitemap.xml, because some robots.txt files do not properly point to the sitemap;
* added support for NOT OVERWRITING existing CrawlDatum information and made it the default option; letting an external sitemap overwrite the interval is a very bad idea.
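
As a rough illustration of the last two points (this is not the code in the patch, only a sketch of the behaviour described), the logic could look something like the Python below. The function names `sitemap_urls` and `merge_fetch_interval`, the `overwrite` flag, and the use of seconds for intervals are assumptions made for the example; they are not Nutch APIs. Redirect handling is left out.

```python
from urllib.parse import urljoin


def sitemap_urls(host_url, robots_txt):
    """Return the sitemap URLs declared in robots.txt.

    Falls back to <host>/sitemap.xml when robots.txt declares none.
    Hypothetical helper, not part of Nutch.
    """
    declared = []
    for line in robots_txt.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split at the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            declared.append(value.strip())
    # No Sitemap directive found: assume the conventional location.
    return declared or [urljoin(host_url, "/sitemap.xml")]


def merge_fetch_interval(existing, from_sitemap, overwrite=False):
    """Pick the interval (in seconds, by assumption) to keep for a URL.

    By default an interval already stored for the URL wins; the sitemap
    value is used only for new URLs or when overwrite is requested.
    """
    if existing is None:
        return from_sitemap
    return from_sitemap if overwrite else existing
```

With these definitions, a robots.txt without a Sitemap line for http://example.org resolves to http://example.org/sitemap.xml, and a URL that already has an interval keeps it unless `overwrite=True` is passed.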
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)