Posted to dev@nutch.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2013/01/28 04:33:12 UTC

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564019#comment-13564019 ] 

Ken Krugler commented on NUTCH-1465:
------------------------------------

Hi Tejas - I thought the current crawler-commons (CC) robots parsing code was already extracting the sitemap links. Or does the above comment ("modified the robots parsing code to extract the links to sitemap pages") refer to a change to the current Nutch robots parsing code?

I do remember thinking that the CC version would need to change to support multiple Sitemap links, even though it wasn't clear whether that was actually valid.

-- Ken
                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
