Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2013/01/28 21:03:13 UTC

[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564601#comment-13564601 ] 

Tejas Patil edited comment on NUTCH-1465 at 1/28/13 8:02 PM:
-------------------------------------------------------------

Hi Sebastian,

By "for a given host, sitemaps are processed just once" I meant that, within the same round, the processing is done only once for a given host. I agree with you that, across rounds, a sitemap would end up being fetched and processed every cycle for every host. The SitemapInjector idea is good.

The way I see this, "SitemapInjector" will be:
- A separate map-reduce job.
- Responsible for fetching the sitemap location(s) from the robots.txt file, getting the sitemap file(s), and adding the URLs (along with the crawl frequency and other metadata) from the sitemap to the crawldb (see the sketch after this list).
- For large web crawls, we don't want to run this job in every Nutch cycle. Also, new hosts will be discovered along the way whose sitemaps need to be added to the crawldb, and for hosts whose sitemaps were already processed, a new sitemap location may have been added to the robots.txt file since then. So have a "sitemapFrequency" param for the crawl script, e.g. if sitemapFrequency=10, the sitemap job will be invoked every 10 cycles of the Nutch crawl (1st cycle, 11th cycle, 21st cycle and so on).
- Users can also run this job in standalone fashion on a crawldb.
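
To make this concrete, here is a minimal, self-contained sketch (illustration only, not the actual patch) of the per-host work the injector's map task would do: read robots.txt, collect the "Sitemap:" directives, fetch each sitemap, and pull out the <loc> URLs together with <changefreq>. All class and method names are made up for discussion, the XML handling is deliberately reduced to regexes (a real implementation would use a proper sitemap parser and handle gzipped sitemaps and sitemap indexes), and it just prints what the real job would turn into crawldb entries. It also shows the sitemapFrequency check as a tiny helper: with sitemapFrequency=10, the job fires in cycles 1, 11, 21, and so on.

{code:java}
// Illustration only: names and structure are hypothetical, not the NUTCH-1465 patch.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SitemapInjectorSketch {

  /** Read the host's robots.txt and collect its "Sitemap:" directives. */
  static List<String> sitemapLocations(String host) throws Exception {
    List<String> locations = new ArrayList<String>();
    URL robots = new URL("http://" + host + "/robots.txt");
    BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      if (line.toLowerCase().startsWith("sitemap:")) {
        locations.add(line.substring("sitemap:".length()).trim());
      }
    }
    in.close();
    return locations;
  }

  /** Fetch one sitemap and emit (url, changefreq); the real job would write crawldb entries. */
  static void processSitemap(String sitemapUrl) throws Exception {
    StringBuilder xml = new StringBuilder();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(sitemapUrl).openStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      xml.append(line).append('\n');
    }
    in.close();

    Pattern urlBlock = Pattern.compile("<url>(.*?)</url>", Pattern.DOTALL);
    Pattern loc = Pattern.compile("<loc>(.*?)</loc>", Pattern.DOTALL);
    Pattern freq = Pattern.compile("<changefreq>(.*?)</changefreq>", Pattern.DOTALL);
    Matcher block = urlBlock.matcher(xml);
    while (block.find()) {
      String entry = block.group(1);
      Matcher l = loc.matcher(entry);
      if (!l.find()) continue;
      Matcher f = freq.matcher(entry);
      String changeFreq = f.find() ? f.group(1).trim() : "unknown";
      // The real injector would build a CrawlDatum here (status injected, fetch
      // interval derived from changefreq) and merge it into the crawldb.
      System.out.println(l.group(1).trim() + "\tchangefreq=" + changeFreq);
    }
  }

  /** Scheduling check for the crawl script: with sitemapFrequency=10 the job
   *  runs in cycles 1, 11, 21, ... (cycle numbering starts at 1). */
  static boolean shouldRunSitemapJob(int cycle, int sitemapFrequency) {
    return (cycle - 1) % sitemapFrequency == 0;
  }

  public static void main(String[] args) throws Exception {
    String host = args.length > 0 ? args[0] : "example.com";
    for (String location : sitemapLocations(host)) {
      processSitemap(location);
    }
  }
}
{code}

Presumably the map output would be <url, CrawlDatum> pairs and the merge against the existing crawldb would happen on the reduce side, much like the normal Injector does.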

What do you say?
                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
