Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2013/12/15 09:18:08 UTC

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561 ] 

Tejas Patil commented on NUTCH-1465:
------------------------------------

Revisited this Jira after a long time and gave some thought to how this can be done cleanly. There are two ways to implement this:

*(A) Do the sitemap stuff in the fetch phase of the nutch cycle.*
This was my original approach, which the (in-progress) patch addresses. It would involve tweaking core nutch classes in several places.

Pros:
- Sitemaps are nothing but normal pages with several outlinks, so they fit well into the 'fetch' cycle.

Cons:
- Sitemaps can be huge. Fetching them needs larger size and time limits, so the fetch code must special-case sitemap urls and apply custom limits => leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps (like lastmod and changefreq). We could modify it to hold this information too, but that info is sitemap-specific and we would end up making all outlinks carry it. Alternatively, we could create a special type of outlink to take care of this (see the sketch below).
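
For the record, here is roughly what that special outlink could look like. This is a minimal sketch only: SitemapOutlink is a hypothetical class (not in Nutch today), and the field names, defaults and serialization layout are illustrative assumptions on top of the existing Outlink writable.

{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.MalformedURLException;

import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;

/** Hypothetical outlink subtype carrying the sitemap-only fields. */
public class SitemapOutlink extends Outlink {

  private long lastModified;  // <lastmod> as epoch millis, 0 if absent
  private String changeFreq;  // <changefreq>: "daily", "weekly", ...
  private float priority;     // <priority>: 0.0 - 1.0

  public SitemapOutlink() {
    super();                  // no-arg constructor for Writable deserialization
  }

  public SitemapOutlink(String toUrl, long lastModified, String changeFreq,
      float priority) throws MalformedURLException {
    super(toUrl, "");         // sitemap entries carry no anchor text
    this.lastModified = lastModified;
    this.changeFreq = changeFreq;
    this.priority = priority;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);
    out.writeLong(lastModified);
    Text.writeString(out, changeFreq);
    out.writeFloat(priority);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    lastModified = in.readLong();
    changeFreq = Text.readString(in);
    priority = in.readFloat();
  }
}
{code}

Plain outlinks would stay untouched; only the sitemap code path would create these, so the fetch phase would still need the special-casing mentioned above.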

*(B) Have a separate job for the sitemap stuff and merge its output into the crawldb.*
i. User populates a list of hosts (or uses HostDB from NUTCH-1325). Now we have all the hosts to be processed.
ii. Run a map-reduce job (see the sketch after this list): for each host,
          - get the robots page and extract sitemap urls,
          - get the xml content of these sitemap pages,
          - create crawl datums with the required info and write them to a sitemapDB

iii. Use the CrawlDbMerger utility to merge the sitemapDB and the crawldb
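
To make step ii concrete, here is a rough sketch of the mapper, assuming crawler-commons for the robots.txt and sitemap parsing (the successor of the sitemap-parser code referenced in this issue). The class name, the one-host-per-line input, the "injected" status, the weekly fetch interval and the naive java.net fetch are all illustrative assumptions; a real job would fetch through Nutch's protocol layer with the larger limits discussed above.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapIndex;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;
import crawlercommons.sitemaps.UnknownFormatException;

/** Step ii: one host per input line; emits <url, CrawlDatum> pairs for a sitemapDB. */
public class SitemapMapper extends Mapper<LongWritable, Text, Text, CrawlDatum> {

  private final SiteMapParser sitemapParser = new SiteMapParser();
  private final SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String host = value.toString().trim();
    String robotsUrl = "http://" + host + "/robots.txt";
    BaseRobotRules rules = robotsParser.parseContent(robotsUrl,
        fetch(new URL(robotsUrl)), "text/plain", "nutch");

    // robots.txt lists zero or more "Sitemap:" directives
    for (String sitemapUrl : rules.getSitemaps()) {
      parseAndEmit(new URL(sitemapUrl), context);
    }
  }

  /** Parses a sitemap (recursing into sitemap indexes) and emits one datum per url. */
  private void parseAndEmit(URL url, Context context)
      throws IOException, InterruptedException {
    AbstractSiteMap sm;
    try {
      sm = sitemapParser.parseSiteMap(fetch(url), url);
    } catch (UnknownFormatException e) {
      return; // not a sitemap format we understand; skip
    }
    if (sm.isIndex()) {
      for (AbstractSiteMap sub : ((SiteMapIndex) sm).getSitemaps()) {
        parseAndEmit(sub.getUrl(), context);
      }
    } else {
      for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
        // hypothetical defaults: "injected" status, weekly fetch interval
        CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, 7 * 24 * 3600);
        if (u.getLastModified() != null) {
          datum.setModifiedTime(u.getLastModified().getTime());
        }
        datum.setScore((float) u.getPriority()); // sitemap <priority>, 0.0 - 1.0
        context.write(new Text(u.getUrl().toString()), datum);
      }
    }
  }

  /** Naive fetch; real code would go through Nutch's protocol layer and enforce limits. */
  private byte[] fetch(URL url) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (InputStream in = url.openStream()) {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1;) {
        out.write(buf, 0, n);
      }
    }
    return out.toByteArray();
  }
}
{code}

For step iii, the existing merge tool should be enough, e.g. something like: bin/nutch mergedb merged_crawldb crawldb sitemapdb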

Pros:
- Cleaner code. 
- Users have control over when to perform sitemap extraction. This is better than (A), wherein sitemap urls sit in the crawldb and get fetched along with normal pages (thus eating into the time of every fetch phase). We could have a sitemap_frequency setting used inside the crawl script so that users can say: after 'x' nutch cycles, run sitemap processing.

Cons:
- Additional map-reduce jobs are needed. I think that this is reasonable: running the sitemap job 1-5 times a month on a production-level crawl should work out well.

I am inclined towards implementing (B).

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)