Posted to dev@nutch.apache.org by "byron miller (JIRA)" <ji...@apache.org> on 2005/12/29 21:00:00 UTC

[jira] Created: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

Process Sitemap data in text, rss or xml format as well as OAI-PMH
------------------------------------------------------------------

         Key: NUTCH-158
         URL: http://issues.apache.org/jira/browse/NUTCH-158
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.8-dev    
    Reporter: byron miller
    Priority: Minor


Add support to the fetcher to look for sitemap files, download them and process them into webdb.

Perhaps create a robots.txt directive that can be used to establish a standard way of declaring sitemaps in RSS, XML or text format (one URL per line), and process that.
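
As a rough illustration of the robots.txt idea, here is a minimal Java sketch. The directive name "Sitemap:" and the class SitemapHints are hypothetical names chosen for this example, not an existing Nutch API or an agreed convention. It pulls sitemap locations out of a robots.txt body and reads a text sitemap with one URL per line.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: the "Sitemap:" directive and this class are hypothetical. */
final class SitemapHints {

    private static final String DIRECTIVE = "Sitemap:"; // assumed directive name

    /** Returns the value of every "Sitemap:" line found in a robots.txt body. */
    static List<String> sitemapLocations(String robotsTxt) {
        List<String> locations = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively.
            if (trimmed.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String value = trimmed.substring(DIRECTIVE.length()).trim();
                if (!value.isEmpty()) {
                    locations.add(value);
                }
            }
        }
        return locations;
    }

    /** Reads a text sitemap, keeping only lines that start with http:// or https://. */
    static List<String> parseTextSitemap(String body) {
        List<String> urls = new ArrayList<>();
        for (String line : body.split("\\r?\\n")) {
            String url = line.trim();
            if (url.startsWith("http://") || url.startsWith("https://")) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```

The fetcher could call sitemapLocations() on a robots.txt it has already downloaded, fetch each location, and hand the result of parseTextSitemap() to whatever step adds URLs to the webdb. That last step is left out here because it depends on Nutch internals.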

I would love to see someone stomp on proprietary sitemap features, rather than making things as Google-specific as they are today :)

* RSS format/Atom Format (standard)
* XML meta description
* OAI-PMH meta description (http://www.openarchives.org/OAI/openarchivesprotocol.html)
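
For the RSS case in the list above, a sitemap reader might look roughly like the following Java sketch. It uses only the JDK's built-in DOM parser; the class name RssSitemapReader and the choice to read the first <link> inside each <item> are assumptions for illustration, not how parse-rss or any Nutch plugin works. Atom, generic XML and OAI-PMH would each need their own handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch only: a hypothetical helper, not part of Nutch. */
final class RssSitemapReader {

    /** Returns the text of the first <link> element inside every <item>. */
    static List<String> itemLinks(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so an untrusted feed cannot load external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> links = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList linkNodes = item.getElementsByTagName("link");
                if (linkNodes.getLength() > 0) {
                    String link = linkNodes.item(0).getTextContent().trim();
                    if (!link.isEmpty()) {
                        links.add(link);
                    }
                }
            }
            return links;
        } catch (Exception e) {
            // Parser configuration, I/O and malformed-XML errors all end up here.
            throw new IllegalArgumentException("Could not parse RSS sitemap", e);
        }
    }
}
```

The channel-level <link> is skipped on purpose, since it usually points at the site itself and not at an individual page.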

Perhaps even a "pre crawler" that will scour for these to inject into the web db to help build your link map so you could even just index topN.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

Posted by "raghavendra prabhu (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-158?page=comments#action_12365483 ] 

raghavendra prabhu commented on NUTCH-158:
------------------------------------------

This is an important feature.

We should be able to automatically insert the links parsed out of a sitemap into the webdb.

But currently, if we enable parse-rss and crawl these links, don't they get added?
