Posted to user@nutch.apache.org by Saurabh Suman <sa...@rediff.com> on 2009/07/08 08:24:41 UTC

How to Parse Rss Feed URL

Hi,
I want to parse a feed URL using Nutch. I tried to use the
org.apache.nutch.parse.feed.FeedParser class. Its input is XML, so I gave it
the link below:
http://timesofindia.indiatimes.com/rssfeedsdefault.cms
This URL lists all the RSS feeds for the newspaper. When I parsed it with the
Rome feed parser, I got the permalink, title, date, etc., but the Nutch parser
does not return anything.
How can I get the permalinks, titles, and dates from this URL?

-- 
View this message in context: http://www.nabble.com/How-to-Parse-Rss-Feed-URL-tp24386051p24386051.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How to Parse Rss Feed URL

Posted by Saurabh Suman <sa...@rediff.com>.
When I use org.apache.nutch.parse.rss.RSSParser, it works fine, and I am now
getting the URLs. Next I want to get the content. How do I do this? Do I need
to add all the URLs to the crawldb and then run the crawl command, or is there
another way?



Re: How to Parse Rss Feed URL

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Jul 8, 2009 at 09:24, Saurabh Suman <sa...@rediff.com> wrote:

>
> Hi,
> I want to parse a feed URL using Nutch. I tried to use the
> org.apache.nutch.parse.feed.FeedParser class. Its input is XML, so I gave
> it the link below:
> http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> This URL lists all the RSS feeds for the newspaper. When I parsed it with
> the Rome feed parser, I got the permalink, title, date, etc., but the Nutch
> parser does not return anything.
> How can I get the permalinks, titles, and dates from this URL?
>


In conf/parse-plugins.xml:

        <mimeType name="text/xml">
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
                <plugin id="feed" />
        </mimeType>

The URL you mentioned has a text/xml content type. Since you probably also
have parse-html defined in your conf file, and it is listed before "feed" for
text/xml, parse-html tries to parse the feeds.
Try moving the "feed" plugin higher, like so:

        <mimeType name="text/xml">
                <plugin id="feed" />
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
        </mimeType>
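
To double-check which parser is listed first for a given content type, a small
script along these lines might help. This is only a sketch and not part of
Nutch: the plugin_order helper and the SAMPLE string are made up for
illustration, and the <parse-plugins> root element is an assumption about the
file layout. In practice you would read your own conf/parse-plugins.xml
instead of SAMPLE.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the reordered fragment above.
# The <parse-plugins> root element is an assumption about the real file.
SAMPLE = """
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="feed" />
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
  </mimeType>
</parse-plugins>
"""


def plugin_order(xml_text, mime_type):
    """Return the plugin ids listed for mime_type, in document order."""
    root = ET.fromstring(xml_text)
    # iter() searches the whole tree, so the root element name does not matter.
    for mime in root.iter("mimeType"):
        if mime.get("name") == mime_type:
            return [plugin.get("id") for plugin in mime.findall("plugin")]
    return []  # no entry for this content type


if __name__ == "__main__":
    order = plugin_order(SAMPLE, "text/xml")
    print("plugins for text/xml, in listed order:", order)
```

If "feed" is not the first id printed for text/xml, the ordering issue
described above is the likely culprit.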





-- 
Doğacan Güney