Posted to user@nutch.apache.org by Xalan <aa...@gmail.com> on 2009/06/11 00:57:13 UTC

Crawling blogs, feeds & comments

Hello,

I would like to crawl feed URLs (/feed/) and comment URLs (#comments). I
have read that some plugins need to be added to parse-plugins.xml, but I
don't know which ones. What can I do?

Thanks for your help


-- 
Allan Avendaño S.
Guayaquil-Ecuador
Home   :   +593(4) 2800 692
Office :   +593(4) 2269 268
Mobile :        09 700 42 48
MSN-Messenger: edgar_allan_poe86@hotmail.com
Gmail: aavendan@gmail.com

Re: Crawling blogs, feeds & comments

Posted by Lewis John Mcgibbney <le...@gmail.com>.
There is actually some nice Javadoc documentation within the FeedParser [1]
and FeedIndexingFilter [2] if you look there. Also have a look at the test
suite for this plugin; it's pretty comprehensive. You can use it from the
command line for ease of use.

hth

[1]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
[2]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java

On Tue, Mar 27, 2012 at 10:35 AM, pragya <pr...@gmail.com> wrote:

> I also want to crawl RSS feeds in Nutch.
> I have heard about a 'feed plugin'.
> If anyone knows about it, please let me know how it works.
> Thank you.
>



-- 
*Lewis*

Re: Crawling blogs, feeds & comments

Posted by pragya <pr...@gmail.com>.
I also want to crawl RSS feeds in Nutch.
I have heard about a 'feed plugin'.
If anyone knows about it, please let me know how it works.
Thank you.


Re: Crawling blogs, feeds & comments

Posted by yanky young <ya...@gmail.com>.
Hi:

Maybe you just need to add a URL filter rule to your regex-urlfilter.txt
configuration file.
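
As a rough illustration of such a rule, the fragment below sketches what
could go into conf/regex-urlfilter.txt. The host example.com and the exact
patterns are assumptions made up for this sketch, not taken from the thread.
Each line starts with + (accept) or - (reject), and the first matching
pattern decides, so the feed rule has to come before any reject rule that
would otherwise catch the same URL.

```text
# Hypothetical host: replace example.com with the blog you want to crawl.
# Accept feed URLs such as http://example.com/feed/ or http://example.com/some-post/feed/
+^http://([a-z0-9-]+\.)*example\.com/(.*/)?feed/?$

# Accept the remaining pages on the same host
+^http://([a-z0-9-]+\.)*example\.com/

# Reject everything else
-.
```

If your existing file already contains reject rules near the top, check
whether one of them matches your feed URLs before the accept rule is reached.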

And if the feeds are in RSS or Atom format, you should activate the parse-rss
plugin. Just add it to the plugins section of your nutch-site.xml.
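
For what it is worth, the plugins section is normally the plugin.includes
property. The snippet below is only a sketch: the value shown is an assumed,
shortened plugin list, so start from the plugin.includes default in your own
nutch-default.xml and add rss to its parse-(...) group, rather than copying
this value as it is.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties here override nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed value for illustration only: "rss" added to the parse-(...) group.
         Copy the real default from your nutch-default.xml and extend that. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Which parser plugins exist depends on your Nutch version, so check that a
parse-rss directory is present under your plugins folder before enabling it.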

good luck

yanky

2009/6/11 Xalan <aa...@gmail.com>

> Hello,
>
> I would like to crawl feed URLs (/feed/) and comment URLs (#comments). I
> have read that some plugins need to be added to parse-plugins.xml, but I
> don't know which ones. What can I do?
>
> Thanks for your help
>
>
> --
> Allan Avendaño S.
> Guayaquil-Ecuador
> Home   :   +593(4) 2800 692
> Office :   +593(4) 2269 268
> Mobile :        09 700 42 48
> MSN-Messenger: edgar_allan_poe86@hotmail.com
> Gmail: aavendan@gmail.com
>