Posted to user@nutch.apache.org by Xalan <aa...@gmail.com> on 2009/06/11 00:57:13 UTC
Crawling blogs, feeds & comments
Regards,
I would like to crawl feed URLs (/feed/) and comment URLs (#comments). I
have read that some plugins need to be added to parse-plugins.xml, but I
don't know which ones. What can I do?
Thanks for your help
--
Allan Avendaño S.
Guayaquil-Ecuador
Home : +593(4) 2800 692
Office : +593(4) 2269 268
Mobile : 09 700 42 48
MSN-Messenger: edgar_allan_poe86@hotmail.com
Gmail: aavendan@gmail.com
Re: Crawling blogs, feeds & comments
Posted by Lewis John Mcgibbney <le...@gmail.com>.
There is actually some nice Javadoc documentation within the FeedParser [1]
and FeedIndexingFilter [2] if you look there. Also have a look at the test
suite for this plugin; it's pretty comprehensive. You can use it from the
command line for ease of use.
hth
[1]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
[2]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
On Tue, Mar 27, 2012 at 10:35 AM, pragya <pr...@gmail.com> wrote:
> I also want to crawl RSS feeds in Nutch.
> I have heard about a 'feed plugin'.
> If anyone knows about it, please let me know how it works.
> thank you
--
*Lewis*
Re: Crawling blogs, feeds & comments
Posted by pragya <pr...@gmail.com>.
I also want to crawl RSS feeds in Nutch.
I have heard about a 'feed plugin'.
If anyone knows about it, please let me know how it works.
thank you
--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-blogs-feeds-comments-tp618324p3860817.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Crawling blogs, feeds & comments
Posted by yanky young <ya...@gmail.com>.
Hi:
Maybe you just need to add a URL filter rule to your regex-urlfilter.txt
configuration file.
And if the feeds are in RSS or Atom format, you should activate the parse-rss
plugin. Just add it to the plugins part (the plugin.includes property) of
your nutch-site.xml.
good luck
yanky
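
To make the URL filter suggestion a bit more concrete, here is a rough Java
sketch of how rules in the style of regex-urlfilter.txt could be evaluated.
It is not Nutch code. It assumes, as I understand the regex filter, that each
rule is a '+' (accept) or '-' (reject) followed by a regular expression, that
the first rule whose pattern is found anywhere in the URL decides, and that a
URL matching no rule is rejected. The rules and the blog.example.org host are
made up for illustration. It also shows that a #comments URL is only a
fragment of the post URL, so dropping the fragment gives the page that would
actually be fetched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FeedUrlFilterSketch {

    /** One filter rule: '+' accepts, '-' rejects; the rest of the line is a regex. */
    public static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Hypothetical rules, written the way they might appear in regex-urlfilter.txt.
    public static final List<Rule> RULES = Arrays.asList(
        new Rule("+/feed/?$"),                        // accept feed URLs
        new Rule("-\\.(gif|jpg|png|css|js)$"),        // reject static assets
        new Rule("+^https?://blog\\.example\\.org/")  // accept the rest of the (made-up) blog
    );

    /** Assumed semantics: first rule found in the URL decides; no match means reject. */
    public static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** Drops the "#..." fragment, which is never sent to the server. */
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://blog.example.org/feed/",
            "http://blog.example.org/logo.png",
            "http://blog.example.org/2009/06/post/#comments",
            "http://other.example.com/page"
        };
        for (String url : urls) {
            String page = stripFragment(url);
            System.out.println(page + " -> " + (accepts(page) ? "accept" : "reject"));
        }
    }
}
```

With these made-up rules, the feed URL and the post URL are accepted, while
the image and the off-site page are rejected. The order of the rules matters
because the first match wins.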
2009/6/11 Xalan <aa...@gmail.com>
> Regards,
>
> I would like to crawl feed URLs (/feed/) and comment URLs (#comments). I
> have read that some plugins need to be added to parse-plugins.xml, but I
> don't know which ones. What can I do?
>
> Thanks for your help