Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/05/10 14:58:15 UTC

[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494735 ] 

Doğacan Güney commented on NUTCH-444:
-------------------------------------

Now that NUTCH-443 is in, how does everyone feel about this one? We have been using ROME in our system for a while now, and we are very happy with it. Besides being actively developed, its biggest advantage over feedparser is that it supports modules, so it can also parse MediaRSS, iTunes podcast feeds, etc. That makes it a better building block for a podcast or video search engine.

We can also go with the transparency interface, but I am worried that the interface is going to be huge if it also has to support video thumbnails (from MediaRSS), enclosures, and all the other extra stuff that comes from RSS extensions. That's why I think just choosing a library (*cough* ROME *cough* :) is better.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing feeds larger than 100k, since it has to convert the feed to JDOM first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - ROME
> - Informa
> - custom implementation based on StAX
> - ??
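
To make the "custom implementation based on StAX" alternative a bit more concrete, here is a minimal sketch of what streaming an RSS 2.0 feed could look like. It is not Nutch code and not a proposal for the plugin API: the class StaxFeedSketch, the Item holder and the parseItems method are hypothetical names made up for illustration, and only the JDK's javax.xml.stream classes are used. The point is that the reader pulls one event at a time, so no in-memory tree of the whole feed is built.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: pull titles and enclosure URLs out of an RSS 2.0 feed with StAX. */
public class StaxFeedSketch {

    /** Hypothetical holder for the two fields this sketch extracts. */
    public static final class Item {
        public String title;
        public String enclosureUrl;
    }

    public static List<Item> parseItems(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted sites, so turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<Item> items = new ArrayList<>();
        Item current = null; // non-null only while we are inside an <item>
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    String prefix = reader.getPrefix();
                    // Only look at un-prefixed elements, so media:title etc. are skipped.
                    boolean plain = prefix == null || prefix.isEmpty();
                    if (plain && "item".equals(name)) {
                        current = new Item();
                    } else if (plain && current != null && "title".equals(name)) {
                        current.title = reader.getElementText();
                    } else if (plain && current != null && "enclosure".equals(name)) {
                        current.enclosureUrl = reader.getAttributeValue(null, "url");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && current != null
                        && "item".equals(reader.getLocalName())) {
                    items.add(current);
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
        return items;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up feed content, only here to exercise the parser.
        String xml = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>Episode 1</title>"
                + "<enclosure url=\"http://example.com/ep1.mp3\" type=\"audio/mpeg\" length=\"1\"/>"
                + "</item></channel></rss>";
        for (Item item : parseItems(new StringReader(xml))) {
            System.out.println(item.title + " -> " + item.enclosureUrl);
        }
    }
}
```

Note that this handles only plain RSS 2.0 items. Atom and every extension (MediaRSS, iTunes, ...) would need their own branches, which is the part a library with module support already provides.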
