
Posted to dev@nutch.apache.org by Chris Mattmann <ch...@jpl.nasa.gov> on 2005/04/04 19:27:13 UTC

RSS Parser Plugin based on commons-feedparser submitted

Hi Folks,

 I just wanted to let you know that I've submitted the parse-rss plugin that
I was working on to the JIRA system under issue "NUTCH-30"
(http://issues.apache.org/jira/browse/NUTCH-30). The plugin includes a patch
file (svn diff), along with the zipped up source and runtime libraries. The
rss parser is based on the commons-feedparser out of the jakarta sandbox,
and fully supports all of the major rss formats (atom, rss 1.0, 2.0, etc.).
Additionally, I've included a junit test that runs the parser on an example
rss file and validates the outlinks and content extracted.

I hope that you will find it useful and vote to have it included in the
nutch distro.

Thanks,
  Chris 

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

Re: [Nutch-dev] Re: RSS Parser Plugin based on commons-feedparser submitted

Posted by "Kevin A. Burton" <bu...@rojo.com>.

Andrzej Bialecki wrote:

> Chris Mattmann wrote:
>
>> Hi Folks,
>>
>>  I just wanted to let you know that I've submitted the parse-rss plugin
>> that I was working on to the JIRA system under issue "NUTCH-30"
>> (http://issues.apache.org/jira/browse/NUTCH-30). The plugin includes a
>> patch file (svn diff), along with the zipped up source and runtime
>> libraries. The rss parser is based on the commons-feedparser out of the
>> jakarta sandbox, and fully supports all of the major rss formats (atom,
>> rss 1.0, 2.0, etc.). Additionally, I've included a junit test that runs
>> the parser on an example rss file and validates the outlinks and content
>> extracted.
>>
>> I hope that you will find it useful and vote to have it included in the
>> nutch distro.
>
>
> +1, with some reservations (see jira).
>
> I think it's a very useful contribution. Thank you, Chris!
>
Wow... that's GREAT.  (I'm the author of the FeedParser).

BTW, it's in commons-proper now, but I just haven't had a chance to do a
0.5.0 release.  We've had a release candidate, but I need to release
another one WRT some feedback we've had.

If you're running from a sandbox build I'd HIGHLY recommend getting a 
commons proper build of 0.5.0RC1.

http://jakarta.apache.org/commons/feedparser/

Kevin

-- 

Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!

Kevin A. Burton, Location - San Francisco, CA
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

Re: [Nutch-dev] Re: RSS Parser Plugin based on commons-feedparser submitted

Posted by "Kevin A. Burton" <bu...@rojo.com>.

Andrzej Bialecki wrote:

> Chris Mattmann wrote:
>
> +1, with some reservations (see jira).
>
> I think it's a very useful contribution. Thank you, Chris!
>
Also.. are you using our networking IO layer?

If so I'd recommend setting your own UserAgent.

Kevin

RE: RSS Parser Plugin based on commons-feedparser submitted

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Andrzej,

  Thanks for all the comments. I will update the parse-rss code to abide by
those guidelines, and will try to get out an update in the next day or so.

Thanks for looking at the code so carefully!

I'll also fix the patch file using your suggested svn command.

Take care,
  Chris


> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Monday, April 04, 2005 3:17 PM
> To: nutch-dev@incubator.apache.org
> Subject: Re: RSS Parser Plugin based on commons-feedparser submitted
> 
> Chris Mattmann wrote:
> > Hi Andrzej,
> >
> >  Yeah, actually I was the one that initiated that thread about the XML
> > parsing libraries ;) Kinda funny how my plugin uses one huh? :-)
> >
> 
> :-)
> 
> > The plugin I submitted uses jdom actually (although it's a moot point
> [..]
> 
> Ah, thanks for the clarification. So, what this boils down to is that
> dom4j is there to stay. I don't mind it, I was just curious.
> 
> >
> > As for the patch having large white-space in the diffs, I can fix that
> with
> > a perl script. I'll try and fix that by tonight.
> >
> 
> Perhaps try to run a 'svn diff -x -b' to ignore whitespace changes, and
> then only fix the remaining lines that really differ.
> 
> > With respect to the transformDocument comment, my RSS Parser doesn't
> use
> > that function: that is from the one that Stefan submitted earlier before
> he
> > could find my code and look at it. The two files that I submitted (that
> > comprise my plugin) are:
> >
> >  parse-rss-patch.txt
> >  parse-rss.zip
> 
> Ah, ok - then I was looking at the wrong files altogether. I now had a
> look at the source in parse-rss.zip. If you don't mind, here's a couple
> of new comments:
> 
> * package names follow the old naming, the new naming is under
> org.apache.nutch.*
> 
> * in RSSParser.java, you retrieve contentLength, but the code never uses
> it.
> 
> * lines 149-160 seem a bit bogus to me. As I understand the RSS spec,
> the item's permalink should be preferred _if present_, but it's not an
> error if it's absent (as signified by a null value, which currently
> causes MalformedURLException to be thrown) - in such case the getLink
> should be used instead. The message in line 157 is wrong, too, because
> it prints the url of the channel, and not the current item. When it's
> fixed, it would be also good to demonstrate such fallback in the test
> case.
> 
> * I'm not sure what is the purpose of copying through the metadata - the
> code doesn't modify the copy, so you could as well use the original,
> right?
> 
> * probably just a matter of programming style, but I'm always somewhat
> vexed by frequent String concatenations, especially in a "for" loop -
> like in the code that creates the title and the body. StringBuffer-s
> would be a good fit here...
> 
> * IMHO it's better to put the various intermediate diagnostic output
> from the plugin under LOG.fine(), to reduce the amount of information to
> be logged. The final result of processing the content could be put under
> LOG.info() or LOG.warn(), depending on the final result. (I personally
> favor no output whatsoever if everything went ok).
> 
> And lastly, a minor thing, but still... the formatting style and
> indentation in most files doesn't adhere to the Nutch coding style when
> it comes to whitespace rules - please see e.g. WebDBReader.java as a
> reference. This especially concerns the whitespace around curly braces
> and assignments, and the use of literal Tab instead of 4 spaces. This is
> easy to fix with an IDE, but it helps a lot when someone else is reading
> the code...
> 
> Thanks again for this contribution!
> 
> 
> --
> Best regards,
> Andrzej Bialecki
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

Re: RSS Parser Plugin based on commons-feedparser submitted

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Mattmann wrote:
> Hi Andrzej,
> 
>  Yeah, actually I was the one that initiated that thread about the XML
> parsing libraries ;) Kinda funny how my plugin uses one huh? :-)
> 

:-)

> The plugin I submitted uses jdom actually (although it's a moot point
[..]

Ah, thanks for the clarification. So, what this boils down to is that 
dom4j is there to stay. I don't mind it, I was just curious.

> 
> As for the patch having large white-space in the diffs, I can fix that with
> a perl script. I'll try and fix that by tonight.
> 

Perhaps try to run a 'svn diff -x -b' to ignore whitespace changes, and
then only fix the remaining lines that really differ.

> With respect to the transformDocument comment, my RSS Parser doesn't use
> that function: that is from the one that Stefan submitted earlier before he
> could find my code and look at it. The two files that I submitted (that
> comprise my plugin) are:
> 
>  parse-rss-patch.txt
>  parse-rss.zip 

Ah, ok - then I was looking at the wrong files altogether. I now had a 
look at the source in parse-rss.zip. If you don't mind, here's a couple 
of new comments:

* package names follow the old naming, the new naming is under 
org.apache.nutch.*

* in RSSParser.java, you retrieve contentLength, but the code never uses it.

* lines 149-160 seem a bit bogus to me. As I understand the RSS spec, 
the item's permalink should be preferred _if present_, but it's not an 
error if it's absent (as signified by a null value, which currently 
causes MalformedURLException to be thrown) - in such case the getLink 
should be used instead. The message in line 157 is wrong, too, because 
it prints the url of the channel, and not the current item. When it's 
fixed, it would be also good to demonstrate such fallback in the test case.

* I'm not sure what is the purpose of copying through the metadata - the 
code doesn't modify the copy, so you could as well use the original, right?

* probably just a matter of programming style, but I'm always somewhat 
vexed by frequent String concatenations, especially in a "for" loop - 
like in the code that creates the title and the body. StringBuffer-s 
would be a good fit here...

* IMHO it's better to put the various intermediate diagnostic output 
from the plugin under LOG.fine(), to reduce the amount of information to 
be logged. The final result of processing the content could be put under 
LOG.info() or LOG.warn(), depending on the final result. (I personally 
favor no output whatsoever if everything went ok).
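
To illustrate the permalink point, a minimal sketch of the fallback might
look like the following. It is not the code from parse-rss.zip: the class
name ItemLinkResolver and both method names are hypothetical, and the two
String arguments simply stand in for whatever the feed item reports as its
permalink and its plain link.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch or commons-feedparser.
class ItemLinkResolver {

    /**
     * Prefers the item's permalink when it is present and well formed,
     * otherwise falls back to the item's plain link.
     * Returns null when neither value yields a usable URL.
     */
    static URL resolveItemUrl(String permalink, String link) {
        URL url = parseOrNull(permalink);
        return (url != null) ? url : parseOrNull(link);
    }

    /** Parses a candidate string, treating null, blank, or malformed input as absent. */
    static URL parseOrNull(String candidate) {
        if (candidate == null || candidate.trim().isEmpty()) {
            return null;
        }
        try {
            return new URL(candidate.trim());
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Permalink present: it is used.
        System.out.println(resolveItemUrl("http://example.com/permalink", "http://example.com/link"));
        // Permalink absent: the plain link is used instead of throwing.
        System.out.println(resolveItemUrl(null, "http://example.com/link"));
    }
}
```

A test case for the fallback would then only need an item without a
permalink and a check that the outlink equals the item's link.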

And lastly, a minor thing, but still... the formatting style and 
indentation in most files doesn't adhere to the Nutch coding style when 
it comes to whitespace rules - please see e.g. WebDBReader.java as a 
reference. This especially concerns the whitespace around curly braces 
and assignments, and the use of literal Tab instead of 4 spaces. This is 
easy to fix with an IDE, but it helps a lot when someone else is reading 
the code...
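
For the string concatenation and logging comments above, a rough sketch of
what I have in mind follows. The class FeedTextBuilder and its joinTitles
method are made up for illustration and are not taken from the plugin. It
uses java.util.logging, where the warning-level method is named warning(),
and StringBuilder, the unsynchronized counterpart of StringBuffer available
since Java 5.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class, not part of the parse-rss plugin.
class FeedTextBuilder {

    private static final Logger LOG = Logger.getLogger(FeedTextBuilder.class.getName());

    /** Joins item titles into one text block, one title per line. */
    static String joinTitles(List<String> titles) {
        // One builder for the whole loop instead of repeated String concatenation.
        StringBuilder text = new StringBuilder();
        for (String title : titles) {
            if (title == null) {
                LOG.fine("Skipping item without a title");
                continue;
            }
            // Intermediate diagnostics go to FINE; the guard avoids building
            // the message string when FINE is not enabled.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("Appending title: " + title);
            }
            text.append(title).append('\n');
        }
        // Only the final outcome is reported, and only when something went wrong.
        if (text.length() == 0) {
            LOG.warning("No titles extracted from feed");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinTitles(List.of("First post", "Second post")));
    }
}
```

With the default logging configuration the FINE messages are not shown, so
a successful run produces no log output at all.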

Thanks again for this contribution!


Re: RSS Parser Plugin based on commons-feedparser submitted

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Andrzej,

 Yeah, actually I was the one that initiated that thread about the XML
parsing libraries ;) Kinda funny how my plugin uses one huh? :-)

The plugin I submitted uses jdom actually (although it's a moot point
whether it uses jdom, or dom4j, etc.). The jdom dependency comes from jaxen,
which the commons-feedparser uses. The nice thing about the
commons-feedparser component is its SAX-based (event style) parsing model,
and its ability to handle virtually all of the different RSS feed styles
(Atom, RSS 1.0, 2.0, etc.).

The original discussion about the different XML parsing APIs arose out of
Nutch's reliance (at the time) on dom4j 1.4.2, which had some external jaxen
API classes included in it, which caused namespace conflicts with various
other XML parsing APIs. Therefore, those who wrote plugins for Nutch before
the dom4j in the $NUTCH_HOME/lib directory was upgraded to 1.5.2, and who
needed jaxen, or dom4j, or other XML reading APIs in their plugins, would
have had namespace conflicts like myself. So, Doug upgraded Nutch to rely on
dom4j 1.5.2, which doesn't include the additional jaxen classes, and that
problem has been alleviated (for now of course, until the next XML API
conflict comes along ;) ).

As for the patch having large white-space in the diffs, I can fix that with
a perl script. I'll try and fix that by tonight.

With respect to the transformDocument comment, my RSS Parser doesn't use
that function: that is from the one that Stefan submitted earlier before he
could find my code and look at it. The two files that I submitted (that
comprise my plugin) are:

 parse-rss-patch.txt
 parse-rss.zip 

Thanks for your comments and I hope that the Nutch community can benefit
from the plugin.

Cheers,
  Chris

On 4/4/05 12:14 PM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

> Chris Mattmann wrote:
>> Hi Folks,
>> 
>>  I just wanted to let you know that I've submitted the parse-rss plugin that
>> I was working on to the JIRA system under issue "NUTCH-30"
>> (http://issues.apache.org/jira/browse/NUTCH-30). The plugin includes a patch
>> file (svn diff), along with the zipped up source and runtime libraries. The
>> rss parser is based on the commons-feedparser out of the jakarta sandbox,
>> and fully supports all of the major rss formats (atom, rss 1.0, 2.0, etc.).
>> Additionally, I've included a junit test that runs the parser on an example
>> rss file and validates the outlinks and content extracted.
>> 
>> I hope that you will find it useful and vote to have it included in the
>> nutch distro.
> 
> +1, with some reservations (see jira).
> 
> I think it's a very useful contribution. Thank you, Chris!

Re: RSS Parser Plugin based on commons-feedparser submitted

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Mattmann wrote:
> Hi Folks,
> 
>  I just wanted to let you know that I've submitted the parse-rss plugin that
> I was working on to the JIRA system under issue "NUTCH-30"
> (http://issues.apache.org/jira/browse/NUTCH-30). The plugin includes a patch
> file (svn diff), along with the zipped up source and runtime libraries. The
> rss parser is based on the commons-feedparser out of the jakarta sandbox,
> and fully supports all of the major rss formats (atom, rss 1.0, 2.0, etc.).
> Additionally, I've included a junit test that runs the parser on an example
> rss file and validates the outlinks and content extracted.
> 
> I hope that you will find it useful and vote to have it included in the
> nutch distro.

+1, with some reservations (see jira).

I think it's a very useful contribution. Thank you, Chris!
