
Posted to dev@nutch.apache.org by Marco PV <nu...@hotmail.com> on 2005/04/21 04:24:29 UTC

parse-rss fetch problems

Hi,

I'm using /nutch-nightly from April 18th.
I've downloaded and uploaded the latest src/plugin/parse-rss (src) and 
/plugin/parse-rss (bin).
I've also compiled it with "ant", with no errors.
I've edited nutch-default.xml and modified the plugin entry to
"parse-(rss|text|html)".
Should I edit the new mime.type files?

But when trying to fetch, it can't parse either .xml or .rss files.
I get the error "indexed, but can't parse : content type not text/html; 
content type is text/xml".
  Should I edit the new mime.type files?
  What should I do?

Please, help.

Thanks,
Marco


Re: parse-rss fetch problems

Posted by Jérôme Charron <je...@gmail.com>.

> The bigger issue, however, is how you deal with causing the byte sequence
> (or so called "magic characters") in the mime types configuration file to
> recognize that a file is in fact an RSS file. With so many different types
> of valid feeds (RSS 2.0, 0.9, 1.0, ATOM, and its many versions), how do you
> reliably and accurately detect by magic character matchers that a file is
> RSS? The first bytes of the file may be * completely * different in all
> these valid feed types. The only thing you could probably detect is the fact
> that the file is of type text/xml. Then, you would need a way to then
> understand that it's an XML file, but it's also RSS.


That's right. I took a look at the Freedesktop MIME-type database, and it 
doesn't have any magic detection for RSS.
In fact, there's no easy way to detect RSS content.
But the current MIME-type definitions in Nutch can detect XML content using 
the magic sequence <?xml at the beginning of the file.
Then the RSS parser module needs to check whether this XML file is RSS 
content or not.
For now, that's the only solution.
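
As a rough illustration of that two-step idea (magic check first, then a look inside the XML), here is a minimal Java sketch. It is not code from Nutch or from the parse-rss plugin: the class FeedSniffer and its methods are hypothetical names, and treating the root elements rss, RDF and feed as "this is a feed" is an assumed heuristic.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of Nutch.
final class FeedSniffer {

    // Step 1: the magic check. It only tells us the content is XML.
    static boolean startsWithXmlMagic(byte[] content) {
        byte[] magic = "<?xml".getBytes(StandardCharsets.US_ASCII);
        if (content.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (content[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Step 2: read up to the root element and compare its local name
    // against the roots used by common feed formats (assumed heuristic).
    static boolean looksLikeFeed(byte[] content) {
        if (!startsWithXmlMagic(content)) {
            return false;
        }
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or fetch external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(content));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String root = reader.getLocalName();
                    // rss: RSS 0.9x/2.0, RDF: RSS 1.0 (rdf:RDF), feed: Atom
                    return root.equals("rss") || root.equals("RDF") || root.equals("feed");
                }
            }
        } catch (XMLStreamException e) {
            // Malformed XML: treat it as "not a feed".
            return false;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] sample = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeFeed(sample));
    }
}
```

Note that an rdf:RDF root only indicates RDF, so the RSS 1.0 case is approximate, and a file with a byte-order mark or without an XML declaration would fail the magic check in this sketch.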

> parse-rss plugin.xml file, and change it to handle content type "text/xml"
> instead of "application/rss+xml", which is what's currently in there. Then,
> when the code gets called, I've coded the RSSParser to accept both
> "application/rss+xml", * and * "text/xml". So, it would work fine from
> there.
> Does that make sense? 

Yes

Jerome


-- 
http://motrech.free.fr/
http://frutch.free.fr/

RE: parse-rss fetch problems

Posted by Chris Mattmann <Ch...@jpl.nasa.gov>.

Hi Marco,

  The issue that you are having is that the parse-html plugin is getting
called by default on the content that you are trying to parse. This may have
to do with the MIME type mappings, and the new improved way (that J. Charron
worked on) that Nutch is currently using. So, basically there needs to be an
entry in the mime types content file to detect that the file type is RSS,
and set the content type to "application/rss+xml", which will cause the
parse-rss content parser to be invoked. The problem right now for you is
that it is not being invoked.

  The bigger issue, however, is how you deal with causing the byte sequence
(or so called "magic characters") in the mime types configuration file to
recognize that a file is in fact an RSS file. With so many different types
of valid feeds (RSS 2.0, 0.9, 1.0, ATOM, and its many versions), how do you
reliably and accurately detect by magic character matchers that a file is
RSS? The first bytes of the file may be * completely * different in all
these valid feed types. The only thing you could probably detect is the fact
that the file is of type text/xml. Then, you would need a way to then
understand that it's an XML file, but it's also RSS.

  So, long story short, let me look into how this could be done with
J. Charron's new MIME type system. I'll try and think about how this could
be done. In the meantime, try and see if you can get the MIME type system
to recognize that the file is in fact XML. Because, if you do that, then a
quick and dirty solution for your problem would be to just edit the
parse-rss plugin.xml file, and change it to handle content type "text/xml"
instead of "application/rss+xml", which is what's currently in there. Then,
when the code gets called, I've coded the RSSParser to accept both
"application/rss+xml", * and * "text/xml". So, it would work fine from
there.
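
To make the "accept both content types" part concrete, here is a minimal Java sketch of such a check. It is only an assumption about how a check like this could look, not the actual RSSParser code: the class RssContentTypes and the method accepts are hypothetical names, and stripping parameters such as "; charset=..." before comparing is an added assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not the real RSSParser.
final class RssContentTypes {

    // The two content types mentioned in this thread.
    private static final Set<String> ACCEPTED =
        new HashSet<>(Arrays.asList("application/rss+xml", "text/xml"));

    // Returns true when the base type (parameters removed, lower-cased)
    // is one of the accepted content types.
    static boolean accepts(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return ACCEPTED.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(accepts("text/xml"));
        System.out.println(accepts("text/html"));
    }
}
```

Because "text/xml" covers any XML document, a check like this would still need to be followed by a look at the content itself before treating it as RSS.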


Does that make sense? If not, just let me know. I got your prior email with
the info about checking out your system. I have some free time tonight, so
I'll give it a look see and let you know if I can set that up for you.

Thanks,
  Chris Mattmann


-----Original Message-----
From: Marco PV [mailto:nutch_mail@hotmail.com] 
Sent: Wednesday, April 20, 2005 7:24 PM
To: nutch-dev@incubator.apache.org
Subject: parse-rss fetch problems

Hi,

I'm using /nutch-nightly from April 18th.
I've downloaded and uploaded the latest src/plugin/parse-rss (src) and 
/plugin/parse-rss (bin).
I've also compiled it with "ant", with no errors.
I've edited nutch-default.xml and modified the plugin entry to
"parse-(rss|text|html)".
Should I edit the new mime.type files?

But when trying to fetch, it can't parse either .xml or .rss files.
I get the error "indexed, but can't parse : content type not text/html; 
content type is text/xml".
  Should I edit the new mime.type files?
  What should I do?

Please, help.

Thanks,
Marco
