Posted to user@nutch.apache.org by Miguel A Paraz <mp...@gmail.com> on 2005/10/18 18:36:29 UTC

Crawling blogs and RSS

Hi,
I'm trying to set up Nutch to crawl blogs.

For nutch-site.xml, I added parse-rss to plugin.includes:
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more|query-(basic|site|url)</value>


and set db.ignore.internal.links to false.

I noticed that in parse-plugins.xml:

        <mimeType name="text/xml">
                <plugin id="parse-text" />
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
        </mimeType>

Is this in order of priority, with parse-rss last?

I tried injecting a single URL, my blog feed which is text/xml:
http://migs.paraz.com/w/feed/

It apparently isn't parsed.

Thanks in advance.

Re: Crawling blogs and RSS

Posted by Miguel A Paraz <mp...@gmail.com>.
On 10/19/05, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
> Hi Miguel,
> Anyways, I have been thinking about this for a while, and will start working
> on a proposal and solution in the near future. For now, if you like, you
> could create a JIRA issue about this as a "wish" or "improvement" to be
> worked on in the (near) future.


Thanks for the insightful reply! Count me in if you need help in
coding and testing.

Re: Crawling blogs and RSS

Posted by Miguel A Paraz <mp...@gmail.com>.
On 10/19/05, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
>  Actually it's not out of priority, unfortunately because of the generic
> nature of the mime type "text/xml". Turns out that a lot of RSS comes back
> as configured by the web server with the content type "text/xml", even
> though it's recommended that "application/rss+xml" be used as the mime type
> for RSS. Most web server admins don't really spend the time configuring this
> mime type correctly in their web server. Further, if you go look at the IANA
> list of mime types, there really isn't a mime type specified for RSS
> (although RDF has application/rdf+xml, which is sometimes used to identify
> RSS as well).

Hi,
I just realized: we don't have to look inside the XML file. We can
pick it up from context.

1. We could look inside the <head/> for links like:

<link rel="alternate" type="application/rss+xml" title="RSS 2.0"
href="http://migs.paraz.com/w/feed/" />
<link rel="alternate" type="application/atom+xml" title="Atom 0.3"
href="http://migs.paraz.com/w/feed/atom/" />

Is it practical to add a parser type to the Outlink type, so that the
HTML parser could set it from context?

2. We could add a new inject type: inject a list of feed URLs as the
starting point for the crawl. Technically, this isn't necessary, since
an external program could parse the feeds and then generate the URLs.
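
To make idea 1 concrete, here is a minimal sketch in Python rather than in Nutch's own Java, using only the standard library. The names FEED_TYPES, FeedLinkFinder and find_feeds are made up for this example and are not part of Nutch, and the sketch does not check that the link tags actually sit inside <head>.

```python
from html.parser import HTMLParser

# MIME types treated as feeds; this set is an assumption for illustration.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects (href, type) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower()
        if "alternate" in rels and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append((attr["href"], mime))


def find_feeds(html):
    """Return the feed links advertised by an HTML page, in document order."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds
```

An HTML parser that found such links could then hand the declared type along with each outlink, so the fetcher would not have to rely on the server's Content-Type alone.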

Re: Crawling blogs and RSS

Posted by Jérôme Charron <je...@gmail.com>.
> One part of fixing this problem is correct mime type identification for
> document types, which I know that Jerome is working on an update to, and
> will soon have a new mime type registry committed to Nutch.

The future Mime Type Registry will be compatible with the FreeDesktop Shared
Mime Info specification:
http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-0.13.html
As you can see, this specification provides an XML recognition
mechanism: a *root-XML* element that identifies the
precise mime-type of an XML document based on its namespaceURI and/or its
localName.
This part of the specification is not yet implemented (but planned), so
in the near future (I hope!!) the Mime Type Registry will be able to
solve your use case.
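
To illustrate the root-XML idea, here is a rough Python sketch (not the planned Nutch implementation) that refines "text/xml" by looking only at the document's root element. The ROOT_TO_MIME table and the function name sniff_xml_mime are assumptions made up for this example. A real registry would load its rules from the shared-mime-info database.

```python
import io
import xml.etree.ElementTree as ET

# (namespace URI, local name) of the root element -> more specific MIME type.
# Illustrative table only; not taken from any real registry.
ROOT_TO_MIME = {
    ("", "rss"): "application/rss+xml",
    ("http://www.w3.org/2005/Atom", "feed"): "application/atom+xml",
    ("http://www.w3.org/1999/02/22-rdf-syntax-ns#", "RDF"): "application/rdf+xml",
    ("http://www.w3.org/1999/xhtml", "html"): "application/xhtml+xml",
}


def sniff_xml_mime(data, default="text/xml"):
    """Return a MIME type chosen from the root element, or `default`."""
    try:
        # The first "start" event is always the root element, so stop there.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            tag = elem.tag
            break
        else:
            return default
    except ET.ParseError:
        return default
    # ElementTree writes namespaced tags as "{namespace}local".
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
    else:
        namespace, local = "", tag
    return ROOT_TO_MIME.get((namespace, local), default)
```

With something like this, a feed served as "text/xml" could still be recognised as RSS before a parser is chosen.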

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Crawling blogs and RSS

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Miguel,

 Actually it's not out of priority, unfortunately because of the generic
nature of the mime type "text/xml". Turns out that a lot of RSS comes back
as configured by the web server with the content type "text/xml", even
though it's recommended that "application/rss+xml" be used as the mime type
for RSS. Most web server admins don't really spend the time configuring this
mime type correctly in their web server. Further, if you go look at the IANA
list of mime types, there really isn't a mime type specified for RSS
(although RDF has application/rdf+xml, which is sometimes used to identify
RSS as well). 

 So when I coded up the parse-plugins.xml file, I just noted the fact that
text/xml isn't really the standard mime type for rss, it's just the mime
type for any type of XML document, i.e., something that starts out with
"<?xml version=.....", which can conform to * any * XML Schema or DTD as
specified, which means identifying a document as text/xml doesn't really get
you anywhere, unfortunately. That's why I set the parse-text plugin to be
the highest priority for text/xml, as in my mind it was most suited to
handle the generic nature of XML documents. I listed parse-html as 2nd in
priority because XHTML is becoming more popular and a pervasive form of
content. Finally, parse-rss is last, well, because, I think it should be.
:-) If you think about it, parse-rss is really only meant to handle RSS
feeds, which may, or may not, come back with the mime type "text/xml".

So, to answer your question, yes, parse-rss is last in the default
parse-plugins file. However, this doesn't mean it has to be that way in your
file. You are free to modify this list. Remember that order matters: the
order in which the plugins are listed underneath a mime type specifies their
order of preference during parsing. You can find the full
specification of this at:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

which was authored by myself, Jerome Charron, and Sebastien LeCallonec
jointly. 
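
As a small illustration of "order matters", the following Python sketch reads the text/xml entry from the question and returns the plugin ids in the order they are listed. This is not how Nutch itself loads the file. The wrapping parse-plugins root element and the helper names plugins_for and prefer are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The text/xml entry from the question, wrapped in a root element
# (assumed name) so that the fragment parses on its own.
PARSE_PLUGINS = """
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="parse-text" />
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
  </mimeType>
</parse-plugins>
"""


def plugins_for(config_xml, mime_type):
    """Return plugin ids for mime_type in document order (first = preferred)."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("mimeType"):
        if entry.get("name") == mime_type:
            return [plugin.get("id") for plugin in entry.findall("plugin")]
    return []


def prefer(plugin_ids, preferred):
    """Move `preferred` to the front, keeping the order of the others."""
    return [preferred] + [p for p in plugin_ids if p != preferred]
```

In the file itself, the equivalent of prefer(order, "parse-rss") is simply moving the parse-rss plugin line to the top of the text/xml entry.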

One part of fixing this problem is correct mime type identification for
document types. I know that Jerome is working on an update to this and
will soon have a new mime type registry committed to Nutch. The other part
of this, however, is deeper than just correct mime type identification. It
has to do with understanding the appropriate DTD or XML Schema that an XML
document conforms to. Only then will we understand the "right" parser to
call for an XML document. This could be handled in a number of ways. Off the
top of my head, two ways come to mind:

1. Having a generic "text/xml" reading plugin that could parse out the
DTD or XML Schema used by an XML document, and then call the right "sub XML
parsing plugin" that knows how to handle that DTD or schema

2. Adding an attribute to the plugin.xml file that specifies the DTD or
Schema that an XML Parsing Plugin supports, and then doing the resolution in
a decentralized fashion whenever the mime type "text/xml" is encountered
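
A very rough Python sketch of option 2 might look like the following. It uses the root element name as a stand-in for the DTD or schema, which is a simplification. The SUPPORTED_ROOTS table is hypothetical: it plays the role of the attribute each plugin would declare in its plugin.xml, and the entries are not taken from any real plugin descriptor.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarations: plugin id -> root elements it claims to handle.
# Namespaced roots use ElementTree's "{namespace}local" notation.
SUPPORTED_ROOTS = {
    "parse-rss": {"rss"},
    "parse-html": {"{http://www.w3.org/1999/xhtml}html"},
}

# Plugin used when nothing claims the document (assumed default).
FALLBACK = "parse-text"


def resolve_plugin(xml_bytes):
    """Pick a plugin for a text/xml document based on its root element."""
    try:
        root_tag = ET.fromstring(xml_bytes).tag
    except ET.ParseError:
        # Not well-formed XML: leave it to the generic plugin.
        return FALLBACK
    for plugin_id, roots in SUPPORTED_ROOTS.items():
        if root_tag in roots:
            return plugin_id
    return FALLBACK
```

Here the resolution happens once, wherever "text/xml" is encountered, and each plugin only has to state what it supports.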

Anyways, I have been thinking about this for a while, and will start working
on a proposal and solution in the near future. For now, if you like, you
could create a JIRA issue about this as a "wish" or "improvement" to be
worked on in the (near) future.


FYI, here are a few interesting articles on the subject:

http://spazioinwind.libero.it/pierfederici/blog/000056.html
http://www.rassoc.com/gregr/weblog/archive.aspx?post=662

Thanks,
  Chris




______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.