Posted to user@nutch.apache.org by mikeyc <mc...@gmail.com> on 2006/04/17 19:02:36 UTC

Blogger RSS Parsing Error

Hey all,
I'm trying to parse Blogger RSS feeds and am getting errors when certain
elements are encountered.  Specifically, the failing elements are prefixed
with "st1".  I believe these are Microsoft Smart Tags, though I'm not 100%
sure.  Has anyone parsed these feeds successfully?  If so, can you point me
in the right direction?

I have attached the error message below for reference.  

Thanks,
Mike

org.apache.commons.feedparser.FeedParserException: org.jdom.JDOMException: Error on line 46: The prefix "st1" for element "st1:country-region" is not bound.
        at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:86)
        at org.apache.nutch.parse.rss.RSSParser.getParse(RSSParser.java:116)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:225)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:137)
Caused by: org.jdom.JDOMException: Error on line 46: The prefix "st1" for element "st1:country-region" is not bound.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:367)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:673)
        at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:73)
        ... 4 more
Caused by: org.xml.sax.SAXParseException: The prefix "st1" for element "st1:country-region" is not bound.
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:354)
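
The trace above bottoms out in a namespace-aware XML scanner rejecting an element whose prefix has no xmlns declaration in scope. A minimal sketch that reproduces that class of failure with the JDK's built-in parser might look like the following. The class name UnboundPrefixDemo, the helper tryParse, and the sample feed fragment are all hypothetical and not taken from the actual Blogger feed; Nutch itself goes through commons-feedparser and JDOM rather than DocumentBuilder.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class UnboundPrefixDemo {
    // Hypothetical feed fragment: the st1 prefix is used but never declared.
    static final String ITEM =
        "<rss version=\"2.0\"><channel><item><description>"
        + "<st1:country-region>France</st1:country-region>"
        + "</description></item></channel></rss>";

    /** Returns null if the XML parses, otherwise the parser's error message. */
    static String tryParse(String xml, boolean namespaceAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Namespace-aware parsing is expected to fail on the unbound prefix.
        System.out.println("namespace-aware:     " + tryParse(ITEM, true));
        // Without namespace processing, "st1:country-region" is just a name.
        System.out.println("not namespace-aware: " + tryParse(ITEM, false));
    }
}
```

The point of the comparison is that the document is only ill-formed with respect to XML namespaces, so the failure depends on the parser doing namespace processing.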



--
View this message in context: http://www.nabble.com/Blogger-RSS-Parsing-Error-t1462722.html#a3953362
Sent from the Nutch - User forum at Nabble.com.


Re: Blogger RSS Parsing Error

Posted by mikeyc <mc...@gmail.com>.
Chris,
OK, I'll try the commons-feedparser mailing list.  And yes, that was the
stack trace from the log output.

Thanks again,
Mike


Re: Blogger RSS Parsing Error

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Mike,

  The RSS parser for Nutch is based on Kevin Burton's commons-feedparser in
the Jakarta Sandbox. Here is the documentation for that feedparser:

http://jakarta.apache.org/commons/sandbox/feedparser/

You might want to post your RSS question to the commons-feedparser mailing
list: he's the real RSS guru, and I bet he could help you out.

  As for your guess that it's probably an unrecognized tag, I think you're
probably right. Now the question is, your fetch isn't failing because of
this, right? I mean, I see in the RSS parser that line 116 (the call to the
"parse" function) is within a try/catch block, so what you are pasting below
is just the output of the stack trace, right?
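
One possible workaround for the unbound "st1" prefix, sketched here under assumptions, is to strip the st1-prefixed tags from the feed text before it reaches the XML parser, keeping the text they wrap. The class SmartTagStripper, its methods, and the sample feed are hypothetical; this uses the JDK's DocumentBuilder as a stand-in and is not how the Nutch RSS parser or commons-feedparser actually handles the input. The regex is deliberately naive and would misbehave on tags whose attribute values contain ">" or on st1 markup inside CDATA sections.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SmartTagStripper {
    // Matches opening, closing, and self-closing tags with the st1 prefix.
    private static final Pattern ST1_TAG = Pattern.compile("</?st1:[^>]*>");

    /** Removes st1-prefixed tags but leaves their text content in place. */
    static String stripSmartTags(String xml) {
        return ST1_TAG.matcher(xml).replaceAll("");
    }

    /** Parses with namespace processing on, as the failing parser did. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical feed fragment with an undeclared st1 prefix.
        String feed =
            "<rss version=\"2.0\"><channel><item><description>Visiting "
            + "<st1:country-region>France</st1:country-region>"
            + "</description></item></channel></rss>";

        Document doc = parse(stripSmartTags(feed));
        String text = doc.getElementsByTagName("description")
                         .item(0).getTextContent();
        System.out.println(text);
    }
}
```

An alternative with the same effect on parsing would be to add an xmlns:st1 declaration to the root element, which keeps the smart-tag elements in the tree instead of discarding them.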

Anyways, good luck on your problem!

Cheers,
  Chris
