
Posted to user@abdera.apache.org by Bruce Snyder <br...@gmail.com> on 2008/11/13 10:06:07 UTC

DOCTYPE declaration causing WstxUnexpectedCharException

I'm using the Abdera API to grab Atom feeds. I've tried a few
different Atom feeds and I'm getting the following exception with all
of them:

---------------------------------------------------------------------------------------------------
Exception in thread "main" org.apache.abdera.parser.ParseException:
org.apache.abdera.parser.ParseException:
com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '-'
(code 45) in external DTD subset; expected closing '>' after ENTITY
declaration
 at [row,col,system-id]: [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
 from [row,col {unknown-source}]: [1,1]
	at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:132)
	at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:96)
	at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:74)
	at com.sonatype.feedeater.FeedEater.grabUris(FeedEater.java:52)
	at com.sonatype.feedeater.FeedEater.run(FeedEater.java:41)
	at com.sonatype.feedeater.FeedEater.main(FeedEater.java:34)
Caused by: org.apache.abdera.parser.ParseException:
com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '-'
(code 45) in external DTD subset; expected closing '>' after ENTITY
declaration
 at [row,col,system-id]: [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
 from [row,col {unknown-source}]: [1,1]
	at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:260)
	at org.apache.abdera.parser.stax.FOMBuilder.getFomDocument(FOMBuilder.java:333)
	at org.apache.abdera.parser.stax.FOMParser.getDocument(FOMParser.java:72)
	at org.apache.abdera.parser.stax.FOMParser.parse(FOMParser.java:207)
	at org.apache.abdera.parser.stax.FOMParser.parse(FOMParser.java:145)
	at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:119)
	... 5 more
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected
character '-' (code 45) in external DTD subset; expected closing '>'
after ENTITY declaration
 at [row,col,system-id]: [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
 from [row,col {unknown-source}]: [1,1]
	at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:623)
	at com.ctc.wstx.dtd.FullDTDReader.throwDTDUnexpectedChar(FullDTDReader.java:2013)
	at com.ctc.wstx.dtd.FullDTDReader.parseEntityValue(FullDTDReader.java:1533)
	at com.ctc.wstx.dtd.FullDTDReader.handleEntityDecl(FullDTDReader.java:2419)
	at com.ctc.wstx.dtd.FullDTDReader.handleDeclaration(FullDTDReader.java:2075)
	at com.ctc.wstx.dtd.FullDTDReader.parseDirective(FullDTDReader.java:720)
	at com.ctc.wstx.dtd.FullDTDReader.parseDTD(FullDTDReader.java:599)
	at com.ctc.wstx.dtd.FullDTDReader.readExternalSubset(FullDTDReader.java:457)
	at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:478)
	at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
	at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3349)
	at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
	at org.apache.abdera.parser.stax.FOMBuilder.getNextElementToParse(FOMBuilder.java:163)
	at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:187)
	... 10 more
---------------------------------------------------------------------------------------------------

The errors seem to come from the call to
ClientResponse.getDocument(). As far as I can tell, the Abdera API is
having trouble with the DOCTYPE declaration and is trying to fetch the
strict.dtd. Is there a way to work around the DOCTYPE declaration?

Bruce
-- 
perl -e 'print unpack("u30","D0G)U8V4\@4VYY9&5R\"F)R=6-E+G-N>61E<D\!G;6%I;\"YC;VT*"
);'

Apache ActiveMQ - http://activemq.org/
Apache Camel - http://activemq.org/camel/
Apache ServiceMix - http://servicemix.org/

Blog: http://bruceblog.org/

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Thu, Nov 13, 2008 at 1:28 PM, Bruce Snyder <br...@gmail.com> wrote:
> On Thu, Nov 13, 2008 at 11:02 AM, Garrett Rooney
> <ro...@electricjellyfish.net> wrote:
>> On Thu, Nov 13, 2008 at 11:43 AM, Bruce Snyder <br...@gmail.com> wrote:
>>> On Thu, Nov 13, 2008 at 9:25 AM, Garrett Rooney
>>> <ro...@electricjellyfish.net> wrote:
>>>
>>>> Does the document actually have a <feed> element at its root?  That's
>>>> the kind of error you'd get if you parsed (for example) an Atom
>>>> <entry> instead of an Atom <feed>.
>>>
>>> Yep, it sure does. I'm just using the Google News Atom URL for my testing:
>>>
>>> http://news.google.com/?output=atom
>>
>> That's the problem.  That's an atom 0.3 feed, abdera only supports the
>> actual 1.0 standard.  The namespaces are different, which is why it
>> doesn't think it's the right kind of document.  It wouldn't be
>> impossible to add support for 0.3, but it doesn't do it yet.
>
> Damn :-(.
>
> Would it be difficult for Abdera to support other Atom versions by
> just poking the feed and then deciding which parser version to use?

Well, I know you could do it by adding the old elements as essentially
an extension, but that's a fair amount of work, as you'd be adding a
lot of classes and essentially duplicating a lot of work that's
already done for the 1.0 version.  I'm not sure if there's an easier
way; maybe convincing the existing code to accept either namespace
would work.  It's hard to say what the cost/benefit would be here, as
Atom 0.3 does seem to be going away relatively quickly.

-garrett

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Bruce Snyder <br...@gmail.com>.

On Thu, Nov 13, 2008 at 11:39 AM, Adam Constabaris
<ad...@clownsinmycoffee.net> wrote:
>> Would it be difficult for Abdera to support other Atom versions by
>> just poking the feed and then deciding which parser version to use?
>
> If you're not actually using AtomPub (read/write), but just want to read
> Atom 0.3 (and 1.0, and RSS-*) feeds, you might want to give ROME
> (https://rome.dev.java.net) a shot.   If you can use Python,
> http://feedparser.org/ is probably your best bet.

Yeah, I've already used Rome for this task and it works flawlessly.

> I know there's precedent for supporting something other than AtomPub in
> Abdera, but truth be told, if there's work to be done here I'd rather see it go
> towards lobbying this "google" web site to get with the times =)

Agreed, but I'm not sure exactly how one goes about lobbying Google ;-).

Bruce

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Adam Constabaris <ad...@clownsinmycoffee.net>.

 > Would it be difficult for Abdera to support other Atom versions by
 > just poking the feed and then deciding which parser version to use?

If you're not actually using AtomPub (read/write), but just want to read 
Atom 0.3 (and 1.0, and RSS-*) feeds, you might want to give ROME 
(https://rome.dev.java.net) a shot.   If you can use Python, 
http://feedparser.org/ is probably your best bet.

I know there's precedent for supporting something other than AtomPub in
Abdera, but truth be told, if there's work to be done here I'd rather see it
go towards lobbying this "google" web site to get with the times =)

cheers,

AC

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Bruce Snyder <br...@gmail.com>.

On Thu, Nov 13, 2008 at 11:02 AM, Garrett Rooney
<ro...@electricjellyfish.net> wrote:
> On Thu, Nov 13, 2008 at 11:43 AM, Bruce Snyder <br...@gmail.com> wrote:
>> On Thu, Nov 13, 2008 at 9:25 AM, Garrett Rooney
>> <ro...@electricjellyfish.net> wrote:
>>
>>> Does the document actually have a <feed> element at its root?  That's
>>> the kind of error you'd get if you parsed (for example) an Atom
>>> <entry> instead of an Atom <feed>.
>>
>> Yep, it sure does. I'm just using the Google News Atom URL for my testing:
>>
>> http://news.google.com/?output=atom
>
> That's the problem.  That's an atom 0.3 feed, abdera only supports the
> actual 1.0 standard.  The namespaces are different, which is why it
> doesn't think it's the right kind of document.  It wouldn't be
> impossible to add support for 0.3, but it doesn't do it yet.

Damn :-(.

Would it be difficult for Abdera to support other Atom versions by
just poking the feed and then deciding which parser version to use?

Bruce

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Thu, Nov 13, 2008 at 11:43 AM, Bruce Snyder <br...@gmail.com> wrote:
> On Thu, Nov 13, 2008 at 9:25 AM, Garrett Rooney
> <ro...@electricjellyfish.net> wrote:
>
>> Does the document actually have a <feed> element at its root?  That's
>> the kind of error you'd get if you parsed (for example) an Atom
>> <entry> instead of an Atom <feed>.
>
> Yep, it sure does. I'm just using the Google News Atom URL for my testing:
>
> http://news.google.com/?output=atom

That's the problem.  That's an Atom 0.3 feed; Abdera only supports the
actual 1.0 standard.  The namespaces are different, which is why it
doesn't think it's the right kind of document.  It wouldn't be
impossible to add support for 0.3, but it doesn't do it yet.

-garrett
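
A minimal sketch of the namespace check described above, using only the
JDK's StAX API rather than Abdera. The class and method names
(AtomVersionSniffer, sniff) and the returned labels are hypothetical;
the two namespace URIs are the published Atom 1.0 and Atom 0.3 ones. It
reads only as far as the root element, so a caller could use the result
to decide which parser to hand the feed to.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Abdera.
class AtomVersionSniffer {
    static final String ATOM_10_NS = "http://www.w3.org/2005/Atom";
    static final String ATOM_03_NS = "http://purl.org/atom/ns#";

    /** Returns "atom-1.0", "atom-0.3", or "unknown" based on the root element only. */
    static String sniff(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Skip comments, processing instructions, etc. before the root.
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    String ns = reader.getNamespaceURI(); // may be null
                    boolean isFeed = "feed".equals(reader.getLocalName());
                    if (isFeed && ATOM_10_NS.equals(ns)) {
                        return "atom-1.0";
                    }
                    if (isFeed && ATOM_03_NS.equals(ns)) {
                        return "atom-0.3";
                    }
                    return "unknown"; // e.g. an <entry> or an HTML page
                }
                return "unknown";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to deal with a checked exception.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

With this sketch, a document whose root is
<feed xmlns="http://purl.org/atom/ns#"> is reported as "atom-0.3", and
anything whose root is not a <feed> in one of the two namespaces is
reported as "unknown".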

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Bruce Snyder <br...@gmail.com>.

On Thu, Nov 13, 2008 at 9:25 AM, Garrett Rooney
<ro...@electricjellyfish.net> wrote:

> Does the document actually have a <feed> element at its root?  That's
> the kind of error you'd get if you parsed (for example) an Atom
> <entry> instead of an Atom <feed>.

Yep, it sure does. I'm just using the Google News Atom URL for my testing:

http://news.google.com/?output=atom

Bruce

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On Thu, Nov 13, 2008 at 11:22 AM, Bruce Snyder <br...@gmail.com> wrote:
> On Thu, Nov 13, 2008 at 3:44 AM, James Abley <ja...@gmail.com> wrote:
>
>> You're pulling down Atom feeds that have an html DOCTYPE? Are you sure
>> that they're valid Atom feeds? What does the feedvalidator [1] say?
>>
>> Cheers,
>>
>> James
>>
>> [1] http://www.feedvalidator.org/
>
> Thanks, James. Your suggestion made me realize that the URL was
> incorrect. Now I have the correct Atom URL and I'm getting an error in
> the processing. Please see the code block and error below:
>
> for (int i = 0; i < uris.length; ++i) {
>        String uri = (String) uris[i];
>        ClientResponse resp = client.get(uri);
>        if (resp.getType() == ResponseType.SUCCESS) {
>                Document<Feed> doc = resp.getDocument();
>                Feed feed = doc.getRoot(); // error occurs here
>                LOG.info(feed.getTitle());
>
>                for (Entry entry : feed.getEntries()) {
>                  LOG.info("\t" + entry.getTitle());
>                }
>        } else {
>                System.out.println("Failure");
>        }
> }
>
>
> Exception in thread "main" java.lang.ClassCastException:
> org.apache.abdera.parser.stax.FOMExtensibleElement
>        at com.sonatype.feedeater.FeedEater.grabUris(FeedEater.java:53)
>        at com.sonatype.feedeater.FeedEater.run(FeedEater.java:41)
>        at com.sonatype.feedeater.FeedEater.main(FeedEater.java:34)
>
>
> This could very well be due to my lack of knowledge of Abdera and Atom
> feeds in general. Any suggestions are appreciated.

Does the document actually have a <feed> element at its root?  That's
the kind of error you'd get if you parsed (for example) an Atom
<entry> instead of an Atom <feed>.

-garrett
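
One possible reason the exception surfaces at getRoot() rather than at
getDocument(): Java generics are erased, so the cast to Feed is only
checked where the result is assigned. The sketch below models this with
stand-in types. Element, Feed, GenericElement, Document and parse are
hypothetical and are not Abdera's classes; it assumes Abdera's
Document<T> behaves like an ordinary generic class. It also shows an
instanceof guard that avoids the exception.

```java
import java.util.Optional;

// Stand-in types for illustration only; these are NOT Abdera's classes.
class RootTypeDemo {
    interface Element {}
    interface Feed extends Element { String getTitle(); }
    static final class GenericElement implements Element {}

    static final class Document<T extends Element> {
        private final Element root;
        Document(Element root) { this.root = root; }

        @SuppressWarnings("unchecked")
        T getRoot() {
            // Unchecked cast: T is erased to Element, so nothing is verified here.
            return (T) root;
        }
    }

    // The caller picks T; the parser cannot check it against the real root type.
    static <T extends Element> Document<T> parse(Element root) {
        return new Document<T>(root);
    }

    /** Guarded access: test the runtime type of the root before treating it as a Feed. */
    static Optional<Feed> rootAsFeed(Document<? extends Element> doc) {
        Element root = doc.getRoot();
        return (root instanceof Feed) ? Optional.of((Feed) root) : Optional.empty();
    }

    /** Reproduces the failure mode: true if the unguarded assignment throws. */
    static boolean uncheckedAccessThrows() {
        Document<Feed> doc = parse(new GenericElement()); // no error yet
        try {
            Feed feed = doc.getRoot(); // compiler-inserted cast to Feed fails here
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

In the stand-in model, uncheckedAccessThrows() returns true, while
rootAsFeed(...) returns an empty Optional for the same document instead
of throwing.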

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by Bruce Snyder <br...@gmail.com>.

On Thu, Nov 13, 2008 at 3:44 AM, James Abley <ja...@gmail.com> wrote:

> You're pulling down Atom feeds that have an html DOCTYPE? Are you sure
> that they're valid Atom feeds? What does the feedvalidator [1] say?
>
> Cheers,
>
> James
>
> [1] http://www.feedvalidator.org/

Thanks, James. Your suggestion made me realize that the URL was
incorrect. Now I have the correct Atom URL and I'm getting an error in
the processing. Please see the code block and error below:

for (int i = 0; i < uris.length; ++i) {
	String uri = (String) uris[i];
	ClientResponse resp = client.get(uri);
	if (resp.getType() == ResponseType.SUCCESS) {
		Document<Feed> doc = resp.getDocument();
		Feed feed = doc.getRoot(); // error occurs here
		LOG.info(feed.getTitle());
		
		for (Entry entry : feed.getEntries()) {
		  LOG.info("\t" + entry.getTitle());
		}
	} else {
		System.out.println("Failure");
	}
}


Exception in thread "main" java.lang.ClassCastException:
org.apache.abdera.parser.stax.FOMExtensibleElement
	at com.sonatype.feedeater.FeedEater.grabUris(FeedEater.java:53)
	at com.sonatype.feedeater.FeedEater.run(FeedEater.java:41)
	at com.sonatype.feedeater.FeedEater.main(FeedEater.java:34)


This could very well be due to my lack of knowledge of Abdera and Atom
feeds in general. Any suggestions are appreciated.

Bruce

Re: DOCTYPE declaration causing WstxUnexpectedCharException

Posted by James Abley <ja...@gmail.com>.

2008/11/13 Bruce Snyder <br...@gmail.com>:
> I'm using the Abdera API to grab Atom feeds. I've tried a few
> different Atom feeds and I'm getting the following exception with all
> of them:
>
> ---------------------------------------------------------------------------------------------------
> Exception in thread "main" org.apache.abdera.parser.ParseException:
> org.apache.abdera.parser.ParseException:
> com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '-'
> (code 45) in external DTD subset; expected closing '>' after ENTITY
> declaration
>  at [row,col,system-id]: [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>  from [row,col {unknown-source}]: [1,1]
>        at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:132)
>        at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:96)
>        at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:74)
>        at com.sonatype.feedeater.FeedEater.grabUris(FeedEater.java:52)
>        at com.sonatype.feedeater.FeedEater.run(FeedEater.java:41)
>        at com.sonatype.feedeater.FeedEater.main(FeedEater.java:34)
> Caused by: org.apache.abdera.parser.ParseException:
> com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '-'
> (code 45) in external DTD subset; expected closing '>' after ENTITY
> declaration
>  at [row,col,system-id]: [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>  from [row,col {unknown-source}]: [1,1]
>        at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:260)
>        at org.apache.abdera.parser.stax.FOMBuilder.getFomDocument(FOMBuilder.java:333)
>        at org.apache.abdera.parser.stax.FOMParser.getDocument(FOMParser.java:72)
>        at org.apache.abdera.parser.stax.FOMParser.parse(FOMParser.java:207)
>        at org.apache.abdera.parser.stax.FOMParser.parse(FOMParser.java:145)
>        at org.apache.abdera.protocol.client.AbstractClientResponse.getDocument(AbstractClientResponse.java:119)
>        ... 5 more
> Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected
> character '-' (code 45) in external DTD subset; expected closing '>'
> after ENTITY declaration
>  at [row,col,system-id]: [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>  from [row,col {unknown-source}]: [1,1]
>        at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:623)
>        at com.ctc.wstx.dtd.FullDTDReader.throwDTDUnexpectedChar(FullDTDReader.java:2013)
>        at com.ctc.wstx.dtd.FullDTDReader.parseEntityValue(FullDTDReader.java:1533)
>        at com.ctc.wstx.dtd.FullDTDReader.handleEntityDecl(FullDTDReader.java:2419)
>        at com.ctc.wstx.dtd.FullDTDReader.handleDeclaration(FullDTDReader.java:2075)
>        at com.ctc.wstx.dtd.FullDTDReader.parseDirective(FullDTDReader.java:720)
>        at com.ctc.wstx.dtd.FullDTDReader.parseDTD(FullDTDReader.java:599)
>        at com.ctc.wstx.dtd.FullDTDReader.readExternalSubset(FullDTDReader.java:457)
>        at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:478)
>        at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
>        at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3349)
>        at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
>        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
>        at org.apache.abdera.parser.stax.FOMBuilder.getNextElementToParse(FOMBuilder.java:163)
>        at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:187)
>        ... 10 more
> ---------------------------------------------------------------------------------------------------
>
> The errors seem to occur from the call to the
> ClientResponse.getDocument(). As far as I can tell, the Abdera API is
> having trouble with the DOCTYPE declaration and is trying to fetch the
> strict.dtd. Is there a way to work around the DOCTYPE declaration?
>
> Bruce

Hi Bruce,

You're pulling down Atom feeds that have an HTML DOCTYPE? Are you sure
that they're valid Atom feeds? What does the feedvalidator [1] say?

Cheers,

James

[1] http://www.feedvalidator.org/
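
To check what a URL really returns without tripping over the DOCTYPE, a
sketch at the plain StAX level can turn DTD support off and read the
root element name. This does not go through Abdera, and whether
Abdera's parser exposes an equivalent option is not covered in this
thread. RootElementPeek and rootName are hypothetical names;
XMLInputFactory.SUPPORT_DTD and IS_SUPPORTING_EXTERNAL_ENTITIES are
standard StAX properties.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Abdera.
class RootElementPeek {
    /** Returns the local name of the root element without loading any external DTD. */
    static String rootName(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs, so an external subset such as strict.dtd is never fetched.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                throw new IllegalArgumentException("No root element found");
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to deal with a checked exception.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

A response that starts with an HTML 4 DOCTYPE would be expected to yield
a root name such as "html" rather than "feed", which is a quick hint
that the URL points at a web page and not at an Atom document.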