Posted to user@commons.apache.org by Torgeir Veimo <to...@pobox.com> on 2006/09/27 23:23:41 UTC

digester parsing with html content

I'm trying to use Digester to parse XML that was previously
parsed with JAXB 1.0-ea. Some of the content is XHTML fragments
inside the XML, e.g.

<body-text><xhtml>...</xhtml></body-text>

and I'd like to retrieve the content as a String bean property.
However, I'd like the parser to treat the content of body-text as
opaque. Currently it tries to parse it and chokes on entities such
as &oslash;.

Any clues on how I can configure digester, or more precisely, the  
underlying parser, to avoid these problems?

-- 
Torgeir Veimo
torgeir@pobox.com




---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org


Re: digester parsing with html content

Posted by Craig McClanahan <cr...@apache.org>.
On 9/27/06, Torgeir Veimo <to...@pobox.com> wrote:
>
> I'm trying to use Digester to parse XML that was previously
> parsed with JAXB 1.0-ea. Some of the content is XHTML fragments
> inside the XML, e.g.
>
> <body-text><xhtml>...</xhtml></body-text>
>
> and I'd like to retrieve the content as a String bean property.
> However, I'd like the parser to treat the content of body-text as
> opaque. Currently it tries to parse it and chokes on entities such
> as &oslash;.
>
> Any clues on how I can configure digester, or more precisely, the
> underlying parser, to avoid these problems?


One general strategy would be to declare, in the DOCTYPE of the surrounding
XML document that you are parsing, all of the entities that HTML defines by
default.  That way, they would simply be expanded (at the XML parsing
level) and not cause you any problems.
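
As a rough illustration of that strategy, the sketch below declares a single
HTML entity (oslash) in the internal DTD subset and parses the document with
the JDK's own DOM parser. The class name, the sample document, and the word
used as content are made up for this example. Digester is not used here, but
since it sits on top of a SAX parser, the same DOCTYPE declaration should be
honoured there as well. Only entity expansion is shown; it does not make
nested markup opaque.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityDoctypeSketch {

    // Hypothetical sample document: the internal DTD subset declares the
    // HTML entity &oslash; as the character reference for U+00F8.
    // A real document would declare every HTML entity it relies on.
    public static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE article [\n"
        + "  <!ENTITY oslash \"&#248;\">\n"
        + "]>\n"
        + "<article><body-text>sm&oslash;r</body-text></article>";

    // Parses the given XML and returns the text of the first body-text element.
    public static String bodyText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("body-text").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The entity reference has been expanded by the parser at this point.
        System.out.println(bodyText(XML));
    }
}
```

Without the ENTITY declaration, the parser would reject the document because
oslash is not one of the five entities predefined by XML.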

--
> Torgeir Veimo
> torgeir@pobox.com


Craig

Re: digester parsing with html content

Posted by Torgeir Veimo <to...@pobox.com>.
On 27 Sep 2006, at 22:23, Torgeir Veimo wrote:

> I'm trying to use Digester to parse XML that was previously
> parsed with JAXB 1.0-ea. Some of the content is XHTML fragments
> inside the XML, e.g.
>
> <body-text><xhtml>...</xhtml></body-text>
>
> and I'd like to retrieve the content as a String bean property.
> However, I'd like the parser to treat the content of body-text as
> opaque. Currently it tries to parse it and chokes on entities such
> as &oslash;.
>
> Any clues on how I can configure digester, or more precisely, the  
> underlying parser, to avoid these problems?

FYI, previously with JAXB, I was using this DTD:


<!ELEMENT article (title, lead-text?, body-text, ...)>
     <!ATTLIST article ... >

<!ELEMENT title (#PCDATA)>

<!ELEMENT lead-text (#PCDATA)>

<!ELEMENT body-text (#PCDATA)>

-- 
Torgeir Veimo
torgeir@pobox.com



