Posted to user@commons.apache.org by rjn <ib...@gmail.com> on 2006/07/27 15:59:04 UTC

Ignoring Specific Tags with Digester

Hi Everyone,

I'm trying to write a syndication feed parser using Digester, but I've
run into a stumbling block.  Many feeds have HTML in the entries, such
as <a>, <br>, etc.  Digester parses these as XML tags, which leaves
blanks in the data I pull out.  Is there a way to tell Digester to
ignore specific tags (in this case, the HTML tags)?

Thanks,
RJ

-- 
em: ibgeek@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org

Re: Ignoring Specific Tags with Digester

Posted by Paul J DeCoursey <pa...@decoursey.net>.

rjn wrote:
> Thanks for the responses.  Yeah, so the XML file is valid, it's just
> that some of the tags have HTML embedded within them.  For example:
>
> <entry><p>This is text.</p></entry>
>
> So Digester sees this as:
> entry/p
>
> rather than just entry.  I imagine I could just download the XML
> documents and, knowing the structure, search for the entry fields and
> then cut out the text, then store that separately.  I was just
> hoping there was a way to list tags to ignore, for example <p>,
> <br>, etc.
>
> Thanks anyway,
>
>
Or tags to just copy as text.  I think Simon had your answer with
NodeCreateRule.  If I'm reading it correctly, it will create a
DocumentFragment from the node in question and its children, which you
could pass to an XSLT processor to serialize into the text you want,
either keeping the HTML you wish to keep or stripping the tags.
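
A rough sketch of the serialization half, assuming you already have the
DOM node for <entry> in hand.  With Digester that node would come from
NodeCreateRule (registered with something like
digester.addRule("feed/entry", new NodeCreateRule()), where the
"feed/entry" pattern is only a guess at your feed's structure).  Here I
parse the snippet with plain JAXP instead so the example runs with
nothing but the JDK.  The class and method names (EntryText, keepTags,
stripTags) are made up for illustration, and the identity Transformer
stands in for "an XSLT processor".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: turns the DOM node for <entry> back into text.
final class EntryText {

    /** Stand-in for NodeCreateRule: parses a snippet and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Serializes the children of the entry back to markup, keeping the HTML tags. */
    static String keepTags(Element entry) {
        try {
            // A Transformer created without a stylesheet performs an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (Node child = entry.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                identity.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize entry", e);
        }
    }

    /** Drops the markup and returns only the character data under the entry. */
    static String stripTags(Element entry) {
        return entry.getTextContent();
    }

    public static void main(String[] args) {
        Element entry = parse("<entry><p>This is text.</p></entry>");
        System.out.println(keepTags(entry));  // prints <p>This is text.</p>
        System.out.println(stripTags(entry)); // prints This is text.
    }
}
```

Either result could then be stored on your entry bean as a plain
String, so the rest of your Digester rules never see the HTML tags.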

Paul




Re: Ignoring Specific Tags with Digester

Posted by rjn <ib...@gmail.com>.

Thanks for the responses.  Yeah, so the XML file is valid, it's just
that some of the tags have HTML embedded within them.  For example:

<entry><p>This is text.</p></entry>

So Digester sees this as:
entry/p

rather than just entry.  I imagine I could just download the XML
documents and, knowing the structure, search for the entry fields and
then cut out the text, then store that separately.  I was just
hoping there was a way to list tags to ignore, for example <p>,
<br>, etc.

Thanks anyway,

-- 
em: ibgeek@gmail.com
