Posted to user@struts.apache.org by Chris Pratt <th...@gmail.com> on 2008/03/17 04:39:07 UTC

[OT] XML Preprocessing

Sorry I missed the normal Friday free-for-all with my Off Topic
question, but I'm hoping someone around here has already solved the
problem I'm staring at.

I am trying to pre-process a stream of HTML/XML.  My first thought was
to just use SAX (with TagSoup for the HTML) and watch for the tokens I
needed to modify while passing the rest through, but all the XML tools
I can find are geared towards processing XML, not pre-processing it.
So they help you out by automatically converting entities to their
values and other things that completely get in the way of
pre-processing.  Has anybody else had to solve this problem?  If so,
any pointers would be GREATLY appreciated.  Thanks.
  (*Chris*)
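
The entity behaviour described above is easy to reproduce. The following is only a minimal sketch: it uses Python's built-in xml.sax module as a stand-in for a Java SAX parser (an assumption for illustration, not the setup from this post), and TextCollector and sax_text are hypothetical names. By the time the handler sees the text, the parser has already replaced the entity with its value, so the original spelling cannot be passed through.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the character data SAX reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


def sax_text(xml_bytes):
    """Parse a small document and return the text the handler received."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# The source says "&amp;", but the handler only ever sees "&".
print(sax_text(b"<p>Tom &amp; Jerry</p>"))
```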

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@struts.apache.org
For additional commands, e-mail: user-help@struts.apache.org


Re: [OT] XML Preprocessing

Posted by Dave Newton <ne...@yahoo.com>.
--- Chris Pratt <th...@gmail.com> wrote:
> What I'm trying to do is read a stream of HTML and make changes to
> certain tags, like adding a target="_blank" to the <a> tags and
> setting the src attributes for <img>, <link> and others so they can't
> be loaded, for a mail viewer web application.  I'd prefer not to
> change the stream in any ways other than the intentional changes, so
> that I don't run into any weird bugs down the line.  But I haven't
> found a good technique to do that yet.

While I'd imagine there are HTML libraries that don't convert entities (you
might check out http://htmlparser.sourceforge.net/, at least), you can always
use regular expressions. If your input is well-formed I'd imagine that XSLT
would also work, but then you'd have to use XSLT, and we'd all stand around
and laugh and point.
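
As a rough illustration of the regular-expression route, here is a minimal sketch in Python (chosen for brevity; the same patterns should port to java.util.regex). The names A_TAG, IMG_SRC, add_target and preprocess, and the data-blocked-src attribute, are all hypothetical. It assumes attribute values never contain a literal ">" and ignores self-closing <a/> tags, so treat it as a starting point rather than a robust HTML rewriter.

```python
import re

# An opening <a ...> tag. \b keeps <abbr>, <area>, etc. from matching.
A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)

# The src attribute of an <img> tag; group 1 is everything before "src".
IMG_SRC = re.compile(r"(<img\b[^>]*?\s)src(\s*=)", re.IGNORECASE)


def add_target(match):
    """Append target="_blank" unless the tag already has a target."""
    tag = match.group(0)
    if re.search(r"\starget\s*=", tag, re.IGNORECASE):
        return tag
    return tag[:-1] + ' target="_blank">'


def preprocess(html):
    """Rewrite only the matched tags; all other text is left untouched."""
    html = A_TAG.sub(add_target, html)
    # Renaming the attribute stops the browser from fetching the image.
    html = IMG_SRC.sub(r"\g<1>data-blocked-src\g<2>", html)
    return html


sample = '<p>Tom &amp; Jerry <a href="x?a=1&amp;b=2">link</a> <img src="http://example.com/p.gif"></p>'
print(preprocess(sample))
```

Because the substitutions only touch the matched tags, entities such as &amp; elsewhere in the stream come out exactly as they went in.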

I don't know if you're trying to do this from within a Java application or as
a standalone tool, but if standalone, I'd probably just use one of the
[J]Ruby alternatives; I do a lot of massaging with a combination of regex and
some of the XML/HTML libraries.

Dave




Re: [OT] XML Preprocessing

Posted by Chris Pratt <th...@gmail.com>.
On Mon, Mar 17, 2008 at 8:02 AM, Roger Varley
<ro...@googlemail.com> wrote:
> I've never tried to do this since, normally, you want the XML processor to
>  handle entities such as the & symbol - and if either the input XML or output
>  XML contains these symbols unaltered then you don't have legal XML.
>
>  If you really need to leave these unprocessed, then perhaps you can replace
>  the SAX EntityResolver with your own implementation?
>
>  Perhaps if you could explain what you're trying to do?
>

Unfortunately the EntityResolver just finds files containing external
entities, so that doesn't seem to help much.

What I'm trying to do is read a stream of HTML and make changes to
certain tags, like adding a target="_blank" to the <a> tags and
setting the src attributes for <img>, <link> and others so they can't
be loaded, for a mail viewer web application.  I'd prefer not to
change the stream in any ways other than the intentional changes, so
that I don't run into any weird bugs down the line.  But I haven't
found a good technique to do that yet.
  (*Chris*)
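
One way to get a SAX-like event stream without entity expansion is a tokenizer that reports entities as separate events and exposes the raw tag text. The sketch below uses Python's standard html.parser.HTMLParser with convert_charrefs=False as an example of that idea; it is an illustration, not something used in this thread, and PassThroughRewriter and rewrite are hypothetical names. It is not perfectly lossless: end tags are rebuilt from the lower-cased tag name, and a reference written without its trailing semicolon gains one.

```python
from html.parser import HTMLParser


class PassThroughRewriter(HTMLParser):
    """Hypothetical rewriter: echo the input, adding target to <a> tags."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entities as
        # separate events instead of decoding them inside the text.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the tag exactly as written
        if tag == "a" and not any(name == "target" for name, _ in attrs):
            raw = raw[:-1] + ' target="_blank">'
        self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)  # re-emit the entity undecoded

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite(html):
    parser = PassThroughRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rewrite('<p>Tom &amp; Jerry &#169; <a href="x?a=1&amp;b=2">link</a></p>'))
```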



Re: [OT] XML Preprocessing

Posted by Roger Varley <ro...@googlemail.com>.
I've never tried to do this since, normally, you want the XML processor to 
handle entities such as the & symbol - and if either the input XML or output 
XML contains these symbols unaltered then you don't have legal XML.

If you really need to leave these unprocessed, then perhaps you can replace 
the SAX EntityResolver with your own implementation?

Perhaps if you could explain what you're trying to do?

Regards

On Monday 17 March 2008 05:39:07 Chris Pratt wrote:
> Sorry I missed the normal Friday free-for-all with my Off Topic
> question, but I'm hoping someone around here has already solved the
> problem I'm staring at.
>
> I am trying to pre-process a stream of HTML/XML.  My first thought was
> to just use SAX (with TagSoup for the HTML) and watch for the tokens I
> needed to modify while passing the rest through, but all the XML tools
> I can find are geared towards processing XML, not pre-processing it.
> So they help you out by automatically converting entities to their
> values and other things that completely get in the way of
> pre-processing.  Has anybody else had to solve this problem?  If so,
> any pointers would be GREATLY appreciated.  Thanks.
>   (*Chris*)
