Posted to users@cocoon.apache.org by Justin Fagnani-Bell <ju...@paraliansoftware.com> on 2002/08/11 00:03:19 UTC

Arrgh, more XML/HTML problems now it's '&'

Hi again,

   <warning> this is a long post </warning>

   I'm still working on HTML forms where the user (me for the moment:) is 
supposed to input HTML into a text area that will be stored in an XML 
format. I'm still having problems, so I haven't written a SUMMARY post...

My new problem occurred last night when I was testing the system and put 
in an anchor tag with a URL that has request parameters, like this:

<a href="http://www.something.net/apage.jsp?p1=hi&p2=bye">link</a>

Well, when I hit submit the form is supposed to come back filled out, 
but instead I get an error that states "the entity 'p2' must end with 
a ';'".

So I did some searching on w3.org and sure enough, URLs in XHTML have 
to use '&amp;' instead of '&'. Arrgh, I know this will cause problems 
once people who are used to normal HTML start using this. I'm 
considering writing a filter that will escape illegal characters on the 
way in and un-escape them going back to the user, but that seems like a 
bit of a pain. Combined with the problems I'm having making people 
type XML-compliant HTML in the first place, I'm wondering if there's a 
completely different way I could do this.
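
A minimal sketch of the inbound half of such a filter, assuming only stray ampersands need fixing. The class and method names are made up for illustration and are not part of Cocoon. It escapes any '&' that does not already start a named, decimal, or hex reference, and leaves existing references such as '&amp;' alone:

```java
import java.util.regex.Pattern;

public class AmpersandFilter {
    // '&' NOT followed by "name;", "#123;" or "#x1F;" is treated as a bare ampersand.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    /** Replaces every bare '&' with "&amp;"; existing references are untouched. */
    public static String escapeBareAmpersands(String input) {
        return BARE_AMP.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String typed =
            "<a href=\"http://www.something.net/apage.jsp?p1=hi&p2=bye\">link</a>";
        System.out.println(escapeBareAmpersands(typed));
    }
}
```

Two caveats: a named reference like '&nbsp;' passes through unchanged even though a plain XML parser without the HTML DTD will still reject it, and the un-escape direction is not shown because it cannot tell an '&amp;' the user typed from one the filter added.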

I'm sure someone else out there has come across these problems before. 
They seem inevitable when building a webapp that uses XML on the backend 
and lets users edit some content. The users only marginally know 
HTML in the first place and can't be expected to follow the rules 
correctly every time. The app, after all, is supposed to be easy to use.

I would love to start some discussion on different ideas for handling 
these types of problems. They must be common among Cocoon users, and 
maybe we can come up with a set of solutions (HOW-TO's, Java helper 
classes, taglibs) to make life easier on Cocoon developers and end-users.

Here's my little list of requirements, issues, and assumptions when 
dealing with forms, user input, and xml.

1) My users are used to HTML, not XML
2) My users are not fail proof, and are probably prone to occasional 
mistakes
3) Ideally I want them to be able to input HTML (not XML compliant), 
plain text, or XML (not HTML, but any XML; this is actually preferred, 
but sometimes users are just entering a news item or a BBS post, and it 
seems reasonable to allow them to use HTML for formatting rather than 
inventing my own XML dialect)
4) The data is going to be in an XML document/SAX stream at some point
   (either stored that way, or stored in a database and turned into xml 
through a generator)
5) sometimes I want to run xsl transformations on the data when it is 
output.
6) when editing the data, I'd like to have it appear exactly as the user 
typed
7) but I'd also like to have the ability to clean it up (as an option)
8) The browsers like HTML 4 much better than XHTML, therefore the pages 
I send them work better if I use the HTMLSerializer

Here are some problems I've encountered so far.

1) users don't follow XML rules very well (goes along with point 1)
2) the HTMLSerializer changes the user's data by turning <br/> into <br>, 
etc
3) the XMLSerializer changes the user's data by turning 
<textarea></textarea> into <textarea/>, etc
4) bad user input will cause SAXExceptions if it's not enclosed in CDATA 
sections

(Oh, to clarify here: I typically have two pages which show the data. 
One is the 'edit' page with the form; the other is the 'viewing page', 
where the data actually shows up. The HTMLSerializer is no problem 
on the viewing page, just on the editing page.)

Some of these points interfere with some solutions. For example, I could 
wrap the data in a CDATA section to get around XML compliance, but then 
I wouldn't be able to run XSL transformations on it (correct me if I'm 
wrong anywhere). Maybe I could check if the data is xml compliant and 
wrap it only if it isn't.
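
The "wrap it only if it isn't" idea could be sketched roughly like this, using the JAXP SAX parser that ships with Java. The class name, method names, and the <wrapper> element are hypothetical. The fragment is parsed inside a dummy root element; if that fails, the raw text is put in a CDATA section instead:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentWrapper {
    /** True if the fragment parses as element content inside a dummy root. */
    public static boolean isWellFormedFragment(String fragment) {
        String wrapped = "<wrapper>" + fragment + "</wrapper>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wrapped)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Could not run the XML parser", e);
        }
    }

    /** Returns the fragment unchanged if well-formed, else wrapped in CDATA. */
    public static String wrapIfNeeded(String fragment) {
        if (isWellFormedFragment(fragment)) {
            return fragment;
        }
        // A literal "]]>" would end the section early, so split it across two sections.
        return "<![CDATA[" + fragment.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

The dummy-root check is deliberately crude: input such as "</wrapper><wrapper>" would slip through. It also means the stored data ends up in two shapes (real markup or opaque text), so any later XSL step has to cope with both.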

Here are some ideas for solutions:

1) Create a new HTMLSerializer that can selectively determine which tags 
it will convert into HTML and which it will leave alone. This way you 
could specify that all textarea tags and their contents shouldn't be 
touched (I would think this would be a reasonable default feature anyway)
2) Create a jTidy like program that will turn HTML into XHTML, but work 
for fragments (jTidy seems to only output complete HTML documents)
3) Create a class that can find an XML error, and report it nicely back 
to the user so they can fix it. (I recall a demo with Cocoon 1.8.x that 
had something like this...)
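
For idea 3), a rough sketch of what such a class might look like with plain JAXP (this is not the Cocoon 1.8.x demo, which I haven't tracked down; the names here are hypothetical). It relies on SAXParseException carrying a line number, a column number, and a message:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentErrorReporter {
    /** Returns null when the fragment parses, otherwise a message for the user. */
    public static String describeError(String fragment) {
        // The dummy root sits on its own line, so parser line N+1 is user line N.
        String wrapped = "<wrapper>\n" + fragment + "\n</wrapper>";
        int userLines = fragment.split("\n", -1).length;
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wrapped)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Clamp, because an unclosed tag is only noticed at the closing wrapper.
            int line = Math.min(userLines, Math.max(1, e.getLineNumber() - 1));
            return "Line " + line + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("Could not run the XML parser", e);
        }
    }
}
```

The position is approximate (parsers tend to report where they noticed the problem, not where the user made the mistake), and the message text is whatever the underlying parser produces, so it may need rewording before it is friendly enough for end users.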

Hmm, these three things might do it. The new serializer would work for 
editing, and the Tidy-like class would work for either storing the data as 
XML or just viewing it as XML. I think I have an idea on how to do the 
serializer, but it wouldn't rely on a transformer like the current one. 
I looked at the code for jTidy and there's a ton of classes, so I've yet 
to fully comprehend how it works. It might already be able to do what I 
want, and like I said, I saw something similar to 3) a year or so ago...

ok, that's my thoughts...

Justin



---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>


Re: Arrgh, more XML/HTML problems now it's '&'

Posted by MJ Ray <ma...@cloaked.freeserve.co.uk>.
Justin Fagnani-Bell wrote:
> So I do some searching on w3.org and sure enough URLs in XHTML have 
> to use '&amp;' instead of '&'. Arrgh, I know this will cause problems 
> once people who are used to normal HTML start using this.  [...]

I remember it being the same in HTML attributes too.  The HTML 2.0 spec even
explicitly says so.  For a blast from the past (well, 1995), see
http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.4
Maybe the "normal HTML" wasn't so normal...

I also remember a particularly annoying bug with an early version of
Netscape (or maybe a late Mosaic) that did something odd with entity
expansions of URLs typed into the location bar, as a side-effect of the
above requirement.

Hope that helps,

-- 
MJR|
---'
|-----[ Luminas internet applications http://www.luminas.co.uk/ ]-----|



RE: Arrgh, more XML/HTML problems now it's '&'

Posted by Vadim Gritsenko <va...@verizon.net>.
> From: Justin Fagnani-Bell [mailto:justin@paraliansoftware.com]
> 
> Hi again,
...
>    I'm still working on HTML forms where the user (me for the moment:) is
> supposed to input HTML into a text area that will be stored in an XML
> format. I'm still having problems, so I haven't written a SUMMARY post...
...
> 2) Create a jTidy like program that will turn HTML into XHTML, but work
> for fragments (jTidy seems to only output complete HTML documents)

It won't be hard to run a simple XPath to get rid of the html/body tags.

I don't know whether it is easy to extend Tidy...
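
A rough sketch of the XPath idea, assuming the tidied output is already a well-formed XHTML document held in a string, with no DOCTYPE pointing at an external DTD. It uses only the JAXP classes bundled with Java; the class and method names are made up for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BodyExtractor {
    /** Returns the serialized children of html/body, without html, head, or body. */
    public static String bodyContent(String xhtmlDocument) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtmlDocument)));

            // local-name() matches whether or not the XHTML namespace is declared.
            NodeList children = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "/*[local-name()='html']/*[local-name()='body']/node()",
                doc, XPathConstants.NODESET);

            // Identity transform, used here only as a serializer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (int i = 0; i < children.getLength(); i++) {
                serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract body content", e);
        }
    }
}
```

If the input declares the XHTML namespace on the html element, each serialized child will carry its own xmlns declaration, which may or may not be wanted. Inside a Cocoon pipeline the same XPath could probably live in a small stylesheet instead.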


Regards,
Vadim





RE: Arrgh, more XML/HTML problems now it's '&'

Posted by Conal Tuohy <co...@paradise.net.nz>.
Justin wrote:

> 2) Create a jTidy like program that will turn HTML into XHTML, but work
> for fragments (jTidy seems to only output complete HTML documents)

I think this is a good approach. I'm trying to deal with the same thing when
reading emails: an email may contain a section formatted in HTML, so it's
necessary to either treat it as CDATA or to parse it specially. You're
right - if you treat it as CDATA then you can't do anything useful with it
in XSLT. But using JTidy is not hard. See the classes HTMLGenerator,
AbstractStreamSource and XMLUtils.

But where to insert JTidy into the pipeline?

At the generator stage? How does the text area input data enter the Cocoon
pipeline? RequestGenerator?

Or perhaps you should write a transformer that can transform a CDATA node
containing HTML into an XHTML SAX stream?

Con




Re: Arrgh, more XML/HTML problems now it's '&'

Posted by Sheraz Khan <Sh...@Valorious.com>.
"&amp;"... wow, no wonder my URLs never worked. I just used the "&", 
then I gave up.
Thanks for the new info, it will really help.

On Saturday, August 10, 2002, at 11:03 PM, Justin Fagnani-Bell wrote:

> [...]

