Posted to users@cocoon.apache.org by PSA <po...@posom.com> on 2000/04/06 00:06:48 UTC

Converting HTML to XML/XSL

We are converting a large site to XML using Xalan/Xerces/Cocoon and
would like to convert existing content.  We have a large number of
simple documents in HTML 3/4 to be converted to XML/XSL so that they can
be handled via XML/XSL tools.

It seems like there must be others who have encountered this problem. 
Does anyone have any suggestions for products or methods to automate
this process?

Thanks,
Paul Anguiano

Re: XSL Output

Posted by Matthew Cordes <mc...@maine.edu>.
Hello all, 2 questions...


1.	In an XSL document, how do I include other stylesheets?  I know what
	you're thinking - <xsl:include href="some_file"/>, but I'm confused
	about how exactly to do this. Should I have 2 separate well-formed
	stylesheets? (I assume yes.)  When I say <xsl:include...>, does that
	include the stylesheet's templates, or can I use it to dump text directly
	into place?  I want to do something like Apache's '#include'.

2.	I'm using xalan 1.0.0 and xerces 1.0.3 and am experiencing some strange
	behavior.  Here is a code fragment:

		<a name="{ALIAS}"></a>
		<b><xsl:value-of select="LASTNAME"/>
		<xsl:text>, </xsl:text>
		<xsl:value-of select="FIRSTNAME"/></b>	


	What I expect this to display is "<LASTNAME>, <FIRSTNAME>" , but for some 
	aggravating reason it always displays "<LASTNAME> <FIRSTNAME>,".  
	Notice that the comma is at the end !?!  Can anyone explain this 
	behavior? Or should I report it as a bug?


	Just in case I'm to blame here is the whole template:

<xsl:template match="PERSON">
    <tr>
        <td>
            <p>
                <a name="{ALIAS}"></a>
                <b><xsl:value-of select="LASTNAME"/>
                <xsl:text>, </xsl:text>
                <xsl:value-of select="FIRSTNAME"/></b>
                <br/>
                <xsl:apply-templates select="TITLE"/>
                <xsl:value-of select="BIOGRAPHY"/>
           </p>
           <table border="0" cellspacing="0" cellpadding="0">
                <xsl:apply-templates select="PHONE"/>
                <xsl:apply-templates select="EMAIL"/>
                <xsl:apply-templates select="WWW"/>
                <tr>
                    <td><br/> <!-- blank row -->
                    </td>
                </tr>
           </table>
        </td>
    </tr>
</xsl:template>

Thanks for reading this

-matt
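
On question 1: xsl:include is a top-level element that pulls in the templates and other top-level elements of a second, complete stylesheet; it does not paste raw text into place the way a server-side '#include' does. The sketch below illustrates this with the JDK's built-in JAXP transformer, which is an assumption (the thread uses Xalan 1.0.0, whose API differs). The stylesheet name name.xsl, the STAFF wrapper element, and the class IncludeDemo are hypothetical. It also shows what a conforming XSLT 1.0 processor is expected to do with the name fragment from question 2: whitespace-only text between instructions is stripped, so the comma lands between the two names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IncludeDemo {
    static final String NS = "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'";

    // Hypothetical included stylesheet: a complete, well-formed stylesheet
    // holding the PERSON name fragment from the message above.
    static final String NAME_XSL =
        "<xsl:stylesheet version='1.0' " + NS + ">"
      + "<xsl:template match='PERSON'>"
      + "<b><xsl:value-of select='LASTNAME'/>\n"
      + "<xsl:text>, </xsl:text>\n"
      + "<xsl:value-of select='FIRSTNAME'/></b>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Main stylesheet: xsl:include appears as a child of xsl:stylesheet.
    static final String MAIN_XSL =
        "<xsl:stylesheet version='1.0' " + NS + ">"
      + "<xsl:include href='name.xsl'/>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<p><xsl:apply-templates select='STAFF/PERSON'/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Serve the included stylesheet from memory instead of the file system.
        factory.setURIResolver((href, base) ->
            "name.xsl".equals(href) ? new StreamSource(new StringReader(NAME_XSL)) : null);
        Transformer t = factory.newTransformer(new StreamSource(new StringReader(MAIN_XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<STAFF><PERSON><LASTNAME>Doe</LASTNAME>"
                   + "<FIRSTNAME>John</FIRSTNAME></PERSON></STAFF>";
        // Expected with the JDK transformer: <p><b>Doe, John</b></p>
        System.out.println(transform(xml));
    }
}
```

The PERSON template lives only in the included stylesheet, yet the main stylesheet's apply-templates finds it, which is the sense in which xsl:include "includes the templates". This says nothing about why Xalan 1.0.0 moved the comma; it only shows the output to compare against.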

	

Re: XSL Output

Posted by Mikael Ståldal <d9...@d.kth.se>.
In article <38...@bluecheese.co.uk>,
Darren Scott <ds...@bluecheese.co.uk> wrote:

>So - to expand, I would like to output validating XHTML rather than
>HTML. The issue is that a lot of browsers cannot cope with strict XHTML
>because they don't understand the trailing forward slashes on standalone
>tags (like <br/>) - they treat the slash as though it were part of the
>tagname.
>
>W3C (I'm sure everybody knows this) recommends putting a space character
>between the tag name and the slash to overcome this problem. The problem
>here is that there doesn't seem to be any way of outputting this type of
>backwards compatible tag with Cocoon.
>
>As I mentioned above, the HTML formatter will convert <br /> to <br> and
>the XML formatter will convert <br /> to <br/>

You need an XHTML formatter. For some strange reason, an XHTML
formatter is not included in Cocoon. However, it should be easy to
write your own XHTML formatter using the Xerces serializing classes
(in package org.apache.xml.serialize) which do support XHTML. Just take
the HTML formatter in Cocoon and modify it.

-- 
/****************************************************************\
* You have just read a message from Mikael Ståldal.              *
*                                                                *
* Remove "-ingen-reklam" from the address before mail replying.  *
\****************************************************************/
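
The behaviour under discussion can be reproduced outside Cocoon. The sketch below does not use the Xerces org.apache.xml.serialize classes mentioned above; as a stand-in it uses the JDK's JAXP identity transformer (an assumption) to show that the "html" output method writes <br> and the "xml" output method writes <br/>. The addCompatSpace helper and the class name BrSerializationDemo are hypothetical, and the string replacement is a deliberately naive workaround, not what an XHTML serializer does internally.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BrSerializationDemo {
    // Re-serialize a well-formed document with the given output method
    // ("html" or "xml") using an identity transformer.
    static String serialize(String xml, String method) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.METHOD, method);
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        identity.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // Naive post-processing step: put a space before every "/>" that lacks one.
    // It would also touch "/>" inside comments or CDATA sections.
    static String addCompatSpace(String xml) {
        return xml.replaceAll("(?<! )/>", " />");
    }

    public static void main(String[] args) throws Exception {
        String input = "<p>one<br />two</p>";
        System.out.println(serialize(input, "html"));                 // <br> form
        System.out.println(serialize(input, "xml"));                  // <br/> form
        System.out.println(addCompatSpace(serialize(input, "xml")));  // <br /> form
    }
}
```

A real XHTML formatter would make this choice while serializing rather than patching the text afterwards, which is why hooking into a serializer that already knows about XHTML is the cleaner route.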

Re: XSL Output

Posted by Stefano Mazzocchi <st...@apache.org>.
Darren Scott wrote:

> As I mentioned above, the HTML formatter will convert <br /> to <br> and
> the XML formatter will convert <br /> to <br/>
> 
> I am being stupid?

No, now I get it :) (sorry Tom, but could not understand your problem in
your previous mail)
 
> Can I write a formatting class that will solve the problem? If so, I
> would gladly do that and contribute it to the collective... It'd be nice
> if you could specify an XSLT output method of '(X)HTML' or something
> like that(?)

I'll write an XHTML formatter tonight and see if this helps. Is it ok
with you guys?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------



Re: XSL Output

Posted by Darren Scott <ds...@bluecheese.co.uk>.
Doh, forgot to attach - here it is....

Darren Scott wrote:
> 
> Donald Ball wrote:
> > > Can I write a formatting class that will solve the problem? If so, I
> > > would gladly do that and contribute it to the collective... It'd be nice
> > > if you could specify an XSLT output method of '(X)HTML' or something
> > > like that(?)
> >
> > What's the matter with just using the XML formatter? The fact that you
> > want <br /> instead of <br/>? I reckon you could patch the Xerces XML
> > formatter to do that instead.
> 
> Correctomundo - the fact that I don't want the majority existing
> browsers to break when I try to feed XML to them!
> 
> Ok, I downloaded the source last night and had a poke around. It seems
> the Xerces people have already thought of this, and have built in an
> XHTML output method, so all we need is an XHTMLFormatter that will hook
> into it.
> 
> So that's what I did - it was v. simple case of making a couple of
> adjustments to the existing HTMLFormatter class. In my setup, I added a
> formatting type of text/xhtml to cocoon.properties - this isn't strictly
> correct, because XHTML should have a mime-type of text/html - this
> causes problems when you specify the type in an XSL PI instead of in the
> XML because Cocoon seems to send 'text/xhtml' instead of querying the
> formatter for the mime-type which is set to 'text/html'.
> 
> I have attached it (along with the classfile for other users
> convenience) - perhaps somebody could add it to the CVS tree? Stefano,
> any chance of it being added to the project and included in the new
> release next week? I am convinced that Cocoon needs to be able to output
> backwards compatible XHTML until XML-unaware browsers die out.
> 
> I hope somebody finds it useful enough to include it in the project
> since it's my first contribution. Not much I know but if it's received
> well I might get CVS access and become a regular(ish) developer.
> 
> Regards,
> 
> Darren Scott
> Production Director
> bluecheese.co.uk

Re: XSL Output

Posted by Stefano Mazzocchi <st...@apache.org>.
Darren Scott wrote:
> 
> Donald Ball wrote:
> > > Can I write a formatting class that will solve the problem? If so, I
> > > would gladly do that and contribute it to the collective... It'd be nice
> > > if you could specify an XSLT output method of '(X)HTML' or something
> > > like that(?)
> >
> > What's the matter with just using the XML formatter? The fact that you
> > want <br /> instead of <br/>? I reckon you could patch the Xerces XML
> > formatter to do that instead.
> 
> Correctomundo - the fact that I don't want the majority existing
> browsers to break when I try to feed XML to them!
> 
> Ok, I downloaded the source last night and had a poke around. It seems
> the Xerces people have already thought of this, and have built in an
> XHTML output method, so all we need is an XHTMLFormatter that will hook
> into it.
> 
> So that's what I did - it was v. simple case of making a couple of
> adjustments to the existing HTMLFormatter class. In my setup, I added a
> formatting type of text/xhtml to cocoon.properties - this isn't strictly
> correct, because XHTML should have a mime-type of text/html - this
> causes problems when you specify the type in an XSL PI instead of in the
> XML because Cocoon seems to send 'text/xhtml' instead of querying the
> formatter for the mime-type which is set to 'text/html'.
> 
> I have attached it (along with the classfile for other users
> convenience) - perhaps somebody could add it to the CVS tree? Stefano,
> any chance of it being added to the project and included in the new
> release next week? 

Sure :)

> I am convinced that Cocoon needs to be able to output
> backwards compatible XHTML until XML-unaware browsers die out.

+1
 
> I hope somebody finds it useful enough to include it in the project
> since it's my first contribution. Not much I know but if it's received
> well I might get CVS access and become a regular(ish) developer.

If your contributions come regularly and are as good as this one, you'll
surely get CVS access, willing or not ;-)




Re: XSL Output

Posted by Donald Ball <ba...@webslingerZ.com>.
On Thu, 20 Apr 2000, Darren Scott wrote:

> Donald Ball wrote:
> > > Can I write a formatting class that will solve the problem? If so, I
> > > would gladly do that and contribute it to the collective... It'd be nice
> > > if you could specify an XSLT output method of '(X)HTML' or something
> > > like that(?)
> > 
> > What's the matter with just using the XML formatter? The fact that you
> > want <br /> instead of <br/>? I reckon you could patch the Xerces XML
> > formatter to do that instead.
> 
> Correctomundo - the fact that I don't want the majority existing
> browsers to break when I try to feed XML to them!

So use the HTML Formatter. I still fail to see why you wouldn't want to do
that. Am I missing something? Simply spitting out XHTML with spaces before
the trailing slash isn't enough to ensure older browsers will work
properly. One of the 3.x browsers ignores <br />, if I recall aright, and
I'm also pretty sure that simulating non-valued attributes with constructs
like <select multiple=""> doesn't work as expected on one of the 3.x
browsers.

- donald


Re: XSL Output

Posted by Darren Scott <ds...@bluecheese.co.uk>.
Donald Ball wrote:
> > Can I write a formatting class that will solve the problem? If so, I
> > would gladly do that and contribute it to the collective... It'd be nice
> > if you could specify an XSLT output method of '(X)HTML' or something
> > like that(?)
> 
> What's the matter with just using the XML formatter? The fact that you
> want <br /> instead of <br/>? I reckon you could patch the Xerces XML
> formatter to do that instead.

Correctomundo - the fact that I don't want the majority of existing
browsers to break when I try to feed XML to them!

Ok, I downloaded the source last night and had a poke around. It seems
the Xerces people have already thought of this, and have built in an
XHTML output method, so all we need is an XHTMLFormatter that will hook
into it.

So that's what I did - it was a very simple case of making a couple of
adjustments to the existing HTMLFormatter class. In my setup, I added a
formatting type of text/xhtml to cocoon.properties - this isn't strictly
correct, because XHTML should have a mime-type of text/html - this
causes problems when you specify the type in an XSL PI instead of in the
XML because Cocoon seems to send 'text/xhtml' instead of querying the
formatter for the mime-type which is set to 'text/html'.

I have attached it (along with the classfile for other users'
convenience) - perhaps somebody could add it to the CVS tree? Stefano,
any chance of it being added to the project and included in the new
release next week? I am convinced that Cocoon needs to be able to output
backwards compatible XHTML until XML-unaware browsers die out.

I hope somebody finds it useful enough to include it in the project,
since it's my first contribution. Not much, I know, but if it's received
well I might get CVS access and become a regular(ish) developer.

Regards,

Darren Scott
Production Director
bluecheese.co.uk

Re: XSL Output

Posted by Donald Ball <ba...@webslingerZ.com>.
On Wed, 19 Apr 2000, Darren Scott wrote:

> I posted this message a week or so ago and didn't get a response, so I
> thought I'd repost in case I didn't make any sense because I think it's
> an issue that everybody should be concerned about.
> 
> It touches on some recent discussion, such as Tom Stuart's posting about
> XSLT formatting.
> 
> Here's the message:
> 
> > My early experiences with XSL (or at least Xalan) show that it is not
> > easy to output HTML 4.01 compliant XHTML (ie with spaces before the
> > trailing slashes on standalone tags).
> > 
> > Setting the output to HTML removes all such trailing slashes, and
> > setting output to XML remove all such trailing spaces!
> 
> So - to expand, I would like to output validating XHTML rather than
> HTML. The issue is that a lot of browsers cannot cope with strict XHTML
> because they don't understand the trailing forward slashes on standalone
> tags (like <br/>) - they treat the slash as though it were part of the
> tagname.
> 
> W3C (I'm sure everybody knows this) recommends putting a space character
> between the tag name and the slash to overcome this problem. The problem
> here is that there doesn't seem to be any way of outputting this type of
> backwards compatible tag with Cocoon.
> 
> As I mentioned above, the HTML formatter will convert <br /> to <br> and
> the XML formatter will convert <br /> to <br/>
> 
> I am being stupid?
> 
> Can I write a formatting class that will solve the problem? If so, I
> would gladly do that and contribute it to the collective... It'd be nice
> if you could specify an XSLT output method of '(X)HTML' or something
> like that(?)

What's the matter with just using the XML formatter? The fact that you
want <br /> instead of <br/>? I reckon you could patch the Xerces XML
formatter to do that instead.

- donald


Re: XSL Output

Posted by tom stuart <to...@obsess.com>.
On Wed, 19 Apr 2000, Darren Scott wrote:

> It touches on some recent discussion, such as Tom Stuart's posting
> about XSLT formatting.
>
> My early experiences with XSL (or at least Xalan) show that it is not
> easy to output HTML 4.01 compliant XHTML (ie with spaces before the
> trailing slashes on standalone tags).
> 
> Setting the output to HTML removes all such trailing slashes, and
> setting output to XML remove all such trailing spaces!
> 
> So - to expand, I would like to output validating XHTML rather than
> HTML.

This is indeed essentially the same as what I was asking about. The
frustrating thing is that the (unformatted) Xalan output is almost exactly
what I want! It's just the (mandatory) Cocoon formatters that start
messing with it - and, as a petty aside, produce *really* ugly HTML.

Judging by Stefano's reply, though, I really am just missing something. I
guess that you really are getting XML out of Xalan, so it's only right
that it gets processed into something presentational before getting
output.

An XHTML processor would be nice, if you're up to the task. Oh, and can
you persuade it to leave the textual formatting alone as far as is
possible? org.apache.cocoon.formatter.HTMLFormatter is driving me up the
wall.

> I am being stupid?

I probably am too.

-Tom


XSL Output

Posted by Darren Scott <ds...@bluecheese.co.uk>.
Hi,

I posted this message a week or so ago and didn't get a response, so I
thought I'd repost in case I didn't make any sense because I think it's
an issue that everybody should be concerned about.

It touches on some recent discussion, such as Tom Stuart's posting about
XSLT formatting.

Here's the message:

> My early experiences with XSL (or at least Xalan) show that it is not
> easy to output HTML 4.01 compliant XHTML (ie with spaces before the
> trailing slashes on standalone tags).
> 
> Setting the output to HTML removes all such trailing slashes, and
> setting output to XML remove all such trailing spaces!

So - to expand, I would like to output validating XHTML rather than
HTML. The issue is that a lot of browsers cannot cope with strict XHTML
because they don't understand the trailing forward slashes on standalone
tags (like <br/>) - they treat the slash as though it were part of the
tagname.

W3C (I'm sure everybody knows this) recommends putting a space character
between the tag name and the slash to overcome this problem. The problem
here is that there doesn't seem to be any way of outputting this type of
backwards compatible tag with Cocoon.

As I mentioned above, the HTML formatter will convert <br /> to <br> and
the XML formatter will convert <br /> to <br/>

Am I being stupid?

Can I write a formatting class that will solve the problem? If so, I
would gladly do that and contribute it to the collective... It'd be nice
if you could specify an XSLT output method of '(X)HTML' or something
like that(?)

Respect,

Darren Scott
Production Director
bluecheese.co.uk

Re: Converting HTML to XML/XSL

Posted by Darren Scott <ds...@bluecheese.co.uk>.
"K.C. Jones" wrote:
> 
> PSA wrote:
> > We have a large number of simple documents in HTML 3/4 to
> > be converted to XML/XSL so that they can be handled via
> > XML/XSL tools.
> 
> The W3C's HTML Tidy includes a tool which will attempt to
> convert HTML to XHTML.  You can download it free at:
> http://www.chami.com/html-kit/

Just slightly off-topic...

My early experiences with XSL (or at least Xalan) show that it is not
easy to output HTML 4.01 compliant XHTML (i.e. with spaces before the
trailing slashes on standalone tags).

Setting the output to HTML removes all such trailing slashes, and
setting output to XML removes all such trailing spaces!

Is there an output mode that can produce backwards-compatible XHTML?

Darren Scott
Production Director
bluecheese.co.uk

Re: Converting HTML to XML/XSL

Posted by Daniel Barclay <Da...@digitalfocus.com>.
Ed Knutson wrote:
> 
...
> 
> If all you need to do is translate from XHTML to HTML, that would not be a
> problem; you just need to use Cocoon's HTML formatter, which will basically
> reverse-tidy it. ...

That sounds interesting.  Where is that?

(I assume that uses the HTML parser I heard that Cocoon had somewhere.
Is that part of Xerces or somewhere else?)


Daniel
-- 
Daniel Barclay
Digital Focus
Daniel.Barclay@digitalfocus.com

Re: Converting HTML to XML/XSL

Posted by Ed Knutson <ed...@webcombo.net>.
On Tue, 11 Apr 2000, PSA wrote:
> Ed Knutson wrote:
> > You're still not going to get out of removing all the style related elements
> > and moving them to a style sheet.
> 
> Are there any "generic" HTML 3.2/4 XSL sheets out there which could
> provide the elemental framework for turning this XML back into
> non-stylesheet-based HTML?  We're looking for a foundation which could
> do the "grunt work" of translating the basic elements, and XHTML won't
> do since ~45% of our (rather large) userbase is stuck with netscape 3
> due to hardware limitations (yeah, I know, that's a difficult target
> these days).

Um.  No.

HTML Tidy simply converts HTML to XHTML.  It will save you the trouble of
making sure all the <p> tags are closed and that the <br>s become <br />.  You
still have to remove all the html tags you don't want, rename some of the
others to fit your own XML needs and to group your content into logical
sections, and then decide how you want these sections to be ultimately rendered.

You cannot use a generic XSL stylesheet unless you are targeting a specific
XML extended format (such as MathML) for which someone may have written one. 
The nature of XSL makes it very dependent on the specific tags in the source
XML.

If all you need to do is translate from XHTML to HTML, that would not be a
problem; you just need to use Cocoon's HTML formatter, which will basically
reverse-tidy it.  So, if you write a stylesheet that will produce XHTML from
XML, Cocoon will do the rest, if you tell it to.  See the examples that come
with it for more info.  :)

-ed
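
A minimal sketch of the point above about stylesheets being tied to the source tags, assuming the JDK's built-in JAXP transformer rather than Cocoon's HTML formatter. The element names article, headline, para and linebreak, and the class name CustomToHtmlDemo, are hypothetical; a different XML vocabulary would need different templates. With the "html" output method the serializer writes plain <br>, which is the form a browser such as Netscape 3 expects.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CustomToHtmlDemo {
    // One template per source element: the stylesheet is only useful for
    // documents that use exactly this (hypothetical) vocabulary.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='article'>"
      + "<html><body><xsl:apply-templates/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match='headline'><h1><xsl:apply-templates/></h1></xsl:template>"
      + "<xsl:template match='para'><p><xsl:apply-templates/></p></xsl:template>"
      + "<xsl:template match='linebreak'><br/></xsl:template>"
      + "</xsl:stylesheet>";

    static String toHtml(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><headline>News</headline>"
                   + "<para>one<linebreak/>two</para></article>";
        // The <br/> written in the stylesheet comes out as <br> in the result.
        System.out.println(toHtml(xml));
    }
}
```

The stylesheet itself is well-formed XML and writes <br/>; it is the output method, not the stylesheet author, that decides how the empty element is serialized.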

Re: Converting HTML to XML/XSL

Posted by PSA <po...@posom.com>.
Ed Knutson wrote:
> 
> > But don't get your hopes up too much with HTML Tidy's XHTML
> > or XML conversion tools.  In my quick checks, the
> > HTML->XHTML conversion was significantly incomplete.  It
> > mangles overlapped tags, it fails to treat <script> elements
> > correctly,...  And in my use it's not too stable either.
> 
> I had a developer convert about 50 documents using Tidy.  It has a lot of
> options to configure/needs a lot of tweaking.
> 
> There is a mode of operation where it only fixes overlapped tags.  It's
> sometimes a good idea to use a two or three pass approach.  It doesn't get
> confused so easily when it has less simultaneous tasks.
> 
> You're still not going to get out of removing all the style related elements
> and moving them to a style sheet.

Are there any "generic" HTML 3.2/4 XSL sheets out there which could
provide the elemental framework for turning this XML back into
non-stylesheet-based HTML?  We're looking for a foundation which could
do the "grunt work" of translating the basic elements, and XHTML won't
do since ~45% of our (rather large) userbase is stuck with netscape 3
due to hardware limitations (yeah, I know, that's a difficult target
these days).

-PSA

Re: Converting HTML to XML/XSL

Posted by Ed Knutson <ed...@webcombo.net>.
> But don't get your hopes up too much with HTML Tidy's XHTML
> or XML conversion tools.  In my quick checks, the
> HTML->XHTML conversion was significantly incomplete.  It
> mangles overlapped tags, it fails to treat <script> elements
> correctly,...  And in my use it's not too stable either.

I had a developer convert about 50 documents using Tidy.  It has a lot of
options to configure and needs a lot of tweaking.

There is a mode of operation where it only fixes overlapped tags.  It's
sometimes a good idea to use a two or three pass approach.  It doesn't get
confused so easily when it has fewer simultaneous tasks.

You're still not going to get out of removing all the style related elements
and moving them to a style sheet.

-ed


Re: Converting HTML to XML/XSL

Posted by "K.C. Jones" <kj...@phoenix-pop.com>.
PSA wrote:
> We have a large number of simple documents in HTML 3/4 to
> be converted to XML/XSL so that they can be handled via
> XML/XSL tools.

The W3C's HTML Tidy includes a tool which will attempt to
convert HTML to XHTML.  You can download it free at:
http://www.chami.com/html-kit/

Re: Converting HTML to XML/XSL

Posted by sunil <su...@gainskeeper.com>.
I too have a similar requirement.
It seems that you have already set up the platform to do this.
I need some help setting up a similar environment on NT using the same tools,
and I also want to convert the XML documents to PDF.

I would appreciate any help in this regard.

Thanks
Sunil