Posted to modperl@perl.apache.org by "Jonathan M. Hollin" <ne...@digital-word.com> on 2002/08/23 17:54:10 UTC

[OT] HTML to XHTML conversion

[OFF TOPIC]

I am trying to find a module that can convert HTML to XHTML, but have 
drawn a blank on CPAN and GOOGLE.  Is there anything out there to do 
this other than HTML TIDY?

I am developing a mod_perl CMS application at the moment.  All its 
output is compliant with XHTML Transitional.  But its users can create 
content that isn't (and are likely to) and I'd like to parse this and 
convert it to XHTML before it goes into the RDBMS if possible.

If nothing exists along these lines - would anyone like to collaborate 
on the development of a module for this purpose?  HTML::XHTML anyone?


-- 
Jonathan M. Hollin

Co-ordinator:  WYPUG (http://wypug.pm.org/)


Re: [CGI] [OT] HTML to XHTML conversion

Posted by Roy Schroeder <ra...@imap4.com>.
Complete automatic conversion is not possible, since someone could enter
HTML that omits attributes required by XHTML Transitional or contains
attributes it does not allow, and the conversion program would
1. not know what value to supply for required but omitted attributes, or
2. seriously change the rendering of the page by removing the
"not allowed" attributes.
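
To illustrate the first point: XHTML Transitional requires an alt
attribute on every img element, and a converter cannot invent a
meaningful value for it. A minimal Perl sketch that only flags such tags
for a human to fix might look like this. The helper name missing_alt is
hypothetical, and the regex matching is a simplification, not a real
HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <img> tag in $html that lacks an alt
# attribute. Regex-based, so only suitable for simple markup.
sub missing_alt {
    my ($html) = @_;
    my @img_tags = $html =~ /(<img\b[^>]*>)/gi;
    return grep { !/\salt\s*=/i } @img_tags;
}

my @flagged = missing_alt('<p><img src="a.png"><img src="b.png" alt="B"></p>');
print "needs alt text: $_\n" for @flagged;
# prints: needs alt text: <img src="a.png">
```

Flagging instead of filling in alt="" keeps the decision with the
content author, which seems the safer default for a CMS.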

Of course the simple mechanical rules -
1. all tag names in lower case
2. all tags closed
3. all attribute values quoted
4. proper tag nesting
etc.
could be automated and may be sufficient for your purposes.
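
As a rough sketch of how rules 1 and 3 might be automated in plain Perl:
the helper below lower-cases tag and attribute names and double-quotes
every attribute value. The name normalize_tags is hypothetical. It does
not close tags or repair nesting (rules 2 and 4), it assumes no '>'
inside attribute values, and it does not escape ampersands. A bare
attribute such as CHECKED is expanded to checked="checked", since XHTML
does not allow minimized attributes.

```perl
use strict;
use warnings;

# Hypothetical helper covering rules 1 and 3 only. Regex-based, so it is
# a sketch for simple fragments rather than a substitute for a parser.
sub normalize_tags {
    my ($html) = @_;
    $html =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(\s*/)?>}{
        my ($slash, $name, $rest, $selfclose) = ($1, lc($2), $3, $4);
        my $attrs = '';
        while ($rest =~ /([A-Za-z_:][-\w:.]*)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g) {
            my $attr  = lc $1;
            my $value = defined $2 ? $2 : defined $3 ? $3 : defined $4 ? $4 : $attr;
            $value =~ s/"/&quot;/g;
            $attrs .= qq{ $attr="$value"};
        }
        "<$slash$name$attrs" . (defined $selfclose ? ' /' : '') . '>';
    }ge;
    return $html;
}

print normalize_tags(q{<A HREF=foo.html TITLE='x'>link</A><BR /><INPUT CHECKED>}), "\n";
# prints: <a href="foo.html" title="x">link</a><br /><input checked="checked">
```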

Regards
Roy




Re: [OT] HTML to XHTML conversion

Posted by Ilya Martynov <il...@martynov.org>.
>>>>> On Fri, 23 Aug 2002 16:54:10 +0100, "Jonathan M. Hollin" <ne...@digital-word.com> said:

JMH> [OFF TOPIC]
JMH> I am trying to find a module that can convert HTML to XHTML, but have
JMH> drawn a blank on CPAN and GOOGLE.  Is there anything out there to do
JMH> this other than HTML TIDY?

You can try to use XML::LibXML to parse HTML and re-output
it using XML::SAX::Writer.

Something like:

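use strict;
use warnings;
use XML::LibXML;
use XML::SAX::Writer;
use XML::LibXML::SAX::Parser;

my $html = '<html><head></head><body></body></html>';

# The writer serializes incoming SAX events as XML on STDOUT.
my $writer = XML::SAX::Writer->new(Output => \*STDOUT);
# Despite its name, this class can also walk a DOM and emit SAX events.
my $generator = XML::LibXML::SAX::Parser->new(Handler => $writer);
# Parse with the HTML parser, then replay the resulting tree as XML.
my $dom = XML::LibXML->new->parse_html_string($html);
$generator->generate($dom);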


-- 
Ilya Martynov (http://martynov.org/)

Re: [OT] HTML to XHTML conversion

Posted by Ilya Martynov <il...@martynov.org>.
>>>>> On Wed, 28 Aug 2002 10:07:07 +0100, Jean-Michel Hiver <jh...@mkdoc.com> said:

JM> <body><body></body></body> is not valid XHTML for example.
JM> <input type="text" name="foo"></input> is not valid XHTML either.
JM> You have to be careful about block-level and inline elements.

Actually <input type="text" name="foo"></input> is valid XHTML.

Correct me if I'm wrong but AFAIK <xxx></xxx> is exactly equivalent to
<xxx/>.

<input type="text" name="foo">something</input> is not valid.
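
A small sketch of that equivalence using XML::LibXML as a plain XML
parser: the start/end pair and the empty-element tag both produce an
element with no children, while the third form has a text child. The
sample strings come from this thread; the hash names are illustrative.
This only checks the parsed tree, it does not validate against the XHTML
DTD. For delivery to HTML browsers, the XHTML 1.0 compatibility
guidelines still recommend the minimized form with a space before the
slash.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
my %sample = (
    pair    => '<input type="text" name="foo"></input>',
    empty   => '<input type="text" name="foo"/>',
    content => '<input type="text" name="foo">something</input>',
);

# Keep the documents alive and report whether each root has children.
my %doc;
for my $label (sort keys %sample) {
    $doc{$label} = $parser->parse_string($sample{$label});
    printf "%-7s has children: %s\n", $label,
        $doc{$label}->documentElement->hasChildNodes ? 'yes' : 'no';
}

# Both childless forms serialize to the same string.
print "pair and empty serialize identically\n"
    if $doc{pair}->documentElement->toString
    eq $doc{empty}->documentElement->toString;
```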

JM> etc. etc...

JM> Besides, you cannot use an XML parser to parse HTML. You have to use
JM> something like HTML::TreeBuilder instead. Part of HTML::Tree, excellent
JM> module IMHO.

XML::LibXML supports HTML too.

-- 
Ilya Martynov (http://martynov.org/)

Re: [OT] HTML to XHTML conversion

Posted by Jean-Michel Hiver <jh...@mkdoc.com>.
On Fri 23-Aug-2002 at 11:07:35AM -0500, D. Hageman wrote:
> 
> My suggestion would be to just use an XML parser module like XML::LibXML.  
> Load the file up using the HTML loading functions and print it using the
> XML printing functions ... since the only difference I can see between 
> HTML and XHTML is that optional ending tags are no longer optional (per 
> XML spec) and single tags must be ended properly (per XML spec).

There's a lot more than that.

<body><body></body></body> is not valid XHTML for example.
<input type="text" name="foo"></input> is not valid XHTML either.
You have to be careful about block-level and inline elements.

etc. etc...

Besides, you cannot use an XML parser to parse HTML. You have to use
something like HTML::TreeBuilder instead. Part of HTML::Tree, excellent
module IMHO.
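
For reference, a minimal sketch of the HTML::TreeBuilder route: parse
the tag soup into a tree, then serialize it with the as_XML method that
the tree inherits from HTML::Element. The sample input is made up, and
the exact output (whitespace, entity encoding) depends on the installed
HTML::Tree version, so none is shown here. The result is well-formed
markup, not necessarily valid XHTML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up tag soup: unclosed paragraphs and an unclosed <br>.
my $html = '<p>first paragraph<p>second paragraph<br>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $xml  = $tree->as_XML;   # serialize the repaired tree as XML
$tree->delete;              # free the tree explicitly

print $xml;
```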

Cheers,
-- 
IT'S TIME FOR A DIFFERENT KIND OF WEB
================================================================
  Jean-Michel Hiver - Software Director
  jhiver@mkdoc.com
  +44 (0)114 255 8097
================================================================
                                      VISIT HTTP://WWW.MKDOC.COM

Re: [OT] HTML to XHTML conversion

Posted by Adrian Howard <ad...@quietstars.com>.
On Friday, August 23, 2002, at 04:54  pm, Jonathan M. Hollin wrote:

> [OFF TOPIC]
>
> I am trying to find a module that can convert HTML to XHTML, but have 
> drawn a blank on CPAN and GOOGLE.  Is there anything out there to do 
> this other than HTML TIDY?
[snip]
> If nothing exists along these lines - would anyone like to collaborate 
> on the development of a module for this purpose?  HTML::XHTML anyone?
>
Out of curiosity... why not tidy? It seems to do a pretty darn good job 
of it - I use it all of the time.

Adrian


Re: [OT] HTML to XHTML conversion

Posted by "D. Hageman" <dh...@dracken.com>.
My suggestion would be to just use an XML parser module like XML::LibXML.  
Load the file up using the HTML loading functions and print it using the
XML printing functions ... since the only difference I can see between 
HTML and XHTML is that optional ending tags are no longer optional (per 
XML spec) and single tags must be ended properly (per XML spec).




-- 
//========================================================\\
||  D. Hageman                    <dh...@dracken.com>  ||
\\========================================================//


RE: [OT] HTML to XHTML conversion

Posted by Jesse Erlbaum <je...@erlbaum.net>.
Hi Jonathan --

> I am trying to find a module that can convert HTML to XHTML, but have
> drawn a blank on CPAN and GOOGLE.  Is there anything out there to do
> this other than HTML TIDY?

The big problem you're likely to face is that non-XHTML HTML is
inherently... well... NON-XHTML!  For instance, the following will work as a
complete document on most web browsers:

----START---->
<h1>Hello World!</h1>
<----END----

To turn this into XHTML you have to make some assumptions -- such as <html>,
<head>, and <body> tags.

My example is a simple one.  Depending on how broken the HTML is your module
might end up not being very reusable at all.  It might start looking quite
specialized for your particular set of documents and problems as you add
more and more heuristic rules to accommodate non-XML, non-XHTML compliant
content.
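
The assumptions needed for the bare <h1> example above can be made
explicit in a few lines of Perl. This is only a sketch: wrap_fragment is
a hypothetical helper, the 'Untitled' placeholder is an assumption (a
title element is required, and the fragment gives no hint what it should
say), and neither the title nor the fragment is escaped or checked.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a body-level fragment in a minimal
# XHTML 1.0 Transitional skeleton, leaving full documents untouched.
sub wrap_fragment {
    my ($fragment, $title) = @_;
    $title = 'Untitled' unless defined $title && length $title;
    return $fragment if $fragment =~ /<html\b/i;   # already a full document
    return <<"END";
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>$title</title></head>
<body>
$fragment
</body>
</html>
END
}

print wrap_fragment('<h1>Hello World!</h1>');
```

Every value the helper has to make up is one more heuristic of the kind
described above.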

Warmest regards,

-Jesse-


--

  Jesse Erlbaum
  The Erlbaum Group
  jesse@erlbaum.net
  Phone: 212-684-6161
  Fax: 212-684-6226