
Posted to j-users@xerces.apache.org by Jan Uhlir <es...@centrum.cz> on 2006/04/27 16:05:53 UTC

Convert HTML to XHTML with namespace prefix using Neko + Xerces

Alias: How to force a default namespace to use prefix

Sorry if I missed something important; I'm quite new to namespace issues.
But I'm stuck on the last step of the whole transformation process.

Everything works nicely, except that the XHTML namespace is set as the default
namespace, so no prefix (preferably 'html') is included in element names when
the document is serialized back to a string.

I'm getting:
<html xmlns="http://www.w3.org/1999/xhtml">
<body> some <b> bold </b> text </body>
</html>

But I need:
<html xmlns:html="http://www.w3.org/1999/xhtml">
<html:body> some <html:b> bold </html:b> text </html:body>
</html>

Why? Because in reality I pick pieces of HTML (often corrupt!) from a database,
transform them to valid XHTML, and finally assemble them into another, bigger
XML document with multiple namespaces. Specifically, I am building an RSS/Atom feed.

So my questions are:
How do I force a default namespace to use a prefix?
Is this a matter for the parser or for the serializer (transformer)?
How do I pick the prefix name for a namespace? Preferably 'html'.

Here is my code:

// set up Neko parser, set html tag fixing routines and namespaces on
org.cyberneko.html.parsers.DOMParser parser =
   new org.cyberneko.html.parsers.DOMParser();

parser.setFeature(
   "http://cyberneko.org/html/features/balance-tags", true);
parser.setProperty(
   "http://cyberneko.org/html/properties/names/elems", "lower");
parser.setFeature(
   "http://cyberneko.org/html/features/override-namespaces", 
   true);
parser.setFeature(
   "http://cyberneko.org/html/features/insert-namespaces",
    true);
parser.setProperty(
   "http://cyberneko.org/html/properties/namespaces-uri",
   "http://www.w3.org/1999/xhtml");
            
// parse html fragment, fix it and return full and valid XML document
parser.parse(
   new InputSource(
   new StringReader(htmlFragment)));
return  parser.getDocument();

// ..OK, let's serialize it back to string!

// prepare serializer
StringWriter sw = new StringWriter();
Transformer t = TransformerFactory.newInstance()
  .newTransformer();
t.setOutputProperty(OutputKeys.METHOD, "xml");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

// Serialize DOM tree
t.transform(new DOMSource(node), new StreamResult(sw));
String outputXHTML = sw.toString();

P.S.
The NekoHTML parser is a real treasure! It helps a lot with closing HTML
tags, misbalanced tags, etc. Thanks, Andy.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org

Re: Convert HTML to XHTML with namespace prefix using Neko + Xerces

Posted by Joseph Kesselman <ke...@us.ibm.com>.

>But then I have to pick the content of body tag, already serialized, using
> substring operation,

Don't use string operations to manipulate XML. Use XML APIs. They're
namespace-aware and will Do The Right Things.
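
A minimal sketch of that advice, assuming only the JAXP classes bundled with the JDK (no NekoHTML) and a small well-formed input. The class and method names (`BodyContent`, `parse`, `serializeChildren`) are hypothetical and not from the thread. Each child of `<body>` is serialized as its own DOM node, so the serializer, not a substring operation, decides which namespace declarations each piece needs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class BodyContent {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Parse a string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize every child node of 'parent' separately and concatenate the results.
    static String serializeChildren(Element parent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
                t.transform(new DOMSource(c), new StreamResult(sw));
            }
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html xmlns=\"" + XHTML + "\">"
                + "<body> some <b> bold </b> text </body></html>");
        Element body = (Element) doc.getElementsByTagNameNS(XHTML, "body").item(0);
        System.out.println(serializeChildren(body));
    }
}
```

The returned string should be safe to embed in a document that does not declare XHTML as its default namespace, because each serialized element carries the declaration it depends on.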

______________________________________
"... Three things are most perilous: Connectors that corrode,
  Unproven algorithms, and self-modifying code! ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)



Re: Convert HTML to XHTML with namespace prefix using Neko + Xerces

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

"Jan Uhlir" <es...@centrum.cz> wrote on 04/27/2006 11:33:21 AM:

<snip/>

> Is it always necessary for default namespaces to have prefix=null?
> Is it kind of a w3c rule?

Yes. Here's the definition [1][2][3].

> Regards,
> Jan

[1] http://www.w3.org/TR/xml-names11/#dt-defaultNS
[2] http://www.w3.org/TR/xml-names11/#defaulting
[3] http://www.w3.org/TR/REC-xml-names/#defaulting
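
A small sketch of how this rule surfaces in the DOM API, assuming the JDK's built-in parser; the class name `DefaultNamespaceDemo` and the helper `parseRoot` are hypothetical. An element that gets its namespace from an `xmlns="..."` declaration reports that namespace URI but has no prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class DefaultNamespaceDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Parse a string with a namespace-aware parser and return the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<html xmlns=\"" + XHTML + "\"><body/></html>");
        // The element is in the XHTML namespace but carries no prefix.
        System.out.println("prefix:            " + root.getPrefix());
        System.out.println("namespace URI:     " + root.getNamespaceURI());
        System.out.println("is default NS:     " + root.isDefaultNamespace(XHTML));
        // Passing null asks for the URI bound to the default namespace.
        System.out.println("lookup(null):      " + root.lookupNamespaceURI(null));
    }
}
```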

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org



Re: Convert HTML to XHTML with namespace prefix using Neko + Xerces

Posted by Jan Uhlir <es...@centrum.cz>.

Thanks for the prompt reply!


> > Everything works nice, except that XHTML namespace is set as default
> > namespace, so no prefixes, preferably 'html' prefix, is not included in
> > element names when serialized back to string.
> > I'm getting:
> > 
> > <body> some <b> bold </b> text </body>
> > 
> > But I need:
> > 
> > <html:body> some <html:b> bold </html:b> text </html:body>
> > 
> > [...]

> 
> I don't understand your problem. Both versions are equivalent. 

Right, they are equivalent.

But then I have to pick the content of the body tag, already serialized, using a
substring operation, and insert it into a generated JSP (the RSS/Atom feed). Because
of the substring, the namespace declaration is gone, so I need the prefix to be
present on each element.

Note: the XHTML namespace is declared at the top of the JSP, so the fragment fits in well.

> Even if you
> don't compose your final RSS/Atom document using DOM or similar, 

It's a JSP and Struts application, a small part of a bigger structure.

> but
> pasting the generated XML string into another one (which you shouldn't do
> IMHO), 

Right, I shouldn't! But it would involve a large redesign of existing parts.
I can't even imagine how. I have to stick with a string field for a while.


> ..
> In the first version, the default namespace
> is set to the XHTML namespace. It therefore also applies to the <body> and
> <b> elements.

Yes, I suppose it's applied correctly, but as a default namespace, i.e.
prefix=null. That's what I don't want.

Is it always necessary for default namespaces to have prefix=null?
Is it kind of a w3c rule?

Regards,
Jan



Re: Convert HTML to XHTML with namespace prefix using Neko + Xerces

Posted by Klaus Malorny <Kl...@knipp.de>.

Jan Uhlir wrote:
> Alias: How to force a default namespace to use prefix
> 
> Sorry if I missed something important, I'm quite new to namespace problematics.
> But I'm deadlocked at the last point to solve of the whole transformation process.
> 
> Everything works nice, except that XHTML namespace is set as default namespace, 
> so no prefixes, preferably 'html' prefix, is not included in element names when 
> serialized back to string.
> 
> I'm getting:
> <html xmlns="http://www.w3.org/1999/xhtml">
> <body> some <b> bold </b> text </body>
> </html>
> 
> But I need:
> <html xmlns:html="http://www.w3.org/1999/xhtml">
> <html:body> some <html:b> bold </html:b> text </html:body>
> </html>
> 
> Why? Because in reality I pick peaces of html - often corrupt! - from database 
> transforming them to valid xhtml and finally assemble them into another, bigger 
> XML, with multiple namespaces.  Indeed, I build RSS/Atom feed.
> [...]

Hi,

I don't understand your problem. Both versions are equivalent. Even if you don't
compose your final RSS/Atom document using DOM or similar, but instead paste the
generated XML string into another one (which you shouldn't do IMHO), it should
be correct. In the first version, the default namespace is set to the XHTML
namespace. It therefore also applies to the <body> and <b> elements.
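
The equivalence can be checked with a short sketch, assuming the JDK's built-in namespace-aware parser. The class name `EquivalentForms` and its helpers are hypothetical, and the prefixed input below also prefixes the root element (`html:html`), which is an assumption about the intended form. The sketch compares expanded names, that is, namespace URI plus local name, and ignores prefixes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class EquivalentForms {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Parse a string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // List every element as {namespaceURI}localName, in document order.
    static String expandedNames(Document doc) {
        StringBuilder sb = new StringBuilder();
        NodeList all = doc.getElementsByTagNameNS("*", "*");
        for (int i = 0; i < all.getLength(); i++) {
            Node n = all.item(i);
            sb.append('{').append(n.getNamespaceURI()).append('}')
              .append(n.getLocalName()).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document unprefixed = parse("<html xmlns=\"" + XHTML + "\">"
                + "<body> some <b> bold </b> text </body></html>");
        Document prefixed = parse("<html:html xmlns:html=\"" + XHTML + "\">"
                + "<html:body> some <html:b> bold </html:b> text </html:body></html:html>");
        System.out.println(expandedNames(unprefixed));
        System.out.println(expandedNames(prefixed));
        System.out.println("same expanded names: "
                + expandedNames(unprefixed).equals(expandedNames(prefixed)));
    }
}
```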

Regards,

Klaus


Re: Convert HTML to XHTML with namespace prefix using Neko + Xerces

Posted by Jan Uhlir <es...@centrum.cz>.

> return  parser.getDocument();
Oops,
node = parser.getDocument();
..should be here.
Sorry, in reality the code is in two methods.

I also tried various document modifications, like:

root.setAttribute("xmlns:html", "http://www.w3.org/1999/xhtml");
..
root.setAttributeNS("http://www.w3.org/2000/xmlns/","xmlns:html", 
  "http://www.w3.org/1999/xhtml");
..
body.setPrefix("html"); // traverse to all elements

..in many variants, but I always end up with a NAMESPACE_ERR exception
or the same un-prefixed output :(
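
For comparison, here is one possible way to get prefixed output. It is a sketch that nobody in this thread proposed or confirmed. It uses the DOM Level 3 method `Document.renameNode` on a document built by the JDK's namespace-aware parser; whether the DOM produced by NekoHTML behaves the same way is an assumption. The class and method names (`PrefixElements`, `usePrefix`, `addPrefix`) are hypothetical. Per the DOM specification, `setPrefix` raises NAMESPACE_ERR when the node's namespace URI is null, which may or may not be what happened above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class PrefixElements {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    static final String XMLNS = "http://www.w3.org/2000/xmlns/";

    // Parse a string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Recursively rename every element in namespace nsUri to prefix:localName.
    static void addPrefix(Document doc, Node node, String nsUri, String prefix) {
        if (node.getNodeType() == Node.ELEMENT_NODE && nsUri.equals(node.getNamespaceURI())) {
            // renameNode may return a different node, so continue with its result.
            node = doc.renameNode(node, nsUri, prefix + ":" + node.getLocalName());
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before the child is renamed
            addPrefix(doc, child, nsUri, prefix);
            child = next;
        }
    }

    // Swap the default declaration on the root for a prefixed one.
    // Limitation: default declarations on nested elements are not handled.
    static void usePrefix(Document doc, String nsUri, String prefix) {
        doc.getDocumentElement().removeAttribute("xmlns");
        addPrefix(doc, doc, nsUri, prefix);
        doc.getDocumentElement().setAttributeNS(XMLNS, "xmlns:" + prefix, nsUri);
    }

    // Serialize a node the same way as the original post does.
    static String serialize(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html xmlns=\"" + XHTML + "\">"
                + "<body> some <b> bold </b> text </body></html>");
        usePrefix(doc, XHTML, "html");
        System.out.println(serialize(doc));
    }
}
```

Renaming changes only the qualified names; every element stays in the XHTML namespace, so a namespace-aware consumer should see the same document as before.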

