Posted to general@xerces.apache.org by Armin Pfarr <ap...@vipsurf.de> on 2000/02/25 14:59:27 UTC

Identity-transformation

Hi,

I'm parsing documents with the Xerces DOMParser, modifying some nodes, and
then want to write the documents back to disk. At the moment there doesn't
seem to be a working solution for this. Leaving my DOM processing aside, the
simple question is whether there is a standard way to parse a document into
memory via DOMParser and stream it out again so that input and output are
identical.
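
For reference, the round trip described above can be sketched as follows. This is only a sketch under a loud assumption: it uses the later standard JAXP API (javax.xml.parsers and javax.xml.transform), not the Xerces 1.0.2 or Xalan 0.19.5 classes discussed in this thread, and the class and method names (RoundTrip, roundTrip) are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse XML into a DOM and stream it out again.
final class RoundTrip {
    static String roundTrip(String xml) throws Exception {
        // Parse the input into an in-memory DOM tree.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // (Any DOM modifications would happen here.)

        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("<a x=\"1\"><b>text</b></a>"));
    }
}
```

Even with this approach the output is not guaranteed to be byte-identical to the input: the DOCTYPE, entity references, attribute quoting and whitespace inside tags may all come out differently.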

1. Serializing with Xerces 1.0.2's XMLSerializer doesn't work
I tried to serialize the DOM document with:

DOMParser parser = new DOMParser();
parser.parse(input);
Document d = parser.getDocument();
PrintWriter writer = new PrintWriter(.....);
OutputFormat format = new OutputFormat();
format.setMethod(Method.XML);
format.setOmitXMLDeclaration(false);
format.setPreserveSpace(true);
format.setVersion("1.0");
Serializer serializer =
    SerializerFactory.getSerializerFactory(Method.XML).makeSerializer(writer, format);
serializer.asDOMSerializer().serialize(d);

After serializing, the file does not contain a space between the public
identifier and the system identifier in the DOCTYPE declaration. I don't know
whether this is the only problem, but the resulting file doesn't parse and is
not identical to the input.
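
To make the expected output concrete: XML 1.0 requires whitespace between the public identifier literal and the system identifier literal in a DOCTYPE declaration. The following is a minimal sketch of writing that declaration by hand. The class and method names (DoctypeWriter, doctype) are hypothetical, and the sketch assumes that neither identifier contains a double quote.

```java
// Hypothetical helper that formats a DOCTYPE declaration.
final class DoctypeWriter {
    static String doctype(String rootName, String publicId, String systemId) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(rootName);
        if (publicId != null) {
            if (systemId == null) {
                // In a DOCTYPE, PUBLIC must be followed by a system identifier too.
                throw new IllegalArgumentException("public id requires a system id");
            }
            // Note the space between the two quoted literals.
            sb.append(" PUBLIC \"").append(publicId).append("\" \"")
              .append(systemId).append('"');
        } else if (systemId != null) {
            sb.append(" SYSTEM \"").append(systemId).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(doctype("html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"));
    }
}
```

The identifiers could be taken from the DOM's DocumentType node (getPublicId and getSystemId in DOM Level 2) and the declaration written before serializing the root element.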

2. When using Xalan 0.19.5, you run into major entity problems
My file contains entity references to the standard XHTML entity sets (e.g.
&auml;), which are declared in a separate file. I don't want to convert these
references to Unicode characters; I want to leave them as they are. I tried
several stylesheets with several encodings, but wasn't able to produce proper
output.
Here is a sample XSLT stylesheet:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- I also tried several other encodings -->
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="*|@*|comment()|processing-instruction()|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|comment()|processing-instruction()|text()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

As you can see, I just do a straight copy-over.
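
If the processor expands the references no matter what, one possible workaround is to map the affected characters back to named references when writing text content. The sketch below only illustrates that idea and is not something Xalan or Xerces provides: the class name NamedEntityEscaper is hypothetical, the table covers just a few names from the XHTML Latin-1 entity set, and the output document must still declare those entities for the result to parse.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical post-processing step: turn selected characters back into
// named entity references such as &auml;.
final class NamedEntityEscaper {
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        // A small subset of the XHTML Latin-1 entity set.
        NAMES.put('\u00C4', "Auml");
        NAMES.put('\u00E4', "auml");
        NAMES.put('\u00D6', "Ouml");
        NAMES.put('\u00F6', "ouml");
        NAMES.put('\u00DC', "Uuml");
        NAMES.put('\u00FC', "uuml");
        NAMES.put('\u00DF', "szlig");
    }

    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                sb.append('&').append(name).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("K\u00E4se"));
    }
}
```

This cannot tell whether a character was written literally or as a reference in the input, so it yields a consistent form rather than an exact copy of the original.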

Has anybody run into the same problem before or does anybody have an idea
how to solve this without writing a specialized DOM-Serializer?

Armin



Re: Identity-transformation

Posted by Thomas Conradi <co...@prostep.de>.
Hello,

I had that problem (writing parts of my document back to a file) some time
ago using xml4c.

I solved it by converting the DOMString to a plain char*.

This function served my purposes for converting ISO Latin-1 and UTF-8 encoded
sources. I'm not sure whether it helps anybody else, but feel free to use it.

--------------------------------------------------------
// Copies a DOMString into a NUL-terminated char buffer, writing German
// umlauts and sharp s as two ASCII letters. One input character can become
// two bytes, so out must hold at least 2 * in.length() + 1 bytes.
void DOMString2CharP(DOMString in, char* out)
{
    int l = in.length(), j = 0;
    XMLCh c;
    for (int i = 0; i < l; i++)
    {
        c = in.charAt(i);

        if (<ISO LATIN-1>)   // ISO LATIN-1 instead of US ASCII 7BIT
        {
            switch (c)
            {
            case 196: out[j++] = 'A'; out[j] = 'e'; // Ä = FFC4 -> C4 = 196
                break;
            case 228: out[j++] = 'a'; out[j] = 'e'; // ä = FFE4 -> E4 = 228
                break;
            case 214: out[j++] = 'O'; out[j] = 'e'; // Ö = FFD6 -> D6 = 214
                break;
            case 246: out[j++] = 'o'; out[j] = 'e'; // ö = FFF6 -> F6 = 246
                break;
            case 220: out[j++] = 'U'; out[j] = 'e'; // Ü = FFDC -> DC = 220
                break;
            case 252: out[j++] = 'u'; out[j] = 'e'; // ü = FFFC -> FC = 252
                break;
            case 223: out[j++] = 's'; out[j] = 's'; // ß = FFDF -> DF = 223
                break;
            default: out[j] = (char) c; // just use the lower byte
            }
        }
        j++;
    }
    out[j] = '\0';
}
--------------------------------------------------------
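
For readers on the Java side of Xerces, the same transliteration can be sketched as below. This is an illustrative port, not part of the original xml4c code: the class name UmlautTransliterator is hypothetical, the encoding check from the function above is left out, and a StringBuilder replaces the caller-supplied buffer so the output cannot overflow.

```java
// Hypothetical Java port of the umlaut transliteration shown above.
final class UmlautTransliterator {
    static String transliterate(String in) {
        StringBuilder out = new StringBuilder(in.length() * 2);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '\u00C4': out.append("Ae"); break; // 196
                case '\u00E4': out.append("ae"); break; // 228
                case '\u00D6': out.append("Oe"); break; // 214
                case '\u00F6': out.append("oe"); break; // 246
                case '\u00DC': out.append("Ue"); break; // 220
                case '\u00FC': out.append("ue"); break; // 252
                case '\u00DF': out.append("ss"); break; // 223
                default:
                    // Keep only the low byte, as the C++ version does.
                    out.append((char) (c & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Gr\u00FC\u00DFe"));
    }
}
```

Like the original, this is lossy: the umlauts cannot be recovered from the output, and characters above U+00FF are truncated to their low byte.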

--
___________________________________________________________________________

ProSTEP GmbH                        Phone: +49-6151-9287381
Thomas Conradi                      Fax:   +49-6151-9287381
Julius-Reiber Str. 15               Email: conradi@prostep.de
D-64293 Darmstadt
___________________________________________________________________________