Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/05/13 18:38:03 UTC

Strange behavior with dfdl:encoding

Hello DFDL community,

My input is a single UTF-8 string. Parsing the input generates the expected XML document, but unparsing the XML results in a totally different string. Below is a graphic showing the input, the parse results, and the unparse results; under it are the actual hex bytes. Note how the bytes of the input differ from the bytes of the unparse results. Why is there such a difference between the input and the unparse output? My DFDL schema is at the bottom. /Roger

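[Image not included in the plain-text archive: graphic showing the input, the parse results, and the unparse results, with the hex bytes of each.]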

<xs:element name="UTF-8">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="string" type="xs:string" dfdl:encoding="utf-8" />
            <xs:element name="length" type="xs:integer"
                                       dfdl:inputValueCalc="{ fn:string-length(../string) }" />
        </xs:sequence>
    </xs:complexType>
</xs:element>





Re: Strange behavior with dfdl:encoding

Posted by Steve Lawrence <sl...@apache.org>.
I believe I've found the issue, and it is related to encoding and Windows.

With the CLI, when "xml" is used as the infoset type (which is the
default), Daffodil does not specify an encoding to use to decode the
XML, so Java defaults to the "file.encoding" system property. On
Brandon's and my machines, this property is probably UTF-8, and so the
right thing happens. But since you're on Windows, the default is
probably "Windows-1252". I can reproduce the behavior with the following:

$ export DAFFODIL_JAVA_OPTS="-Dfile.encoding=Windows-1252"
$ daffodil unparse -s test.dfdl.xsd test-UTF-8.xml | xxd
00000000: 46c3 83c2 b8c3 83c2 b6

So we can see that changing the encoding to Windows-1252 does result in
the extra bytes.
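
The double encoding can also be modeled outside Daffodil. The sketch
below is Python, not Daffodil code, and it assumes the input string is
"Føö", which is inferred from the xxd bytes above (the thread never
shows the original text). It takes the UTF-8 bytes of the infoset text,
misreads them as Windows-1252, and writes the result back out as UTF-8:

```python
# Assumption: the original string is "Føö", inferred from the xxd output;
# the thread itself does not show the input text.
original = "Føö"

# Bytes as they appear in a correctly UTF-8-encoded XML infoset.
utf8_bytes = original.encode("utf-8")

# What a reader sees if it decodes those bytes as Windows-1252 instead.
misread = utf8_bytes.decode("cp1252")

# The misread text written back out as UTF-8, as dfdl:encoding="utf-8" asks.
unparsed = misread.encode("utf-8")

print(utf8_bytes.hex())  # 46c3b8c3b6
print(unparsed.hex())    # 46c383c2b8c383c2b6
```

Each two-byte UTF-8 character is read as two separate Windows-1252
characters, and each of those becomes two bytes again on output, which
matches the bytes in the xxd output.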

A workaround would be to modify DAFFODIL_JAVA_OPTS to set the Java
file.encoding property to "UTF-8", similar to above. Or you could use the
"scala-xml" infoset type (e.g. daffodil unparse -I scala-xml ...) which
does correctly look at the XML preamble to determine encoding. You might
also be able to change your default terminal encoding to UTF-8 with
"chcp 65001", but I'm not sure if Java uses that or not.

I've also created DAFFODIL-2128 to track this issue. When using the
"xml" infoset type, we should be inspecting the XML preamble to
determine the encoding.
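
As a rough illustration of what inspecting the preamble means, here is
a minimal Python sketch. It is not how Daffodil implements or will
implement DAFFODIL-2128; the helper names sniff_xml_encoding and
decode_infoset are hypothetical, and the sketch ignores byte order marks
and UTF-16/UTF-32 input, which a real XML parser also has to handle:

```python
import re

# Matches an XML declaration at the start of the data and captures the
# value of its encoding pseudo-attribute, if one is present.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, else the default."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else default


def decode_infoset(data: bytes) -> str:
    """Decode XML bytes using the declared encoding, not the platform default."""
    return data.decode(sniff_xml_encoding(data))


# Usage: the declaration, not the environment, decides how bytes are read.
sample = '<?xml version="1.0" encoding="windows-1252"?><s>Føö</s>'.encode("cp1252")
print(sniff_xml_encoding(sample))
print(decode_infoset(sample))
```

The point of the design is that the result no longer depends on the
machine the command runs on: the same file decodes the same way whether
the platform default is UTF-8 or Windows-1252.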

- Steve

On 5/14/19 8:05 AM, Steve Lawrence wrote:
> [...]


Re: Strange behavior with dfdl:encoding

Posted by Steve Lawrence <sl...@apache.org>.
I've seen encoding issues similar to this when running on Windows. One
potential cause is the way you are getting the XML into a file (e.g.
copy and paste, or redirection in a shell): Windows might be changing
the encoding and creating XML that isn't encoded as UTF-8 but as
something else. If the XML is wrong, the unparsed output will be wrong too.

So in addition to the full schema, it might also be helpful to attach
the actual XML file that you are unparsing and we can see what the
encoding of that file is.

- Steve

On 5/13/19 5:15 PM, Sloane, Brandon wrote:
> [...]


Re: Strange behavior with dfdl:encoding

Posted by "Sloane, Brandon" <bs...@tresys.com>.
Roger,


I am unable to reproduce this. Can you post a complete schema?


Looking at your output, the only thing that stands out is that the bytes 83 C2 are being inserted between each character. My guess is that you are setting some property that changes how strings are encoded, but I don't see anything that could cause this type of encoding behavior.


Below is the schema I tried which does not reproduce this problem.


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:tns="urn:a"
           xmlns:ex="http://example.com"
           xmlns:fn="http://www.w3.org/2005/xpath-functions"
           targetNamespace="urn:a" >
  <xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="tns:GeneralFormat" />
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="UTF-8">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="string" type="xs:string" dfdl:encoding="utf-8"
                    dfdl:lengthKind="pattern" dfdl:lengthPattern=".*" />
        <xs:element name="length" type="xs:integer"
                    dfdl:inputValueCalc="{ fn:string-length(../string) }" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>


