Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/02/04 19:50:19 UTC

Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

Hello DFDL community,

As Steve explained a while back, endian-ness applies to multi-byte words.

Endian-ness does not apply to ASCII characters because each character is a single byte.

Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF-32LE because each character uses multiple bytes.

Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses the two bytes C3 and A9? Does endian-ness apply? For example, if a file is in Little Endian, would the character é appear in a hex editor as A9 C3, whereas if the file is in Big Endian it would appear as C3 A9?

/Roger

Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

Posted by Steve Lawrence <sl...@apache.org>.
Nope, byte order never applies to UTF-8. UTF-8 is defined as a sequence
of bytes in a fixed order, so the é character would always appear as
C3 A9 in the data, regardless of the byte order. Also note that the
dfdl:byteOrder property does not apply to encodings like UTF-16BE or
UTF-32LE. The byte order is defined by the character encoding itself,
and so dfdl:byteOrder is ignored.

- Steve
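
For anyone who wants to check this outside of DFDL, here is a minimal Python sketch (not part of the original exchange) that prints the encoded bytes of é under several encodings. The helper name hex_bytes is made up for illustration; the codec names are Python's standard ones. It only demonstrates how the encodings themselves lay out bytes and says nothing about how Daffodil is implemented.

```python
# Illustrative sketch: how the character é (U+00E9) is laid out in bytes
# under different encodings. hex_bytes is a hypothetical helper name.

def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor would show them, e.g. 'C3 A9'."""
    return " ".join(f"{b:02X}" for b in data)

ch = "\u00e9"  # é

# UTF-8 has a single, fixed byte sequence for each character.
print("UTF-8    ", hex_bytes(ch.encode("utf-8")))      # C3 A9

# For UTF-16 and UTF-32 the byte order is part of the encoding name.
print("UTF-16BE ", hex_bytes(ch.encode("utf-16-be")))  # 00 E9
print("UTF-16LE ", hex_bytes(ch.encode("utf-16-le")))  # E9 00
print("UTF-32BE ", hex_bytes(ch.encode("utf-32-be")))  # 00 00 00 E9
print("UTF-32LE ", hex_bytes(ch.encode("utf-32-le")))  # E9 00 00 00

# Swapping the two UTF-8 bytes does not give a "little-endian é";
# A9 is a continuation byte and cannot start a UTF-8 sequence.
try:
    bytes([0xA9, 0xC3]).decode("utf-8")
    swapped_is_valid = True
except UnicodeDecodeError:
    swapped_is_valid = False
print("A9 C3 is valid UTF-8:", swapped_is_valid)       # False
```

The swapped sequence A9 C3 is rejected outright, which is another way of seeing that UTF-8 has no little-endian variant.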
