Posted to dev@daffodil.apache.org by "Michael Beckerle (JIRA)" <ji...@apache.org> on 2018/10/23 16:59:00 UTC

[jira] [Commented] (DAFFODIL-639) unicodeByteOrderMark feature

    [ https://issues.apache.org/jira/browse/DAFFODIL-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660965#comment-16660965 ] 

Michael Beckerle commented on DAFFODIL-639:
-------------------------------------------

The DFDL Workgroup is discussing (Oct 2018) whether all of this BOM functionality should be optional. If so, we're unlikely to implement this at all.


> unicodeByteOrderMark feature
> ----------------------------
>
>                 Key: DAFFODIL-639
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-639
>             Project: Daffodil
>          Issue Type: New Feature
>          Components: Back End, DFDL Language
>            Reporter: Michael Beckerle
>            Priority: Minor
>
> This is not a property. The unicodeByteOrderMark is a member of the Infoset Document Item (aka the root element).
> It depends on the dfdl:encoding property, which can be a runtime expression; hence, this must be computed in an Evaluatable which in turn evaluates the encodingEv.
> Likely an Evaluatable[Option[ByteOrder]] is the type. 
> If no encoding property is defined this should be a constant None. 
> If the encoding property is defined and known to NOT be one of UTF-8, UTF-16, or UTF-32, then this should be a constant None. 
> When unparsing, the value will either have been set from parsing, or can be set from an API call. (New API method on Infoset needed.)
> The API call is allowed, but the value ignored/unused by the unparser unless the encoding is UTF-8, UTF-16, or UTF-32. 
> When the encoding evaluates to UTF-8, then the unicodeByteOrderMark will be determined by the first 3 bytes being:
> * 0xEF 0xBF 0xBE - ByteOrder.LittleEndian - 3 bytes are consumed (note: strictly speaking, this shouldn't occur, but it will if a naive UTF-8 encoder encodes a little-endian BOM into a 3-byte UTF-8 sequence. To ensure such data will round trip between UTF-8 and UTF-16 (LE, via BOM), we match this sequence and choose LittleEndian byte order)
> * 0xEF 0xBB 0xBF - ByteOrder.BigEndian - 3 bytes are consumed
> * anything else - no bytes are consumed, and the unicodeByteOrderMark is not set (has no value)
> When unparsing, if unicodeByteOrderMark is not set, then no byte order mark is output.
> For UTF-16,
> * 0xFE 0xFF - ByteOrder.BigEndian - 2 bytes are consumed
> * 0xFF 0xFE - ByteOrder.LittleEndian - 2 bytes are consumed
> * anything else - parse error.
> When unparsing, if encoding is UTF-16, and unicodeByteOrderMark is not set - unparse error.
> UTF-32 works like UTF-16, except the byte patterns are 0x00 0x00 0xFE 0xFF for ByteOrder.BigEndian, and 0xFF 0xFE 0x00 0x00 for ByteOrder.LittleEndian (4 bytes are consumed).
> Recommended: package this code for reuse, on the assumption that it will be needed as a library for reading/decoding strings generally. The runtime errors above for an unknown byte order may be augmented in the future by a mode where each individual text string is examined, at fine granularity, for a byte order mark at its start. There may also be a need for heuristic UTF-16 byte-order determination, that is, looking at the bytes of the characters and determining whether they make more sense as big-endian or little-endian.
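
The parse and unparse rules quoted above can be sketched as a small standalone Scala library, in the spirit of the "package this code for reuse" recommendation. This is only a rough illustration, not Daffodil code: ByteOrder, BomResult, UnicodeBom, detect, and bomBytes are all hypothetical names, the encoding is assumed to arrive as an already-evaluated string rather than through an Evaluatable, and writing 0xEF 0xBF 0xBE back out for a LittleEndian mark under UTF-8 is an assumption made for round-tripping that the issue does not state.

```scala
import java.util.Locale

// All names below are hypothetical; this is not Daffodil's actual API.
sealed trait ByteOrder
object ByteOrder {
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder
}

// Outcome of examining the first bytes of the data.
sealed trait BomResult
final case class Found(order: ByteOrder, consumed: Int) extends BomResult
case object NotPresent extends BomResult // nothing consumed, mark stays unset
final case class BomError(message: String) extends BomResult

object UnicodeBom {
  import ByteOrder._

  // True when data begins with the given unsigned byte values.
  private def hasPrefix(data: Array[Byte], prefix: Int*): Boolean =
    data.length >= prefix.length &&
      prefix.indices.forall(i => (data(i) & 0xFF) == prefix(i))

  // Parse side: decide the byte order mark from the start of the data.
  def detect(encoding: String, data: Array[Byte]): BomResult =
    encoding.toUpperCase(Locale.ROOT) match {
      case "UTF-8" =>
        if (hasPrefix(data, 0xEF, 0xBB, 0xBF)) Found(BigEndian, 3)
        else if (hasPrefix(data, 0xEF, 0xBF, 0xBE)) Found(LittleEndian, 3)
        else NotPresent
      case "UTF-16" =>
        if (hasPrefix(data, 0xFE, 0xFF)) Found(BigEndian, 2)
        else if (hasPrefix(data, 0xFF, 0xFE)) Found(LittleEndian, 2)
        else BomError("UTF-16 data does not start with a byte order mark")
      case "UTF-32" =>
        if (hasPrefix(data, 0x00, 0x00, 0xFE, 0xFF)) Found(BigEndian, 4)
        else if (hasPrefix(data, 0xFF, 0xFE, 0x00, 0x00)) Found(LittleEndian, 4)
        else BomError("UTF-32 data does not start with a byte order mark")
      case _ => NotPresent // any other encoding: constant "no value"
    }

  // Unparse side: bytes to emit for a given mark, or an error message.
  def bomBytes(encoding: String, mark: Option[ByteOrder]): Either[String, Array[Byte]] = {
    def b(xs: Int*): Array[Byte] = xs.map(_.toByte).toArray
    (encoding.toUpperCase(Locale.ROOT), mark) match {
      case ("UTF-8", Some(BigEndian))     => Right(b(0xEF, 0xBB, 0xBF))
      // Assumption: emit the same bytes that were matched when parsing.
      case ("UTF-8", Some(LittleEndian))  => Right(b(0xEF, 0xBF, 0xBE))
      case ("UTF-16", Some(BigEndian))    => Right(b(0xFE, 0xFF))
      case ("UTF-16", Some(LittleEndian)) => Right(b(0xFF, 0xFE))
      case ("UTF-32", Some(BigEndian))    => Right(b(0x00, 0x00, 0xFE, 0xFF))
      case ("UTF-32", Some(LittleEndian)) => Right(b(0xFF, 0xFE, 0x00, 0x00))
      case ("UTF-16" | "UTF-32", None)    => Left("unicodeByteOrderMark is not set")
      // UTF-8 with no mark, or a non-Unicode encoding: output nothing.
      case _                              => Right(b())
    }
  }
}
```

A caller on the parse side would advance the input by the consumed count from Found and store the order on the root element; a fine-grained per-string mode or a UTF-16 heuristic, as mentioned in the recommendation, would be separate additions and is not attempted here.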


