Posted to dev@cordova.apache.org by Josh Soref <js...@blackberry.com> on 2013/11/26 16:31:37 UTC

Unicode and XML files

https://issues.apache.org/jira/browse/CB-5468

I have three pull requests which I haven’t actually submitted:
BlackBerry [1], Cli [2], Plugman [3]

Windows, for historical reasons, doesn’t default to treating text files as UTF-8. Instead, files are typically treated as Latin-1 or some other legacy encoding such as Windows-1252. If you want a typical Windows program to treat a file as UTF-8, you are expected to insert the UTF-8 encoding of the Unicode BOM (the bytes EF BB BF) at the start of the file.
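
To make the failure mode concrete, here is a minimal Python sketch (not taken from any of the patches) of a reader that falls back to Windows-1252 unless it sees the UTF-8 BOM. The sample string and the function name are assumptions chosen purely for illustration; real Windows programs vary in how they pick an encoding.

```python
import codecs

# Hypothetical sample text with one non-ASCII character ("café").
SAMPLE = "caf\u00e9"


def decode_like_legacy_reader(data: bytes) -> str:
    """Decode as UTF-8 when the UTF-8 BOM is present, else as Windows-1252.

    This only models the behaviour described above; it is not how any
    particular editor is implemented.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode("cp1252")


without_bom = SAMPLE.encode("utf-8")
with_bom = codecs.BOM_UTF8 + SAMPLE.encode("utf-8")

garbled = decode_like_legacy_reader(without_bom)  # 'cafÃ©' (mojibake)
intact = decode_like_legacy_reader(with_bom)      # 'café'
```

Without the BOM, the two UTF-8 bytes of "é" are read as two separate Windows-1252 characters; with it, the text round-trips.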

Per ConfigurationFiles [4], CLI and Plugman take a user authored config.xml and generate a platform one (actually, the documentation there claims that only Plugman does so, but my work [2] indicates that CLI also does so occasionally…). That file is then used by platform code [1] and it could also be opened by the user (when there’s a problem).

Since Windows editors (including Eclipse on Windows [5]…) don’t use UTF-8 by default, the results can be fairly random or at least inconsistent with expectations.

I’m not absolutely certain that I like my patch set. I could probably get away with only doing [1], but it feels like the right behavior for a parser would be to honor the input format when producing output, i.e., if there were a BOM in the user’s config.xml, produce a BOM in the generated one…
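
The "honor the input format" idea can be sketched as follows. This is a Python illustration under stated assumptions, not the code in [1], [2] or [3]: `split_bom` and `generate_config` are hypothetical helper names, and `transform` stands in for whatever rewriting the tool does to the XML.

```python
import codecs
from typing import Callable, Tuple


def split_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_bom, payload) for UTF-8 input bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


def generate_config(user_config: bytes,
                    transform: Callable[[bytes], bytes]) -> bytes:
    """Rewrite the XML payload and re-emit a BOM only if the input had one.

    Hypothetical helper for illustration; not part of any Cordova API.
    """
    had_bom, payload = split_bom(user_config)
    output = transform(payload)
    return codecs.BOM_UTF8 + output if had_bom else output
```

The unconditional variant would simply always prepend `codecs.BOM_UTF8` to the output, which is less code but changes files that never had a BOM.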

Thoughts?


[1] https://github.com/blackberry/cordova-blackberry/compare/cb_5468?expand=1
[2] https://github.com/blackberry/cordova-cli/compare/cb_5468?expand=1
[3] https://github.com/blackberry/cordova-plugman/compare/cb_5468?expand=1
[4] http://wiki.apache.org/cordova/ConfigurationFiles
[5] http://stackoverflow.com/questions/13792061/why-does-eclipse-use-cp1252-encoding

---------------------------------------------------------------------
This transmission (including any attachments) may contain confidential information, privileged material (including material protected by the solicitor-client or other applicable privileges), or constitute non-public information. Any use of this information by anyone other than the intended recipient is prohibited. If you have received this transmission in error, please immediately reply to the sender and delete this information from your system. Use, dissemination, distribution, or reproduction of this transmission by unintended recipients is not authorized and may be unlawful.


Re: Unicode and XML files

Posted by Brian LeRoux <b...@brian.io>.
Yes please do make noise. Everyone is super busy. Calling ppl out for help
directly is appreciated.


On Tue, Dec 3, 2013 at 8:11 AM, Marcel Kinard <cm...@gmail.com> wrote:

> Here is a list of the component owners in Jira, which may provide some
> hints of who to ping for review, when an email to the list doesn't get a
> response.
>
>
> https://issues.apache.org/jira/browse/CB#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel
>
> On Dec 2, 2013, at 3:18 PM, Michal Mocny <mm...@chromium.org> wrote:
>
> > If you want your stuff
> > reviewed and landed, it's best to find a specific person to do it.
>

Re: Unicode and XML files

Posted by Marcel Kinard <cm...@gmail.com>.
Here is a list of the component owners in Jira, which may provide some hints of who to ping for review, when an email to the list doesn't get a response.

https://issues.apache.org/jira/browse/CB#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel

On Dec 2, 2013, at 3:18 PM, Michal Mocny <mm...@chromium.org> wrote:

> If you want your stuff
> reviewed and landed, it's best to find a specific person to do it.

Re: Unicode and XML files

Posted by Michal Mocny <mm...@chromium.org>.
If everyone is responsible for reviews, no one is.  If you want your stuff
reviewed and landed, it's best to find a specific person to do it.  If you
are working on stuff others are interested in, this is usually not
difficult to do (or heck, they come without asking).  If you aren't, you
may find it hard to get eyeballs.  Working as intended, imho.


On Mon, Dec 2, 2013 at 1:06 PM, Braden Shepherdson <br...@chromium.org> wrote:

> Our "review process" "works" by sending mail to this list, and then pinging
> it after a few days go by with no action, and then pinging it again if we
> still don't get to it. It's two parts diffusion of responsibility/hoping
> someone else will get it, and one part the fact there's no list of
> outstanding reviews.
>
> As to this change, it looks fine to me as long as these BOMs aren't leaking
> into files that will end up on Unixy platforms; many Unix tools don't like
> BOMs.
>
> Braden
>
>
> On Mon, Dec 2, 2013 at 7:50 AM, Josh Soref <js...@blackberry.com> wrote:
>
> > Andrew wrote:
> > >Had this starred for a while, but just reading now.
> >
> > Thanks.
> >
> > I have a bunch of other requests which I’m wondering if people have
> > decided not to consider … I really wish I understood how the review
> > process worked …
> >
> > >Looks like your changes just add the BOM unconditionally (not dependent on
> > >whether it was there already).
> >
> > Yeah, I favor unconditionally, but if someone has an argument for doing it
> > conditionally, I could do extra work...
> >
> > > That said, if it doesn't break anything, it probably is more correct to
> > >have a BOM.

Re: Unicode and XML files

Posted by Braden Shepherdson <br...@chromium.org>.
My sufficient condition for merging this change is "it doesn't break
anything that already works", which means that neither our XML parsing (on
any desktop OS or mobile platform) nor the CLI tools choke on those BOMs.

Can you try it out on Linux or Mac to see what happens to the CLI tools? If
not, I guess push it and we'll see if it breaks for anyone. This may
require an upstream patch against node-elementtree or something, if the
spec insists on parsers needing to recognize these bytes.

Braden


On Mon, Dec 2, 2013 at 2:21 PM, Josh Soref <js...@blackberry.com> wrote:

> Braden wrote:
> >As to this change, it looks fine to me as long as these BOMs aren't
> >leaking
> >into files that will end up on Unixy platforms; many Unix tools don't like
> >BOMs.
>
> Hrm, config.xml is something everyone uses…
>
> http://www.w3.org/TR/xml11/#charencoding
>
> «Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin
> with the Byte Order Mark described in ISO/IEC 10646 or Unicode. This is an
> encoding signature, not part of either the markup or the character data of
> the XML document. XML processors _MUST_ be able to use this character to
> differentiate between UTF-8 and UTF-16 encoded documents.»
>
>
> So, given that the file in question is an xml file, and given that per
> spec all XML processors _MUST_ be able to handle this character in this
> position, I think that there’s no reason not to apply it.
>
>
> In general, you aren’t supposed to `cat` two .xml files together (by the
> very nature of an xml file having a single root element and a closing tag
> for it, the concatenation can’t result in a valid document).
>
> Braden: is this sufficient for you to merge it?

Re: Unicode and XML files

Posted by Josh Soref <js...@blackberry.com>.
Braden wrote:
>As to this change, it looks fine to me as long as these BOMs aren't
>leaking
>into files that will end up on Unixy platforms; many Unix tools don't like
>BOMs.

Hrm, config.xml is something everyone uses…

http://www.w3.org/TR/xml11/#charencoding

«Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin
with the Byte Order Mark described in ISO/IEC 10646 or Unicode. This is an
encoding signature, not part of either the markup or the character data of
the XML document. XML processors _MUST_ be able to use this character to
differentiate between UTF-8 and UTF-16 encoded documents.»


So, given that the file in question is an xml file, and given that per
spec all XML processors _MUST_ be able to handle this character in this
position, I think that there’s no reason not to apply it.
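
As a rough illustration of what "handle this character in this position" means in practice, the sketch below parses the same document with and without a UTF-8 BOM. It uses Python's standard-library ElementTree (backed by expat), not node-elementtree, so it says nothing about how the Cordova tools themselves behave; the XML body and the id value are made-up examples.

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical minimal config.xml content; the id value is invented.
XML_BODY = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<widget id="com.example.app"/>'
)

# Parse the raw bytes once without and once with a leading UTF-8 BOM.
plain = ET.fromstring(XML_BODY)
with_bom = ET.fromstring(codecs.BOM_UTF8 + XML_BODY)

# A conforming processor treats the BOM as an encoding signature only,
# so both parses should produce the same root element and attributes.
same_result = (plain.tag == with_bom.tag
               and plain.attrib == with_bom.attrib)
```

An equivalent check against each parser the tools actually use would be the real test of the concern raised earlier in the thread.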


In general, you aren’t supposed to `cat` two .xml files together (by the
very nature of an xml file having a single root element and a closing tag
for it, the concatenation can’t result in a valid document).

Braden: is this sufficient for you to merge it?



Re: Unicode and XML files

Posted by Braden Shepherdson <br...@chromium.org>.
Our "review process" "works" by sending mail to this list, and then pinging
it after a few days go by with no action, and then pinging it again if we
still don't get to it. It's two parts diffusion of responsibility/hoping
someone else will get it, and one part the fact there's no list of
outstanding reviews.

As to this change, it looks fine to me as long as these BOMs aren't leaking
into files that will end up on Unixy platforms; many Unix tools don't like
BOMs.

Braden


On Mon, Dec 2, 2013 at 7:50 AM, Josh Soref <js...@blackberry.com> wrote:

> Andrew wrote:
> >Had this starred for a while, but just reading now.
>
> Thanks.
>
> I have a bunch of other requests which I’m wondering if people have
> decided not to consider … I really wish I understood how the review
> process worked …
>
> >Looks like your changes just add the BOM unconditionally (not dependent on
> >whether it was there already).
>
> Yeah, I favor unconditionally, but if someone has an argument for doing it
> conditionally, I could do extra work...
>
> > That said, if it doesn't break anything, it probably is more correct to
> >have a BOM.

Re: Unicode and XML files

Posted by Josh Soref <js...@blackberry.com>.
Andrew wrote:
>Had this starred for a while, but just reading now.

Thanks.

I have a bunch of other requests which I’m wondering if people have
decided not to consider … I really wish I understood how the review
process worked …

>Looks like your changes just add the BOM unconditionally (not dependent on
>whether it was there already).

Yeah, I favor unconditionally, but if someone has an argument for doing it
conditionally, I could do extra work...

> That said, if it doesn't break anything, it probably is more correct to
>have a BOM.



Re: Unicode and XML files

Posted by Andrew Grieve <ag...@chromium.org>.
Had this starred for a while, but just reading now.

Looks like your changes just add the BOM unconditionally (not dependent on
whether it was there already). That said, if it doesn't break anything, it
probably is more correct to have a BOM.


On Tue, Nov 26, 2013 at 10:31 AM, Josh Soref <js...@blackberry.com> wrote:

> https://issues.apache.org/jira/browse/CB-5468
>
> I have three pull requests which I haven’t actually submitted:
> BlackBerry [1], Cli [2], Plugman [3]
>
> Windows, for historical reasons, doesn’t default to treating text files as
> UTF-8. Instead, files are typically treated as Latin-1 or some other legacy
> encoding such as Windows-1252. If you want a typical Windows program to
> treat a file as UTF-8, you are expected to insert the UTF-8 encoding of the
> Unicode BOM (the bytes EF BB BF) at the start of the file.
>
> Per ConfigurationFiles [4], CLI and Plugman take a user authored
> config.xml and generate a platform one (actually, the documentation there
> claims that only Plugman does so, but my work [2] indicates that CLI also
> does so occasionally…). That file is then used by platform code [1] and it
> could also be opened by the user (when there’s a problem).
>
> Since Windows editors (including Eclipse on Windows [5]…) don’t use UTF-8
> by default, the results can be fairly random or at least inconsistent with
> expectations.
>
> I’m not absolutely certain that I like my patch set. I could probably get
> away with only doing [1], but it feels like the right behavior for a parser
> would be to honor the input format when producing output, i.e., if there
> were a BOM in the user’s config.xml, produce a BOM in the generated one…
>
> Thoughts?
>
>
> [1] https://github.com/blackberry/cordova-blackberry/compare/cb_5468?expand=1
> [2] https://github.com/blackberry/cordova-cli/compare/cb_5468?expand=1
> [3] https://github.com/blackberry/cordova-plugman/compare/cb_5468?expand=1
> [4] http://wiki.apache.org/cordova/ConfigurationFiles
> [5] http://stackoverflow.com/questions/13792061/why-does-eclipse-use-cp1252-encoding