Posted to j-users@xerces.apache.org by Taki Kamiya <tk...@us.fujitsu.com> on 2008/08/06 02:20:38 UTC

single non-BMP character counted as two characters

Hi,

The following schema, which is supposedly valid, results in this error:

  cvc-length-valid: Value '𠀋' with length = '2' is not facet-valid with respect to length '1' 
  for type '#AnonType_act'.

The default value "&#x2000B;" for attribute "a" is a single non-BMP character.
It is as though a surrogate pair is counted as two characters.

Regards,

-taki



<xsd:schema targetNamespace="urn:foo"
           xmlns:xsd="http://www.w3.org/2001/XMLSchema"
           xmlns:foo="urn:foo">

<xsd:complexType name="ct">
  <xsd:attribute name="a" default="&#x2000B;"><!-- single character in SIP (U+2000B) -->
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length value="1"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:complexType>

</xsd:schema>
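
The count of 2 can be reproduced outside Xerces with a few lines of Java. This is only a sketch to show where the number comes from: the class name SurrogateLengthDemo is made up for illustration, and nothing here is Xerces code. It builds U+2000B from its code point and compares the number of UTF-16 code units with the number of code points.

```java
final class SurrogateLengthDemo {
    // U+2000B built from its code point, so the source file encoding does not matter.
    static final String VALUE = new String(Character.toChars(0x2000B));

    public static void main(String[] args) {
        // length() counts UTF-16 code units (Java chars): prints 2
        System.out.println(VALUE.length());

        // codePointCount() counts Unicode code points: prints 1
        System.out.println(VALUE.codePointCount(0, VALUE.length()));

        // The two code units form the surrogate pair: prints "d840 dc0b"
        System.out.println(Integer.toHexString(VALUE.charAt(0)) + " "
                + Integer.toHexString(VALUE.charAt(1)));
    }
}
```

String.codePointCount requires Java 5 or later.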


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


RE: single non-BMP character counted as two characters

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Taki,

I wasn't implying that we wouldn't fix this. Just hoping that whatever we
end up doing is better than the obvious solution.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

"Taki Kamiya" <tk...@us.fujitsu.com> wrote on 08/06/2008 01:02:54 AM:

> Hi Michael,
>
> I think it deserves to be fixed at some point, because it causes
> some valid schemas to be rejected as invalid, rather than causing
> invalid schemas to be accepted as valid.
>
> I created the test case a while back, in preparation for Vista's
> extended support for a wider variety of fonts, which enables users
> to use non-BMP characters in ways they had not been able to before.
> We modified our in-house schema processor to take surrogate pairs
> into account whenever we check the length, minLength and maxLength
> facets, and have not seen any major performance penalty because
> of that, probably in part because the length facets are not very
> commonly used in practice.
>
> Thanks!
>
> -taki

RE: single non-BMP character counted as two characters

Posted by Taki Kamiya <tk...@us.fujitsu.com>.
Hi Michael,

I think it deserves to be fixed at some point, because it causes
some valid schemas to be rejected as invalid, rather than causing
invalid schemas to be accepted as valid.

I created the test case a while back, in preparation for Vista's
extended support for a wider variety of fonts, which enables users
to use non-BMP characters in ways they had not been able to before.
We modified our in-house schema processor to take surrogate pairs
into account whenever we check the length, minLength and maxLength
facets, and have not seen any major performance penalty because
of that, probably in part because the length facets are not very
commonly used in practice.
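
A minimal sketch of what such a check might look like in Java 5 or later follows. The class LengthFacets and its method names are hypothetical and written only for illustration. They are neither the in-house processor mentioned above nor Xerces code. The only assumption is that the three facets should compare against the number of code points rather than the number of chars.

```java
final class LengthFacets {
    /** Length of a string value in Unicode code points (a surrogate pair counts once). */
    static int lengthInCodePoints(String value) {
        return value.codePointCount(0, value.length());
    }

    /** length facet: the value must have exactly the given number of code points. */
    static boolean satisfiesLength(String value, int length) {
        return lengthInCodePoints(value) == length;
    }

    /** minLength facet: the value must have at least the given number of code points. */
    static boolean satisfiesMinLength(String value, int minLength) {
        return lengthInCodePoints(value) >= minLength;
    }

    /** maxLength facet: the value must have at most the given number of code points. */
    static boolean satisfiesMaxLength(String value, int maxLength) {
        return lengthInCodePoints(value) <= maxLength;
    }
}
```

With this sketch, the default value from the test schema (U+2000B) satisfies a length facet of 1, while a check based on String.length() would see 2.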

Thanks!

-taki

________________________________

From: Michael Glavassevich [mailto:mrglavas@ca.ibm.com] 
Sent: Tuesday, August 05, 2008 6:00 PM
To: j-users@xerces.apache.org
Subject: Re: single non-BMP character counted as two characters



Hi Taki,

It's a long-standing bug/limitation. Xerces uses String.length() (which returns the length of the string in chars rather than Unicode code points) for checking the length facet.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org


Re: single non-BMP character counted as two characters

Posted by Nathan Beyer <nd...@apache.org>.
That's essentially what's happening in the Harmony code base. The code
delegates to Character.codePointCount(CharSequence,int,int), which loops
over the chars looking for high surrogates. This could certainly be
optimized, though.
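
For illustration, a loop of that general shape might look like the sketch below. This is not the Harmony source. The class name CodePointCounter is made up, and the surrogate ranges are compared numerically so the code does not depend on any Java 5 API. It starts from the char count and subtracts one for every high surrogate that is immediately followed by a low surrogate, so unpaired surrogates still count as one each.

```java
final class CodePointCounter {
    /** Counts Unicode code points in s; a valid surrogate pair counts once. */
    static int count(CharSequence s) {
        int n = s.length();
        int count = n;
        // Stop at n - 1: a high surrogate in the last position cannot start a pair.
        for (int i = 0; i < n - 1; i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {          // high surrogate
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) { // followed by low surrogate
                    count--;  // the pair is one code point, not two
                    i++;      // skip the low surrogate
                }
            }
        }
        return count;
    }
}
```

For input that contains only BMP characters outside the surrogate range, the loop performs one range comparison per char and never enters the inner branch.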

-Nathan

On Tue, Aug 5, 2008 at 8:44 PM, Michael Glavassevich <mr...@ca.ibm.com> wrote:

> Hi Nathan,
>
> Is the implementation of that method any better than iterating over the
> string and counting the number of code points? I think the last time I
> noticed this bug in the code I resisted fixing it because of the negative
> performance impact on the majority of input which only contains characters
> in the BMP.
>
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: single non-BMP character counted as two characters

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Nathan,

Is the implementation of that method any better than iterating over the
string and counting the number of code points? I think the last time I
noticed this bug in the code I resisted fixing it because of the negative
performance impact on the majority of input which only contains characters
in the BMP.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

nbeyer@gmail.com wrote on 08/05/2008 09:34:35 PM:

> This might be an additional impetus to move the code base for future
> development to Java 5 libraries, so things like
> String.codePointCount can be used.
>
> -Nathan


Re: single non-BMP character counted as two characters

Posted by Nathan Beyer <nd...@apache.org>.
This might be an additional impetus to move the code base for future
development to Java 5 libraries, so things like String.codePointCount can be
used.

-Nathan

On Tue, Aug 5, 2008 at 7:59 PM, Michael Glavassevich <mr...@ca.ibm.com> wrote:

> Hi Taki,
>
> It's a long-standing bug/limitation. Xerces uses String.length() (which
> returns the length of the string in chars rather than Unicode code points)
> for checking the length facet.
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: single non-BMP character counted as two characters

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Taki,

It's a long-standing bug/limitation. Xerces uses String.length() (which
returns the length of the string in chars rather than Unicode code points)
for checking the length facet.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

"Taki Kamiya" <tk...@us.fujitsu.com> wrote on 08/05/2008 08:20:38 PM:

> Hi,
>
> The following schema, which is supposedly valid, results in this error:
>
>   cvc-length-valid: Value '𠀋' with length = '2' is not facet-valid with respect to length '1'
>   for type '#AnonType_act'.
>
> The default value "&#x2000B;" for attribute "a" is a single non-BMP character.
> It is as though a surrogate pair is counted as two characters.
>
> Regards,
>
> -taki
>
>
>
> <xsd:schema targetNamespace="urn:foo"
>            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>            xmlns:foo="urn:foo">
>
> <xsd:complexType name="ct">
>   <xsd:attribute name="a" default="&#x2000B;"><!-- single character in SIP (U+2000B) -->
>     <xsd:simpleType>
>       <xsd:restriction base="xsd:string">
>         <xsd:length value="1"/>
>       </xsd:restriction>
>     </xsd:simpleType>
>   </xsd:attribute>
> </xsd:complexType>
>
> </xsd:schema>