
Posted to dev@jena.apache.org by Claude Warren <cl...@xenei.com> on 2018/02/13 13:10:38 UTC

XMLChar.isNameStart error?

Is there a reason that Jena does not support the full range of XML name
start characters?

see https://www.w3.org/TR/xml/#NT-NameStartChar

I wrote a quick test and found that there were a number of characters that
Jena does not support.
Miscategorization appears to start at 0x132.  There are 936990
miscategorized characters.

The issue is actually in the Xerces util class XMLChar

Is this because of the version of Xerces we are stuck with?  Is there a way
around this issue?

Claude

p.s. Since I can't attach a file, here is the test code I wrote.

import static org.junit.Assert.assertTrue;

import org.apache.xerces.util.XMLChar;
import org.junit.Test;

public class NameTest {
    /*
     * NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
     * [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
     * [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
     * [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
     */

    // Inclusive { first, last } code point ranges from the production above.
    int[][] ranges = { { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
            { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
            { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
            { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
            { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };

    // Fails on the first code point that XMLChar does not accept.
    @Test
    public void testNameStart() {
        for (int[] range : ranges) {
            for (int c = range[0]; c <= range[1]; c++) {
                assertTrue( String.format( "character %s 0x%s", c, Integer.toHexString( c ) ),
                        XMLChar.isNameStart( c ) );
            }
        }
    }

    // Prints every rejected code point, 25 per line, then the total.
    @Test
    public void listNameStartErr() {
        int cnt = 0;
        for (int[] range : ranges) {
            for (int c = range[0]; c <= range[1]; c++) {
                if (!XMLChar.isNameStart( c )) {
                    System.out.print( String.format( "0x%s ", Integer.toHexString( c ) ) );
                    cnt++;
                    if (cnt % 25 == 0) {
                        System.out.println();
                    }
                }
            }
        }
        System.out.println();
        System.out.println( cnt + " characters miscategorized" );
    }

}
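
For a sense of scale, the sizes of the ranges in the production can be added up without Xerces on the classpath. The following is only a sketch: the class RangeSizes and its method countRejected are names made up for this illustration, and the predicates passed in are stand-ins, not the Xerces implementation.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Jena or Xerces.
final class RangeSizes {
    // Inclusive ranges of the NameStartChar production quoted in the test above.
    static final int[][] RANGES = { { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
            { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
            { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
            { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
            { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };

    // Counts the code points in RANGES that the given predicate rejects.
    static int countRejected(IntPredicate accepts) {
        int count = 0;
        for (int[] r : RANGES) {
            for (int c = r[0]; c <= r[1]; c++) {
                if (!accepts.test(c)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A predicate that rejects everything yields the size of the whole production.
        System.out.println("all code points:     " + countRejected(c -> false));
        // A predicate that only knows the BMP rejects the whole supplementary range.
        System.out.println("supplementary range: " + countRejected(c -> c < 0x10000));
    }
}
```

The range [#x10000-#xEFFFF] alone holds 917504 code points, so at least 936990 - 917504 = 19486 of the reported rejections must lie below 0x10000. Passing XMLChar::isNameStart as the predicate should reproduce the count printed by listNameStartErr.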


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: XMLChar.isNameStart error?

Posted by Andy Seaborne <an...@apache.org>.

This is about "editions" of XML 1.0.

On 14/02/18 10:52, Claude Warren wrote:
> My error.  I should have specified XML 1.0 as that is the spec that I drew
> the test code from:  https://www.w3.org/TR/xml/#NT-NameStartChar

Is the XMLChar in the JDK correct? I don't know what edition the 
built-in Java XML parser supports.
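
One way to probe this without touching JDK-internal classes is to hand the parser a tiny document whose element name starts with the character in question. This is a sketch only: JdkNameProbe and parserAccepts are hypothetical names, and the answer depends on whichever JAXP implementation DocumentBuilderFactory.newInstance() picks up, which is the JDK's own parser only if no other implementation (such as Xerces) is on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe, not part of Jena.
final class JdkNameProbe {
    // True if the configured parser accepts <X/> where X is the given code point.
    static boolean parserAccepts(int codePoint) {
        String name = new String(Character.toChars(codePoint));
        String xml = "<" + name + "/>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps them off stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed according to this parser.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("'a'   accepted: " + parserAccepts('a'));
        System.out.println("0x132 accepted: " + parserAccepts(0x132));
    }
}
```

The result for 0x132 is deliberately not stated here; it is the thing being probed.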

> 
> So this is an error in Xerces to meet the XML 1.0 naming spec.  I have
> opened a defect with Xerces (
> https://issues.apache.org/jira/browse/XERCESJ-1690)  but I don't expect
> much movement there.

Apache Xerces claims support for "XML 1.0 (4th Edition)", not edition 5.

I looked at XML 1.0 edition 4 and it looks different:
"| [#x0100-#x0131] | [#x0134-#x013E] |"
so no #x132.
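
The gap can be made concrete with a few lines of Java. This sketch models only the fragment of the edition 4 production quoted above, not the whole of it, next to the single edition 5 range [#xF8-#x2FF] from the test code; EditionGap and its method names are invented for the illustration.

```java
// Hypothetical illustration, not taken from Xerces or Jena.
final class EditionGap {
    // Only the quoted fragment of the edition 4 production.
    static boolean inFourthEditionFragment(int cp) {
        return (cp >= 0x0100 && cp <= 0x0131) || (cp >= 0x0134 && cp <= 0x013E);
    }

    // The edition 5 NameStartChar range that covers the same area.
    static boolean inFifthEditionRange(int cp) {
        return cp >= 0xF8 && cp <= 0x2FF;
    }

    public static void main(String[] args) {
        for (int cp = 0x130; cp <= 0x135; cp++) {
            System.out.printf("0x%X edition4=%b edition5=%b%n",
                    cp, inFourthEditionFragment(cp), inFifthEditionRange(cp));
        }
    }
}
```

In this fragment, 0x132 and 0x133 are the two code points that fall between the edition 4 ranges while sitting inside the edition 5 range, which matches the first miscategorized character reported at the start of the thread.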

Xerces was going to release 2.12 last year but I think that ran out of 
energy.  Not sure what edition is targeted.

----

Jena is not so heavily tied to Xerces.  There are only a couple of files 
that import org.apache.xerces datatype code.

We could extract the datatype source and adopt it, then use the Java 
built-in parser or any other, because we would then neither depend on nor 
ship Xerces.

Xerces gets to be the XML parser by ServiceLoading.
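
To see which implementation actually wins that lookup in a given classpath, the JAXP factories can simply be asked for their class names. A minimal sketch, with WhichParser as a made-up class name; the output depends entirely on the local classpath and JDK, so none is shown.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic, not part of Jena.
final class WhichParser {
    // Class name of the DOM factory that JAXP selected.
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of the SAX factory that JAXP selected.
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```

A name under org.apache.xerces would indicate the Xerces jar was picked up; a name under com.sun.org.apache.xerces.internal would indicate the parser bundled with the JDK.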

----

 >> it will not split the URL correctly.

A "feature" of RDF/XML.

Actually, there isn't a "correct split", though we all expect a split at 
"/" or "#".

     Andy



Re: XMLChar.isNameStart error?

Posted by Claude Warren <cl...@xenei.com>.

My error.  I should have specified XML 1.0 as that is the spec that I drew
the test code from:  https://www.w3.org/TR/xml/#NT-NameStartChar

So this is a failure in Xerces to meet the XML 1.0 naming spec.  I have
opened a defect with Xerces
(https://issues.apache.org/jira/browse/XERCESJ-1690) but I don't expect
much movement there.

Claude


On Wed, Feb 14, 2018 at 10:38 AM, Rob Vesse <rv...@dotnetrdf.org> wrote:

> If memory serves, this is mostly historical: once upon a time RDF/XML was
> the only serialisation available and so everything had to be XML compliant.
> Obviously things have evolved over time but the implementation is
> conservative in this regard.
>
> Also, I think XML 1.1 post-dates RDF/XML and various other specifications,
> all of which are defined in terms of XML 1.0. For maximum compatibility it
> is better for us to be conservative because most of the ecosystem has not
> adopted XML 1.1 yet.
>
> Rob



Re: XMLChar.isNameStart error?

Posted by Rob Vesse <rv...@dotnetrdf.org>.

If memory serves, this is mostly historical: once upon a time RDF/XML was the only serialisation available and so everything had to be XML compliant. Obviously things have evolved over time but the implementation is conservative in this regard.

Also, I think XML 1.1 post-dates RDF/XML and various other specifications, all of which are defined in terms of XML 1.0. For maximum compatibility it is better for us to be conservative because most of the ecosystem has not adopted XML 1.1 yet.

Rob

On 14/02/2018, 09:04, "Claude Warren" <cl...@xenei.com> wrote:

    The issue is that predicate namespaces are parsed with XMLChar.  So if I
    have one that is correctly formed based on XML 1.1 spec but the XMLChar
    code does not recognize the first character of the local name, it will not
    split the URL correctly.  All code that depends upon
    Resource.getNamespace() and Resource.getLocalName() will be incorrect.  It
    seems to me this is a low-level problem.
    
    While it should be easy to fix the parsing problem, I am not certain what
    effect that will have on any other code that is dependent upon the Xerces
    code (where XMLChar originates).
    
    Claude
    

Re: XMLChar.isNameStart error?

Posted by Claude Warren <cl...@xenei.com>.

The issue is that predicate namespaces are parsed with XMLChar.  So if I
have one that is correctly formed based on XML 1.1 spec but the XMLChar
code does not recognize the first character of the local name, it will not
split the URL correctly.  All code that depends upon
Resource.getNamespace() and Resource.getLocalName() will be incorrect.  It
seems to me this is a low-level problem.

While it should be easy to fix the parsing problem, I am not certain what
effect that will have on any other code that is dependent upon the Xerces
code (where XMLChar originates).
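
The effect on splitting can be sketched in isolation. The code below is not the Jena Util implementation; it is a simplified stand-in (SplitSketch, splitPoint and both predicates are invented names, and the name-character test omits several characters the XML spec allows) that scans back over name characters and then forward to the first name-start character. The "narrow" predicate mimics a classifier that rejects 0x132 and 0x133.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch, not the Jena splitting code.
final class SplitSketch {
    // NameStartChar ranges from the test code, without ':' (NCName style).
    static final int[][] START = { { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
            { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
            { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
            { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
            { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };

    static boolean wide(int cp) {
        for (int[] r : START) {
            if (cp >= r[0] && cp <= r[1]) return true;
        }
        return false;
    }

    // Stand-in for a classifier that does not know 0x132 and 0x133.
    static boolean narrow(int cp) {
        return wide(cp) && cp != 0x132 && cp != 0x133;
    }

    // Simplified name character test: start characters plus '-', '.', digits.
    static boolean nameChar(int cp, IntPredicate start) {
        return start.test(cp) || cp == '-' || cp == '.' || (cp >= '0' && cp <= '9');
    }

    // Index at which the local name begins under the given start predicate.
    static int splitPoint(String uri, IntPredicate start) {
        int i = uri.length();
        while (i > 0) {                       // walk back over name characters
            int cp = uri.codePointBefore(i);
            if (!nameChar(cp, start)) break;
            i -= Character.charCount(cp);
        }
        while (i < uri.length()) {            // walk forward to a start character
            int cp = uri.codePointAt(i);
            if (start.test(cp)) break;
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        String uri = "http://example.org/ns#\u0132ssel";
        System.out.println(uri.substring(splitPoint(uri, SplitSketch::wide)));
        System.out.println(uri.substring(splitPoint(uri, SplitSketch::narrow)));
    }
}
```

With the wide predicate the local name keeps its leading U+0132; with the narrow one that character ends up on the namespace side and the local name shrinks to "ssel", which is the kind of mismatch described above for getNamespace() and getLocalName().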

Claude

On Tue, Feb 13, 2018 at 6:50 PM, Andy Seaborne <an...@apache.org> wrote:

> Maybe SplitIRI will help?
>
> It does Turtle splitting as well as XML.
>
>     Andy
>



Re: XMLChar.isNameStart error?

Posted by Andy Seaborne <an...@apache.org>.

Maybe SplitIRI will help?

It does Turtle splitting as well as XML.

     Andy

On 13/02/18 17:39, Claude Warren wrote:
> It is used in org.apache.jena.rdf.model.impl.Util namespace splitting code.

Re: XMLChar.isNameStart error?

Posted by Claude Warren <cl...@xenei.com>.

It is used in org.apache.jena.rdf.model.impl.Util namespace splitting code.

On Tue, Feb 13, 2018 at 4:44 PM, Andy Seaborne <an...@apache.org> wrote:

> Where is XMLChar.isNameStart being used?



Re: XMLChar.isNameStart error?

Posted by Andy Seaborne <an...@apache.org>.

Where is XMLChar.isNameStart being used?

On 13/02/18 13:10, Claude Warren wrote:
> Is there a reason that Jena does not support the full range of XML name
> start characters?