Posted to dev@jena.apache.org by "Tim Harsch (Created) (JIRA)" <ji...@apache.org> on 2011/10/03 17:28:35 UTC

[jira] [Created] (JENA-129) RIOT not parsing UTF combining characters correctly

RIOT not parsing UTF combining characters correctly
---------------------------------------------------

                 Key: JENA-129
                 URL: https://issues.apache.org/jira/browse/JENA-129
             Project: Jena
          Issue Type: Bug
          Components: RIOT
         Environment: Java 1.6, Windows 7, ARQ 2.8.8
            Reporter: Tim Harsch


Background on the issue can be found at the list archive:
http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201110.mbox/%3C4E88320C.8040300@apache.org%3E

RIOT failed to parse the SPARQL 1.0 DAWG test: "i18n/normalization-01.ttl".

In offline email Andy also noted:

I see one oddity:

[[
==== DAWG-Final/i18n/normalization-02.ttl
WARN  [line: 7, col: 8 ] Bad IRI: <eXAMPLE://a/b/%63/%7bfoo%7d#xyz>
Code: 8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment
/../ not at the beginning of a relative reference, or it contains a /./
These should be removed.
]]

because the test is on the input, not the resultant form used for toString.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (JENA-129) RIOT not parsing UTF combining characters correctly

Posted by "Andy Seaborne (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-129:
-------------------------------

    Labels: RIOT  (was: parse)
    

[jira] [Closed] (JENA-129) RIOT not parsing UTF combining characters correctly

Posted by "Andy Seaborne (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne closed JENA-129.
------------------------------


RIOT parses DAWG test: "i18n/normalization-01.ttl". 
                

[jira] [Commented] (JENA-129) RIOT not parsing UTF combining characters correctly

Posted by "Andy Seaborne (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119517#comment-13119517 ] 

Andy Seaborne commented on JENA-129:
------------------------------------

The issue with i18n/normalization-01.ttl is that the tokenization for prefixed names does not include combining characters (Unicode: [#0300-#036F]).

Priority changed to minor.  Combining characters are unusual in URIs, and they lead to trouble anyway because URIs with identical visual appearance can be unequal: é (U+00E9) and é (e followed by U+0301) are different.  The second is two characters, the second of which is a combining diacritic accent.

The background email message has a confusing title: this is not a UTF-8 issue.  Also, the email is best viewed as plain text, which shows the difference between é and e followed by  ́ (U+0301).
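
To make the distinction concrete, here is a minimal sketch in plain Java.  It is not RIOT code; the class and method names are made up for illustration.  It only shows that the two forms are different strings unless something normalizes them, and it does not imply that RIOT normalizes its input.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Jena or RIOT.
class CombiningCharDemo {
    // "é" as a single precomposed code point.
    static final String PRECOMPOSED = "\u00E9";
    // "e" followed by COMBINING ACUTE ACCENT (inside the #0300-#036F range).
    static final String DECOMPOSED = "e\u0301";

    // True if the two strings are equal once both are put into Unicode NFC.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Visually identical, but not equal as strings.
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));        // false
        // One char versus two chars.
        System.out.println(PRECOMPOSED.length());                  // 1
        System.out.println(DECOMPOSED.length());                   // 2
        // Equal only after explicit normalization.
        System.out.println(sameAfterNfc(PRECOMPOSED, DECOMPOSED)); // true
    }
}
```

Two URIs that differ only in this way therefore compare as unequal unless the application normalizes them itself.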

RIOT uses the Java platform decoder for UTF-8.

The validation warning from parsing "DAWG-Final/i18n/normalization-02.ttl" is unrelated.  The message is correct (look at the input URI); the IRI it prints is the post-resolution form.
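
As a rough illustration of why the printed IRI shows no dot segments while the warning still fires, the sketch below removes dot segments with java.net.URI.  This is an assumption-laden example: the input string is modeled on the normalization example in RFC 3986 and is assumed, since the thread does not show the actual input line, and java.net.URI stands in for whatever IRI code RIOT really uses.

```java
import java.net.URI;

// Hypothetical demo class, not part of Jena or RIOT.
class DotSegmentDemo {
    // Removes "." and ".." path segments; scheme case and %-escapes are left alone.
    static String removeDotSegments(String iri) {
        return URI.create(iri).normalize().toString();
    }

    public static void main(String[] args) {
        // Assumed input form containing /./ and /../ segments.
        String input = "eXAMPLE://a/./b/../b/%63/%7bfoo%7d#xyz";
        System.out.println("input:    " + input);
        System.out.println("resolved: " + removeDotSegments(input));
    }
}
```

A check that runs on the input form reports the dot segments, while a message built from the resolved form no longer shows them.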

                

[jira] [Resolved] (JENA-129) RIOT not parsing UTF combining characters correctly

Posted by "Andy Seaborne (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne resolved JENA-129.
--------------------------------

    Resolution: Fixed

TokenizerText has been reworked to use new tests in RiotChars for the specific character classes used in Turtle and SPARQL.  The rules for prefixed names (prefix part and local part) have been rewritten to use these character tests.
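
For readers unfamiliar with the grammar, the following is a sketch of what character-class tests of this kind can look like.  It is not the RiotChars source: the class and method names are hypothetical, and the ranges are taken from the PN_CHARS_BASE, PN_CHARS_U, PN_CHARS and PN_PREFIX productions of the SPARQL grammar.

```java
// Hypothetical sketch, not the actual RiotChars or TokenizerText code.
class PrefixedNameChars {
    private static boolean between(int cp, int lo, int hi) {
        return cp >= lo && cp <= hi;
    }

    // PN_CHARS_BASE from the SPARQL grammar.
    static boolean isPnCharsBase(int cp) {
        return between(cp, 'A', 'Z') || between(cp, 'a', 'z')
            || between(cp, 0x00C0, 0x00D6) || between(cp, 0x00D8, 0x00F6)
            || between(cp, 0x00F8, 0x02FF) || between(cp, 0x0370, 0x037D)
            || between(cp, 0x037F, 0x1FFF) || between(cp, 0x200C, 0x200D)
            || between(cp, 0x2070, 0x218F) || between(cp, 0x2C00, 0x2FEF)
            || between(cp, 0x3001, 0xD7FF) || between(cp, 0xF900, 0xFDCF)
            || between(cp, 0xFDF0, 0xFFFD) || between(cp, 0x10000, 0xEFFFF);
    }

    // PN_CHARS_U ::= PN_CHARS_BASE | '_'
    static boolean isPnCharsU(int cp) {
        return cp == '_' || isPnCharsBase(cp);
    }

    // PN_CHARS adds '-', digits, #x00B7, the combining range and #x203F-#x2040.
    static boolean isPnChars(int cp) {
        return isPnCharsU(cp) || cp == '-' || between(cp, '0', '9')
            || cp == 0x00B7
            || between(cp, 0x0300, 0x036F)   // combining diacritical marks
            || between(cp, 0x203F, 0x2040);
    }

    // PN_PREFIX ::= PN_CHARS_BASE ((PN_CHARS | '.')* PN_CHARS)?
    static boolean isPnPrefix(String s) {
        if (s.isEmpty()) return false;
        int[] cps = s.codePoints().toArray();
        if (!isPnCharsBase(cps[0])) return false;
        for (int i = 1; i < cps.length; i++) {
            boolean last = (i == cps.length - 1);
            if (isPnChars(cps[i])) continue;
            if (cps[i] == '.' && !last) continue;  // '.' may not end the prefix
            return false;
        }
        return true;
    }
}
```

With tests like these, "e\u0301" is accepted as a prefix part because U+0301 is in PN_CHARS, while a name that starts with U+0301 is rejected because the combining range is not in PN_CHARS_BASE.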
                

[jira] [Updated] (JENA-129) RIOT not parsing UTF combining characters correctly

Posted by "Andy Seaborne (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-129:
-------------------------------

    Priority: Minor  (was: Major)
      Labels: parse  (was: UTF-8 parse)
    

[jira] [Assigned] (JENA-129) RIOT not parsing UTF combining characters correctly

Posted by "Andy Seaborne (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne reassigned JENA-129:
----------------------------------

    Assignee: Andy Seaborne
    