Posted to dev@chemistry.apache.org by "Jens Hübel (JIRA)" <ji...@apache.org> on 2011/03/31 12:42:05 UTC

[jira] [Assigned] (CMIS-344) Query parser should not use UTF-8 encoding

     [ https://issues.apache.org/jira/browse/CMIS-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jens Hübel reassigned CMIS-344:
-------------------------------

    Assignee: Jens Hübel

> Query parser should not use UTF-8 encoding
> ------------------------------------------
>
>                 Key: CMIS-344
>                 URL: https://issues.apache.org/jira/browse/CMIS-344
>             Project: Chemistry
>          Issue Type: Bug
>          Components: opencmis-server
>    Affects Versions: OpenCMIS 0.4.0
>            Reporter: Michael Dürig
>            Assignee: Jens Hübel
>         Attachments: CMIS-344.patch
>
>
> QueryUtil converts the query statement to a UTF-8 encoded byte array and feeds that to the lexer instead of using the string directly.
> Instead of
>     CharStream input = new ANTLRInputStream(new ByteArrayInputStream(statement.getBytes("UTF-8")));
> the input stream should be obtained like this:
>     CharStream input = new ANTLRStringStream(statement);
> The former approach corrupts the characters in the CONTAINS clause of a query such as
>     SELECT * FROM cmis:document WHERE CONTAINS ('\u4E2D\u6587')
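
A minimal sketch of how the byte-array detour can damage the statement. It does not use ANTLR at all; it only assumes that the bytes end up being decoded with a charset other than UTF-8 (for instance a platform default), which the thread does not confirm. The class name, the roundTrip helper and the choice of ISO-8859-1 as the "wrong" charset are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QueryEncodingSketch {

    // Encode the statement as UTF-8 bytes, then decode those bytes with the
    // given charset. This mimics a reader that was never told the encoding.
    static String roundTrip(String statement, Charset decodeWith) {
        byte[] utf8 = statement.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        // In Java source the two escapes below are two real CJK characters.
        String statement =
                "SELECT * FROM cmis:document WHERE CONTAINS ('\u4E2D\u6587')";

        String decodedAsUtf8 = roundTrip(statement, StandardCharsets.UTF_8);
        String decodedAsLatin1 = roundTrip(statement, StandardCharsets.ISO_8859_1);

        // Matching charsets preserve the statement.
        System.out.println("UTF-8 round trip intact:      "
                + statement.equals(decodedAsUtf8));
        // A mismatched charset turns each CJK character into three
        // unrelated characters, so the CONTAINS argument no longer matches.
        System.out.println("ISO-8859-1 round trip intact: "
                + statement.equals(decodedAsLatin1));
        System.out.println("lengths: " + statement.length()
                + " vs " + decodedAsLatin1.length());
    }
}
```

Handing the String straight to the lexer, as the patch proposes, avoids the encode/decode step and with it any dependence on which charset the decoding side picks.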

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Assigned] (CMIS-344) Query parser should not use UTF-8 encoding

Posted by Michael Dürig <mi...@gmail.com>.
> Note though that SELECT * FROM cmis:document WHERE CONTAINS
> ('\u4E2D\u6587') isn't actually legal CMISQL, as currently CMISQL has
> no notion of Unicode escaping. The query would have to contain actual
> Unicode characters.

But doesn't this query contain actual Unicode characters? \u4E2D and 
\u6587 are Java Unicode Escapes [1].

Michael
[1] 
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#100850
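
To illustrate the distinction being discussed, here is a small sketch of what the Java compiler does with such escapes. The class and constant names are made up for illustration. A statement written as a Java string literal already contains the real characters by the time it reaches the parser, whereas a statement that arrives with a literal backslash sequence (for example typed by a user or sent by a non-Java client) would need escape support in the query language itself.

```java
class UnicodeEscapeSketch {

    // The compiler translates the two escapes before the string exists,
    // so this constant holds two CJK characters.
    static final String COMPILED = "\u4E2D\u6587";

    // Doubling the backslash keeps the escape text as plain characters:
    // backslash, 'u' and four hex digits, twice.
    static final String LITERAL = "\\u4E2D\\u6587";

    public static void main(String[] args) {
        System.out.println("compiled length: " + COMPILED.length());
        System.out.println("literal length:  " + LITERAL.length());
        System.out.println("first compiled char is U+4E2D: "
                + (COMPILED.charAt(0) == 0x4E2D));
        System.out.println("first literal char is a backslash: "
                + (LITERAL.charAt(0) == '\\'));
    }
}
```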

> NB: Unicode escaping is only specified in SQL-2008, not SQL-92. See
> this for a summary:
> http://hsqldb.org/doc/2.0/guide/dataaccess-chapt.html#N11E65
>
> Florent


RE: [jira] [Assigned] (CMIS-344) Query parser should not use UTF-8 encoding

Posted by Jens Hübel <jh...@opentext.com>.
Interesting. Perhaps this is something we should take back to the TC, then. A future CMIS version should probably support Unicode escaping.

Jens



Re: [jira] [Assigned] (CMIS-344) Query parser should not use UTF-8 encoding

Posted by Florent Guillaume <fg...@nuxeo.com>.
Note though that SELECT * FROM cmis:document WHERE CONTAINS
('\u4E2D\u6587') isn't actually legal CMISQL, as currently CMISQL has
no notion of Unicode escaping. The query would have to contain actual
Unicode characters.
NB: Unicode escaping is only specified in SQL-2008, not SQL-92. See
this for a summary:
http://hsqldb.org/doc/2.0/guide/dataaccess-chapt.html#N11E65

Florent

-- 
Florent Guillaume, Director of R&D, Nuxeo
Open Source, Java EE based, Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87

Re: [jira] [Assigned] (CMIS-344) Query parser should not use UTF-8 encoding

Posted by Florent Guillaume <fg...@nuxeo.com>.
No objection, I probably wasn't aware of ANTLRStringStream when I
wrote that code.

Florent

-- 
Florent Guillaume, Director of R&D, Nuxeo
Open Source, Java EE based, Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87

RE: [jira] [Assigned] (CMIS-344) Query parser should not use UTF-8 encoding

Posted by Jens Hübel <jh...@opentext.com>.
Florent,

As far as I remember, this code originally came from your side. Would you have any objection to applying the proposed patch? Would it break anything on your side?

Jens


