Posted to cvs@cocoon.apache.org by pi...@apache.org on 2005/09/05 01:29:12 UTC

svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Author: pier
Date: Sun Sep  4 16:29:09 2005
New Revision: 278641

URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
Log:
Fixing wrong encoding bug

Modified:
    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Modified: cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
URL: http://svn.apache.org/viewcvs/cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java?rev=278641&r1=278640&r2=278641&view=diff
==============================================================================
--- cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java (original)
+++ cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java Sun Sep  4 16:29:09 2005
@@ -211,7 +211,7 @@
                     parser.setState(EXPRESSION_CHAR_STATE);
                     break;
 
-                case 'ï¿½':
+                case '\u00B4':
                     parser.append(ch);
                     parser.setState(EXPRESSION_SHELL_STATE);
                     break;
@@ -235,10 +235,10 @@
     protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
 
     /**
-     * The parser has encountered 'ï¿½' in <code>{@link EXPRESSION_STATE}</code>
-     * to start a Python string constant.
+     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
+     * <code>{@link EXPRESSION_STATE}</code> to start a Python string constant.
      */
-    protected static final State EXPRESSION_SHELL_STATE = new QuotedState('ï¿½');
+    protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');
 
     /**
      * The parser state

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Niclas Hedhman <ni...@hedhman.org>.

On Monday 05 September 2005 09:52, Pier Fumagalli wrote:
> Nah, I'm pretty confident that on this little nag, I'm right...

Yes, you are (reservation below).

And I find it amazing how difficult this topic is to understand for "most 
people", some of them pretty clever.

Now, adding to the confusion, the JLS initially specified that Java 
source files had to have ISO-8859-1 (IIRC) encoding, later introduced the 
-encoding argument to the compiler, and AFAIU in Java 5 changed the default.

Pier seems to suggest that the platform settings also play a role in which 
encoding the compiler chooses. This I am not aware of.


The only proper way is that Cocoon declare an encoding for source files to 
use, and that this "setting" is explicitly given in the <javac> argument, and 
any deviations are bugs.


Cheers
Niclas

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Stefano Mazzocchi <st...@apache.org>.

Niclas Hedhman wrote:
> On Monday 05 September 2005 14:43, Antonio Gallardo wrote:
> 
> 
>>Of course that I am aware that both codesets (Shift-JIS and ISO-8859-1) are
>>different UNICODE subset. This is same as you stated. 
> 
> 
> No. Pier doesn't mix the difference between Unicode (sequence of characters) 
> and the mapping of those characters to fixed or variable length encoded 
> bytestreams.
> The fact that character 65 in Unicode is in many encodings mapped to the byte 
> value 65 is for convenience only, and that fact should be ignored.
> 
> 
>>Our SVN uses UTF-8 as the default charset (or encoding) or not?
> 
> 
> Subversion uses binary data, and is agnostic to any encodings in the data (or 
> so they say). AFAIU, marking files as text only deals with the line endings 
> and how the diff mails are generated.
> The --encoding argument applies to commit messages.
> Paths, URLs/URIs have additional encoding requirements.

Correct.

And it is also worth noting that SVN before 1.2 and CVS2SVN create a pretty 
broken combination when the commit message in CVS used an encoding that 
was not UTF-8.

As an example, try to get svn log of the apache repository and the svn 
client will fail, because we have three commit messages in latin-1 
placed, as binary, by cvs2svn into svn (and prior to 1.2 there was no 
encoding validation checking in svn) that get moved into the XML file 
that is passed between the svn server and client, which is using UTF-8 
as the encoding.

I've asked infra@ to fix this, but since it is not really high priority (only 
data archaeologists like myself care about those things) it is unlikely to 
get fixed.

Anyhow, I agree with Pier, we should *only* use ASCII and escape unicode 
characters explicitly the \uxxxx way.

-- 
Stefano.

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Sylvain Wallez <sy...@apache.org>.

Niclas Hedhman wrote:

>On Monday 05 September 2005 14:43, Antonio Gallardo wrote:
>
>  
>
>>Of course that I am aware that both codesets (Shift-JIS and ISO-8859-1) are
>>different UNICODE subset. This is same as you stated. 
>>    
>>
>
>No. Pier doesn't mix the difference between Unicode (sequence of characters) 
>and the mapping of those characters to fixed or variable length encoded 
>bytestreams.
>The fact that character 65 in Unicode is in many encodings mapped to the byte 
>value 65 is for convenience only, and that fact should be ignored.
>
>  
>
>>Our SVN uses UTF-8 as the default charset (or encoding) or not?
>>    
>>
>
>Subversion uses binary data, and is agnostic to any encodings in the data (or 
>so they say). AFAIU, marking files as text only deals with the line endings 
>and how the diff mails are generated.
>  
>

Problem is the interpretation of "line ending". On Unix, it's 0x0A, which 
can be part of a multibyte character in a file encoded in UTF-16 (in UTF-8 
every byte of a multibyte character is 0x80 or above, so it cannot occur 
there).

In such a case, although the file is a text file, setting the 
"eol-style=native" property may well break the file... Or is there a way 
to specify the encoding to SVN?
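
As a rough check of which encodings can actually put the line-feed byte 
(0x0A) inside a multibyte character, here is a minimal Java sketch. The 
class and method names are made up for illustration, and U+010A is picked 
only because its code point has 0A as its low byte. In UTF-8 every byte of 
a multibyte sequence is 0x80 or above, so the line-feed byte should never 
show up inside one; in UTF-16 it can.

```java
import java.nio.charset.StandardCharsets;

final class LineFeedBytes {
    /** Returns true if the encoded bytes contain the line-feed byte 0x0A. */
    static boolean containsLineFeedByte(byte[] encoded) {
        for (byte b : encoded) {
            if (b == 0x0A) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+010A, LATIN CAPITAL LETTER C WITH DOT ABOVE (hypothetical test character).
        String text = "\u010A";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);      // two bytes, both >= 0x80
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);  // two bytes: 01 0A
        System.out.println("UTF-8 contains 0x0A:  " + containsLineFeedByte(utf8));
        System.out.println("UTF-16 contains 0x0A: " + containsLineFeedByte(utf16));
    }
}
```

So a tool that rewrites line endings byte by byte would be a real risk for 
UTF-16 files, but not for UTF-8 ones.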

Sylvain

-- 
Sylvain Wallez                        Anyware Technologies
http://people.apache.org/~sylvain     http://www.anyware-tech.com
Apache Software Foundation Member     Research & Technology Director

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Niclas Hedhman <ni...@hedhman.org>.

On Monday 05 September 2005 14:43, Antonio Gallardo wrote:

> Of course that I am aware that both codesets (Shift-JIS and ISO-8859-1) are
> different UNICODE subset. This is same as you stated. 

No. Pier doesn't mix the difference between Unicode (sequence of characters) 
and the mapping of those characters to fixed or variable length encoded 
bytestreams.
The fact that character 65 in Unicode is in many encodings mapped to the byte 
value 65 is for convenience only, and that fact should be ignored.

> Our SVN uses UTF-8 as the default charset (or encoding) or not?

Subversion uses binary data, and is agnostic to any encodings in the data (or 
so they say). AFAIU, marking files as text only deals with the line endings 
and how the diff mails are generated.
The --encoding argument applies to commit messages.
Paths, URLs/URIs have additional encoding requirements.

Cheers
Niclas

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by David Crossley <cr...@apache.org>.

Pier Fumagalli wrote:
> David Crossley wrote:
> >Pier Fumagalli wrote:
> >>
> >>Nah, I'm pretty confident that on this little nag, I'm right...
> >
> >Does anyone have a pier2doc transformer?
> 
> I need to get into a meeting right now, but the first part of the  
> "pier2doc" translation of this thread is here:
> 
> http://www.betaversion.org/~pier/wiki/display/pier/Unicode+and+Encodings

Ah, that was easy for us. I knew there would be one
lying around somewhere. :-) Thanks. Ever grateful.

-David

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Pier Fumagalli <pi...@betaversion.org>.

On 5 Sep 2005, at 04:44, David Crossley wrote:
> Pier Fumagalli wrote:
>>
>> Nah, I'm pretty confident that on this little nag, I'm right...
>
> Does anyone have a pier2doc transformer?

I need to get into a meeting right now, but the first part of the  
"pier2doc" translation of this thread is here:

http://www.betaversion.org/~pier/wiki/display/pier/Unicode+and+Encodings

     Pier

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by David Crossley <cr...@apache.org>.

Pier Fumagalli wrote:
> Stefano Mazzocchi wrote:
> >David Crossley wrote:
> >>Pier Fumagalli wrote:
> >>
> >>>Nah, I'm pretty confident that on this little nag, I'm right...
> >>
> >>Does anyone have a pier2doc transformer?
> >
> >Why do you think this project was started? :-)

;-)

> Darn, I can see that in 9 years my communication skills have hardly  
> improved!

Argh, i did not specify the need properly.

Pier communicates brilliantly. The reason for pier2doc
is to automate the documentation of that.

-David

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Pier Fumagalli <pi...@betaversion.org>.

On 5 Sep 2005, at 17:25, Stefano Mazzocchi wrote:
> David Crossley wrote:
>> Pier Fumagalli wrote:
>>
>>> Nah, I'm pretty confident that on this little nag, I'm right...
>>>
>> Does anyone have a pier2doc transformer?
>>
>
> Why do you think this project was started? :-)

Darn, I can see that in 9 years my communication skills have hardly  
improved!

     Pier

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Stefano Mazzocchi <st...@apache.org>.

David Crossley wrote:
> Pier Fumagalli wrote:
> 
>>Nah, I'm pretty confident that on this little nag, I'm right...
> 
> 
> Does anyone have a pier2doc transformer?

Why do you think this project was started? :-)

-- 
Stefano.

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by David Crossley <cr...@apache.org>.

Pier Fumagalli wrote:
> 
> Nah, I'm pretty confident that on this little nag, I'm right...

Does anyone have a pier2doc transformer?

-David

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Antonio Gallardo <ag...@agssa.net>.

Pier Fumagalli wrote:

> On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
>
>> Pier Fumagalli wrote:
>
>>> Depending on your platform encoding (yours apparently ISO8859-1,  
>>> mine  UTF-8, my wife's -she's japanese- Shift-JIS) that sequence  
>>> (B4) of  BYTES as in the original source code will be interpreted  
>>> as a  different character.
>>
>>
>> The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is  
>> exactly the same as using ISO-8859-1. We need to keep the  sources 
>> in UNICODE and there is also for Japanese: Hiragana,  Katakana, et 
>> al: http://www.unicode.org/charts/
>
>
> Err... Ehmmm.. No... The character in question (Latin-1 character B4,  
> Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4",  
> while in Shift-JIS the same character is encoded as the byte sequence "81  
> 4C", quite different.
>
> Reading the byte sequence "B4" in Shift-JIS will produce Unicode  
> character FF74 (Halfwidth katakana "E"), which is quite different  
> from the acute accent you intended.
>
> Trust me, I've been doing this for 9 years! :-)

Yes, I believe you. :-) When I said that using Shift-JIS and ISO-8859-1 
is the same, I had in mind that neither represents the full Unicode 
spectrum. I was just trying to show this problem in another charset. So in 
fact we have the same problem. Of course I am aware that both 
codesets (Shift-JIS and ISO-8859-1) are different UNICODE subsets. This 
is the same as you stated.

>
>>> Changing the binary sequence B4 to \u00B4 instructs the JVM that  
>>> no  matter what encoding your platform is set to, the resulting  
>>> character  will always (always) be UNICODE 00B4, the Acute Accent,  
>>> part of the  Latin-1 (0X0080) table.
>>
>>
>> If we wrote the code in UNICODE you will have the same effect. It  is 
>> exactly the same as with XML, isn't?
>
>
> Unicode is simply a list of characters. To save them on a disk, you  
> _need_ to use an encoding. Unicode characters are 32bits long (they  
> were 16 bits until Unicode 4 came along, but that ain't important  
> right now), bytes are 8bits long. It's as easy as that. To represent  
> 32 bits in 8, you need to "compress" them (or as said in I18N,  
> "encoding" them).
>
> Some encodings are complete (such as the family of UTF encodings)  
> meaning that the encoding CAN represent ALL Unicode characters, some  
> are not (such as ISO8859-1 which can represent only Unicode  
> characters from 00 to FF).

Yes. Please correct me here if I am wrong: Our SVN uses UTF-8 as the 
default charset (or encoding) or not? If not, then we need to take care 
not only of java sources but also of the chars above 7F in the XML files.

I have a special interest in this, since we write mostly Spanish messages. 
I would like to know if this is needed or not.

Best Regards,

Antonio Gallardo.

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Pier Fumagalli <pi...@betaversion.org>.

On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
> Pier Fumagalli wrote:
>
>> It's not a UTF-8 character, it's an UNICODE character: \u doesn't   
>> mean "UTF" but rather "UNICODE" (which is not an encoding).
>
> First, I apologize because I worded the previous sentence  
> very badly. I wanted to state that I don't see a reason to use Java  
> "Unicode escaping" for this case. Reading the Java Specification,  
> we find [0]: "Programs are written using the Unicode character  
> set." So IMO a UNICODE 00B4, the Acute Accent in Latin-1, should  
> be represented by only one code.

Programs are written using (yes) the UNICODE specification, but  
source ".java" files are not. If you look at the output of "javac -help",  
it will say:

   -encoding <encoding>      Specify character encoding used by source files

So, the encoding parameter (that by default /methinks is the  
platform's default) will interpret the byte stream with the specified  
encoding, and then, the decoded UNICODE character stream will be  
parsed. Much like:

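// Decode the raw bytes with the encoding given on the command line,
// then parse the resulting characters.
InputStream javaSource = new FileInputStream(javaSourceFile);
Reader reader = new InputStreamReader(javaSource, encodingSpecifiedInCommandLine);
parse(reader);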

So, programs are written using UNICODE characters, yes; the source  
files, though, are encoded using some sort of encoding mechanism  
(UTF-8, UTF-16, Shift-JIS, blablabla).
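
To make the "bytes first, characters second" step concrete, here is a 
minimal Java sketch of the decoding stage only. It is an illustration, not 
what javac does internally: the class name, the decode() helper and the 
three-byte sample (a character literal holding the raw byte B4) are all 
assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SourceDecoding {
    /** Decodes raw source bytes into characters using the given encoding. */
    static String decode(byte[] sourceBytes, Charset encoding) {
        StringBuilder chars = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(sourceBytes), encoding)) {
            int c;
            while ((c = reader.read()) != -1) {
                chars.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chars.toString();
    }

    public static void main(String[] args) {
        // A character literal as stored on disk: quote, byte B4, quote.
        byte[] literal = { 0x27, (byte) 0xB4, 0x27 };
        String latin1 = decode(literal, StandardCharsets.ISO_8859_1);
        String utf8 = decode(literal, StandardCharsets.UTF_8);
        System.out.printf("ISO-8859-1 reads U+%04X%n", (int) latin1.charAt(1));
        System.out.printf("UTF-8 reads      U+%04X%n", (int) utf8.charAt(1));
    }
}
```

Read as ISO-8859-1 the middle character is the acute accent; read as UTF-8 
the lone byte B4 is malformed and the reader substitutes the replacement 
character U+FFFD, so the parser never sees the accent at all.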

In our case the sequence "\u00B4" has the same byte representation  
(5c 75 30 30 62 34) in almost-all encodings (UTF-8, US-ASCII,  
ISO8859-1, Shift-JIS, ...), as it's composed by bytes in the range  
from 00 to 7F (which hardly changes in whatever encoding you put them  
into).

Java uses this syntax to represent a UNICODE character, because with  
most of the encodings you can use, it won't normally change its  
UNICODE meaning.
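
A small Java sketch of that point, limited to the three ASCII-compatible 
charsets every JVM is required to ship (the class and method names are 
hypothetical): the six-character escape encodes to identical bytes in all 
of them, while the raw accent does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EscapeBytes {
    /** Returns true if the text encodes to the same bytes in every given charset. */
    static boolean sameBytesEverywhere(String text, Charset... charsets) {
        byte[] reference = text.getBytes(charsets[0]);
        for (Charset cs : charsets) {
            if (!Arrays.equals(reference, text.getBytes(cs))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset[] asciiCompatible = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        // Six plain ASCII characters: backslash, u, 0, 0, B, 4.
        String escape = "\\u00B4";
        // The acute accent itself, a single character above 7F.
        String raw = "\u00B4";
        System.out.println("escape identical: " + sameBytesEverywhere(escape, asciiCompatible));
        System.out.println("raw identical:    " + sameBytesEverywhere(raw, asciiCompatible));
    }
}
```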

That said, this is NOT a safe mechanism, because if (for example) you  
were to read the byte sequence "5c 75 30 30 62 34" using the EBCDIC  
encoding (IBM's mainframes encoding) you wouldn't read "backslash"  
"letter u" "zero" "zero" "letter b" "four", but you would read  
something quite different: "asterisk" "nil" "nil" "nil" "nil" "pn".

For an example of the EBCDIC encoding, look here:
http://www.dynamoo.com/technical/ebcdic.htm
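
If you want to try this yourself, a sketch along these lines should do. It 
assumes the JVM ships an EBCDIC charset under the name "IBM037" (one common 
EBCDIC code page; it is not guaranteed to be installed), so the helper 
returns null when the charset is missing. The class and method names are 
made up for the example.

```java
import java.nio.charset.Charset;

final class EbcdicReading {
    /** Decodes bytes with the named charset, or returns null if this JVM lacks it. */
    static String decodeIfSupported(byte[] bytes, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The byte sequence quoted above: 5c 75 30 30 62 34.
        byte[] escape = { 0x5c, 0x75, 0x30, 0x30, 0x62, 0x34 };
        print("US-ASCII", decodeIfSupported(escape, "US-ASCII"));
        print("IBM037", decodeIfSupported(escape, "IBM037"));
    }

    /** Prints the decoded text as a list of code points. */
    static void print(String name, String decoded) {
        if (decoded == null) {
            System.out.println(name + ": charset not available on this JVM");
            return;
        }
        StringBuilder sb = new StringBuilder(name + ":");
        for (char c : decoded.toCharArray()) {
            sb.append(String.format(" U+%04X", (int) c));
        }
        System.out.println(sb);
    }
}
```

Under US-ASCII the six bytes come back as the escape sequence; under the 
EBCDIC code page the first byte already decodes to an asterisk, so the 
compiler would never recognise an escape there.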

>> Depending on your platform encoding (yours apparently ISO8859-1,  
>> mine  UTF-8, my wife's -she's japanese- Shift-JIS) that sequence  
>> (B4) of  BYTES as in the original source code will be interpreted  
>> as a  different character.
>
> The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is  
> exactly the same as using ISO-8859-1. We need to keep the  
> sources in UNICODE and there is also for Japanese: Hiragana,  
> Katakana, et al: http://www.unicode.org/charts/

Err... Ehmmm.. No... The character in question (Latin-1 character B4,  
Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4",  
while in Shift-JIS the same character is encoded as the byte sequence "81  
4C", quite different.

Reading the byte sequence "B4" in Shift-JIS will produce Unicode  
character FF74 (Halfwidth katakana "E"), which is quite different  
from the acute accent you intended.

Trust me, I've been doing this for 9 years! :-)
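
For anyone who wants to see the numbers, here is a hedged Java sketch that 
prints both directions: what U+00B4 encodes to, and what the single byte B4 
decodes to. The class and helper names are invented for the example, and 
Shift_JIS is only reported when the JVM actually provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AcuteAccentBytes {
    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Bytes the acute accent turns into under the given encoding. */
    static String encodeAcute(Charset cs) {
        return hex("\u00B4".getBytes(cs));
    }

    /** Code point obtained by reading the single byte B4 under the given encoding. */
    static int decodeB4(Charset cs) {
        return new String(new byte[] { (byte) 0xB4 }, cs).codePointAt(0);
    }

    static void report(Charset cs) {
        System.out.printf("%-10s accent -> %-5s  byte B4 -> U+%04X%n",
                cs.name(), encodeAcute(cs), decodeB4(cs));
    }

    public static void main(String[] args) {
        report(StandardCharsets.ISO_8859_1);
        report(StandardCharsets.UTF_8);
        // Shift_JIS is an optional charset, so check before using it.
        if (Charset.isSupported("Shift_JIS")) {
            report(Charset.forName("Shift_JIS"));
        }
    }
}
```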

>> Changing the binary sequence B4 to \u00B4 instructs the JVM that  
>> no  matter what encoding your platform is set to, the resulting  
>> character  will always (always) be UNICODE 00B4, the Acute Accent,  
>> part of the  Latin-1 (0X0080) table.
>
> If we wrote the code in UNICODE you will have the same effect. It  
> is exactly the same as with XML, isn't?

Unicode is simply a list of characters. To save them on a disk, you  
_need_ to use an encoding. Unicode characters are 32bits long (they  
were 16 bits until Unicode 4 came along, but that ain't important  
right now), bytes are 8bits long. It's as easy as that. To represent  
32 bits in 8, you need to "compress" them (or as said in I18N,  
"encoding" them).

Some encodings are complete (such as the family of UTF encodings)  
meaning that the encoding CAN represent ALL Unicode characters, some  
are not (such as ISO8859-1 which can represent only Unicode  
characters from 00 to FF).
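
The "complete versus not complete" distinction can be checked directly in 
Java with CharsetEncoder.canEncode(). A minimal sketch, with a made-up 
class name and sample characters chosen only for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingCoverage {
    /** Returns true if the charset has a byte representation for every character in text. */
    static boolean canRepresent(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String acute = "\u00B4";     // acute accent, inside the 00-FF range
        String katakana = "\uFF74";  // halfwidth katakana E, outside it
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            System.out.printf("%-10s acute=%b katakana=%b%n",
                    cs.name(), canRepresent(cs, acute), canRepresent(cs, katakana));
        }
    }
}
```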

Comparing Unicode to an encoding is like comparing an apple to the  
speed of light: there's nothing in common between the two, but if you  
say that an apple is 1 meter per second, you can say that the speed  
of light is (roughly) 299.792.458 apples.

>> Let's call it defensive programming, and actually, in the source   
>> code, we should be using only characters in the range 00-7F  
>> (Unicode  BASIC-Latin, encoding US-ASCII), as that's the "most- 
>> common" amongst  all different encodings (even if when thinking  
>> about IBM's EBCDIC,  even that one might have some problems in  
>> some cases).
>
> I am sorry, but I do not like to cover the sun with a finger.

???

>  I believe Thorsten Schalab can tell us more about this topic. ;-)

Nah, I'm pretty confident that on this little nag, I'm right...

     Pier

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Antonio Gallardo <ag...@agssa.net>.

Pier Fumagalli wrote:

> On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:
>
>> pier@apache.org wrote:
>>
>>> Author: pier
>>> Date: Sun Sep  4 16:29:09 2005
>>> New Revision: 278641
>>>
>>> URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
>>> Log:
>>> Fixing wrong encoding bug
>>>
>>> Modified:
>>>    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/ 
>>> cocoon/components/language/markup/xsp/XSPExpressionParser.java
>>>
>>> @@ -211,7 +211,7 @@
>>>                     parser.setState(EXPRESSION_CHAR_STATE);
>>>                     break;
>>> -                case 'ï¿½':
>>> +                case '\u00B4':
>>>                     parser.append(ch);
>>>                     parser.setState(EXPRESSION_SHELL_STATE);
>>>                     break;
>>> @@ -235,10 +235,10 @@
>>>     protected static final State EXPRESSION_CHAR_STATE = new  
>>> QuotedState('\'');
>>>     /**
>>> -     * The parser has encountered 'ï¿½' in <code>{@link  
>>> EXPRESSION_STATE}</code>
>>> -     * to start a Python string constant.
>>> +     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute  
>>> Accent) in
>>> +     * <code>{@link EXPRESSION_STATE}</code> to start a Python  
>>> string constant.
>>>      */
>>> -    protected static final State EXPRESSION_SHELL_STATE = new  
>>> QuotedState('ï¿½');
>>> +    protected static final State EXPRESSION_SHELL_STATE = new  
>>> QuotedState('\u00B4');
>>>
>>>
>> Why not only left the original char as it was before your first  
>> change? It was working. Having a UTF-8 IMO is not good.
>
>
> It's not a UTF-8 character, it's an UNICODE character: \u doesn't  
> mean "UTF" but rather "UNICODE" (which is not an encoding).

First, I apologize because I worded the previous sentence very 
badly. I wanted to state that I don't see a reason to use Java "Unicode 
escaping" for this case. Reading the Java Specification, we find [0]: 
"Programs are written using the Unicode character set." So IMO a 
UNICODE 00B4, the Acute Accent in Latin-1, should be represented by 
only one code.

> Depending on your platform encoding (yours apparently ISO8859-1, mine  
> UTF-8, my wife's -she's japanese- Shift-JIS) that sequence (B4) of  
> BYTES as in the original source code will be interpreted as a  
> different character.

The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is 
exactly the same as using ISO-8859-1. We need to keep the sources in 
UNICODE and there is also for Japanese: Hiragana, Katakana, et al: 
http://www.unicode.org/charts/

>
> Changing the binary sequence B4 to \u00B4 instructs the JVM that no  
> matter what encoding your platform is set to, the resulting character  
> will always (always) be UNICODE 00B4, the Acute Accent, part of the  
> Latin-1 (0X0080) table.

If we wrote the code in UNICODE we would have the same effect. It is 
exactly the same as with XML, isn't it?

> Let's call it defensive programming, and actually, in the source  
> code, we should be using only characters in the range 00-7F (Unicode  
> BASIC-Latin, encoding US-ASCII), as that's the "most-common" amongst  
> all different encodings (even if when thinking about IBM's EBCDIC,  
> even that one might have some problems in some cases).

I am sorry, but I do not like to cover the sun with a finger.

I believe Thorsten Schalab can tell us more about this topic. ;-)

Best Regards,

Antonio Gallardo.

[0] 
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Pier Fumagalli <pi...@betaversion.org>.

On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:
> pier@apache.org wrote:
>
>> Author: pier
>> Date: Sun Sep  4 16:29:09 2005
>> New Revision: 278641
>>
>> URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
>> Log:
>> Fixing wrong encoding bug
>>
>> Modified:
>>    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/ 
>> cocoon/components/language/markup/xsp/XSPExpressionParser.java
>>
>> @@ -211,7 +211,7 @@
>>                     parser.setState(EXPRESSION_CHAR_STATE);
>>                     break;
>> -                case 'ï¿½':
>> +                case '\u00B4':
>>                     parser.append(ch);
>>                     parser.setState(EXPRESSION_SHELL_STATE);
>>                     break;
>> @@ -235,10 +235,10 @@
>>     protected static final State EXPRESSION_CHAR_STATE = new  
>> QuotedState('\'');
>>     /**
>> -     * The parser has encountered 'ï¿½' in <code>{@link  
>> EXPRESSION_STATE}</code>
>> -     * to start a Python string constant.
>> +     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute  
>> Accent) in
>> +     * <code>{@link EXPRESSION_STATE}</code> to start a Python  
>> string constant.
>>      */
>> -    protected static final State EXPRESSION_SHELL_STATE = new  
>> QuotedState('ï¿½');
>> +    protected static final State EXPRESSION_SHELL_STATE = new  
>> QuotedState('\u00B4');
>>
>>
> Why not only left the original char as it was before your first  
> change? It was working. Having a UTF-8 IMO is not good.

It's not a UTF-8 character, it's an UNICODE character: \u doesn't  
mean "UTF" but rather "UNICODE" (which is not an encoding).

Depending on your platform encoding (yours apparently ISO8859-1, mine  
UTF-8, my wife's -she's japanese- Shift-JIS) that sequence (B4) of  
BYTES as in the original source code will be interpreted as a  
different character.

Changing the binary sequence B4 to \u00B4 instructs the JVM that no  
matter what encoding your platform is set to, the resulting character  
will always (always) be UNICODE 00B4, the Acute Accent, part of the  
Latin-1 (0X0080) table.

Let's call it defensive programming, and actually, in the source  
code, we should be using only characters in the range 00-7F (Unicode  
BASIC-Latin, encoding US-ASCII), as that's the "most-common" amongst  
all different encodings (even if when thinking about IBM's EBCDIC,  
even that one might have some problems in some cases).

     Pier

Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java

Posted by Antonio Gallardo <ag...@agssa.net>.

pier@apache.org wrote:

>Author: pier
>Date: Sun Sep  4 16:29:09 2005
>New Revision: 278641
>
>URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
>Log:
>Fixing wrong encoding bug
>
>Modified:
>    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
>
>@@ -211,7 +211,7 @@
>                     parser.setState(EXPRESSION_CHAR_STATE);
>                     break;
> 
>-                case 'ï¿½':
>+                case '\u00B4':
>                     parser.append(ch);
>                     parser.setState(EXPRESSION_SHELL_STATE);
>                     break;
>@@ -235,10 +235,10 @@
>     protected static final State EXPRESSION_CHAR_STATE = new QuotedState('\'');
> 
>     /**
>-     * The parser has encountered 'ï¿½' in <code>{@link EXPRESSION_STATE}</code>
>-     * to start a Python string constant.
>+     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute Accent) in
>+     * <code>{@link EXPRESSION_STATE}</code> to start a Python string constant.
>      */
>-    protected static final State EXPRESSION_SHELL_STATE = new QuotedState('ï¿½');
>+    protected static final State EXPRESSION_SHELL_STATE = new QuotedState('\u00B4');
> 
>  
>
Why not just leave the original char as it was before your first change? 
It was working. Having a UTF-8 character IMO is not good.

Best Regards,

Antonio Gallardo.