Posted to dev@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2011/04/07 22:27:51 UTC

character escapes in source? ... was: Re: Eclipse: Invalid character constant

replying to dev...

: in eclipse you need to set your project's character encoding to UTF-8.
	...
: > Some language-specific classes like GermanLightStemmer have invalid
: > character
: > compiler errors for code like:
: >      switch(s[i]) {
: >        case 'Ã¤':
: >        case 'Ã ':
: >        case 'Ã¡':
: > in Eclipse with JDK 1.6
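
A possible explanation for the garbled constants quoted above, sketched in Java: if a UTF-8 source file is read with a single-byte default encoding, each accented letter turns into two characters, which is no longer a valid character constant. ISO-8859-1 stands in here for whatever default the reporter's Eclipse used (an assumption), and MojibakeDemo and misread are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong
    // (single-byte) charset on purpose.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00E4"; // LATIN SMALL LETTER A WITH DIAERESIS
        String garbled = misread(original);
        System.out.println(original.length() + " char -> " + garbled.length() + " chars");
        // Print the code points of the misread text.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The two resulting code points are U+00C3 and U+00A4, which matches the first garbled case label in the quoted report.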

...i seem to remember something similar coming up in the past, and I
thought we decided we should use Java Unicode character escapes instead of
literal UTF-8 characters in the source to minimize the number of headaches
(and make it more self-documenting *exactly* what character we were using).
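
A minimal sketch of what the escape-based style could look like. This is not the actual GermanLightStemmer code; EscapedSwitchDemo and foldA are hypothetical, and the three code points are assumed to be the letters behind the garbled constants quoted above. Because the file is pure ASCII, it compiles the same way under any source encoding.

```java
final class EscapedSwitchDemo {
    // Folds a few accented forms of 'a' to plain 'a'. Each escape pins down
    // the exact code point; the trailing comment gives its Unicode name.
    static char foldA(char c) {
        switch (c) {
            case '\u00E4': // LATIN SMALL LETTER A WITH DIAERESIS
            case '\u00E0': // LATIN SMALL LETTER A WITH GRAVE
            case '\u00E1': // LATIN SMALL LETTER A WITH ACUTE
                return 'a';
            default:
                return c;
        }
    }

    public static void main(String[] args) {
        System.out.println(foldA('\u00E4')); // prints "a"
        System.out.println(foldA('x'));      // prints "x"
    }
}
```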

should we revisit this?


-Hoss

Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Apr 8, 2011 at 2:49 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
> On Fri, Apr 8, 2011 at 03:01, Robert Muir <rc...@gmail.com> wrote:
>> On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter
>> <ho...@fucit.org> wrote:
>>>
>>> : -1. These files should be readable, for maintaining, debugging and
>>> : knowing what's going on.
>>>
>>> Readability is my main concern ... i don't know (and frequently can't
>>> tell) the difference between a lot of non-ASCII characters -- and i'm
>>> guessing i'm not alone.  when it's spelled out explicitly using the
>>> character name or escape code, there is no ambiguity about what character
>>> was intended, or whether it got screwed up by some tool along the way (e.g.
>>> the svn server, an svn client, the patch command, a text editor, an IDE,
>>> ant's "fixcrlf" task, etc...)
>>
>> Please take the time, just 5 or 10 minutes, to look thru some of this
>> source code and tests.
>>
>> Imagine if you couldn't just look at the code to see what it does, but
>> had to decode from some crazy numeric encoding scheme.
>> Imagine if it were this way for things like stopword lists too.
>>
>> It would be basically impossible for you to look at the code and
>> figure out what it does!
>> For example, try looking at the Thai analyzer tests: if these were all
>> numbers, how would you know wtf is going on?
>>
>> Although this comes up from time to time, I stand firm on my -1
>> because it's important to me for the source code to be readable.
>> I'm not willing to give this up just because some people cannot read
>> writing system XYZ.
>>
>> I have said before, I'm willing to change my -1 vote on this, if *ALL*
>> string constants (including English ones) are changed to be character
>> escapes.
>> If you imagine what the code would look like if English string
>> constants were instead codes, then I think you will understand my
>> point of view!
>>
>> It's really, really important to source code readability to be able to
>> open a file and understand what it does, not to have to use some
>> decoder because it uses characters other people don't understand.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> I think having both raw characters /and/ an encoded representation is
> best? (one of them in comments)
> I'm all for Unicode sources, but at least two things hit me repeatedly:
> 1. Tools do screw up, and you have to recover somehow.
> E.g. IntelliJ IDEA's 'shelve' function uses the platform default (MacRoman
> in my case) and I've lost some text on things I shelved but never
> committed anywhere.
> 2. There are characters that all look the same.
> E.g. different whitespace/dashes. Or, (if you have Cyrillic in your
> fonts) I dare you to discern between a/а, c/с, e/е, o/о.
> These are different characters from the Latin and Cyrillic charsets (left
> Latin/right Cyrillic), but in 99% of fonts they are visually identical.
> I had a filter that folded up similar-looking characters, and it
> was documented in exactly this way - raw char+code.
>

I've worked with a lot of characters in Eclipse, and the ones that
confuse my eyes the most are l/1 and O/0.

So again, if we do this, then we must do it for all English text, too.

Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

Posted by Earwin Burrfoot <ea...@gmail.com>.

On Fri, Apr 8, 2011 at 03:01, Robert Muir <rc...@gmail.com> wrote:
> On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
>>
>> : -1. These files should be readable, for maintaining, debugging and
>> : knowing what's going on.
>>
>> Readability is my main concern ... i don't know (and frequently can't
>> tell) the difference between a lot of non-ASCII characters -- and i'm
>> guessing i'm not alone.  when it's spelled out explicitly using the
>> character name or escape code, there is no ambiguity about what character
>> was intended, or whether it got screwed up by some tool along the way (e.g.
>> the svn server, an svn client, the patch command, a text editor, an IDE,
>> ant's "fixcrlf" task, etc...)
>
> Please take the time, just 5 or 10 minutes, to look thru some of this
> source code and tests.
>
> Imagine if you couldn't just look at the code to see what it does, but
> had to decode from some crazy numeric encoding scheme.
> Imagine if it were this way for things like stopword lists too.
>
> It would be basically impossible for you to look at the code and
> figure out what it does!
> For example, try looking at the Thai analyzer tests: if these were all
> numbers, how would you know wtf is going on?
>
> Although this comes up from time to time, I stand firm on my -1
> because it's important to me for the source code to be readable.
> I'm not willing to give this up just because some people cannot read
> writing system XYZ.
>
> I have said before, I'm willing to change my -1 vote on this, if *ALL*
> string constants (including English ones) are changed to be character
> escapes.
> If you imagine what the code would look like if English string
> constants were instead codes, then I think you will understand my
> point of view!
>
> It's really, really important to source code readability to be able to
> open a file and understand what it does, not to have to use some
> decoder because it uses characters other people don't understand.
>

I think having both raw characters /and/ an encoded representation is
best? (one of them in comments)
I'm all for Unicode sources, but at least two things hit me repeatedly:
1. Tools do screw up, and you have to recover somehow.
E.g. IntelliJ IDEA's 'shelve' function uses the platform default (MacRoman
in my case) and I've lost some text on things I shelved but never
committed anywhere.
2. There are characters that all look the same.
E.g. different whitespace/dashes. Or, (if you have Cyrillic in your
fonts) I dare you to discern between a/а, c/с, e/е, o/о.
These are different characters from the Latin and Cyrillic charsets (left
Latin/right Cyrillic), but in 99% of fonts they are visually identical.
I had a filter that folded up similar-looking characters, and it
was documented in exactly this way - raw char+code.
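
The look-alike pairs above can be told apart programmatically. The following is only a sketch of the "raw char + code" idea, not the filter mentioned in the message; ConfusablesDemo and describe are hypothetical names. It relies on the standard Character.getName and Character.UnicodeScript.of methods (Java 7 and later).

```java
final class ConfusablesDemo {
    // Formats a code point as "U+XXXX SCRIPT: NAME" so that visually
    // identical characters from different scripts can be distinguished.
    static String describe(int codePoint) {
        return String.format("U+%04X %s: %s",
                codePoint,
                Character.UnicodeScript.of(codePoint),
                Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // Latin letter followed by its Cyrillic look-alike, given as code points.
        int[] pairs = {'a', 0x0430, 'c', 0x0441, 'e', 0x0435, 'o', 0x043E};
        for (int cp : pairs) {
            System.out.println(describe(cp));
        }
    }
}
```

For the first pair this prints "U+0061 LATIN: LATIN SMALL LETTER A" and "U+0430 CYRILLIC: CYRILLIC SMALL LETTER A".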

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785


Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : -1. These files should be readable, for maintaining, debugging and
> : knowing what's going on.
>
> Readability is my main concern ... i don't know (and frequently can't
> tell) the difference between a lot of non-ASCII characters -- and i'm
> guessing i'm not alone.  when it's spelled out explicitly using the
> character name or escape code, there is no ambiguity about what character
> was intended, or whether it got screwed up by some tool along the way (e.g.
> the svn server, an svn client, the patch command, a text editor, an IDE,
> ant's "fixcrlf" task, etc...)

Please take the time, just 5 or 10 minutes, to look thru some of this
source code and tests.

Imagine if you couldn't just look at the code to see what it does, but
had to decode from some crazy numeric encoding scheme.
Imagine if it were this way for things like stopword lists too.

It would be basically impossible for you to look at the code and
figure out what it does!
For example, try looking at the Thai analyzer tests: if these were all
numbers, how would you know wtf is going on?

Although this comes up from time to time, I stand firm on my -1
because it's important to me for the source code to be readable.
I'm not willing to give this up just because some people cannot read
writing system XYZ.

I have said before, I'm willing to change my -1 vote on this, if *ALL*
string constants (including English ones) are changed to be character
escapes.
If you imagine what the code would look like if English string
constants were instead codes, then I think you will understand my
point of view!

It's really, really important to source code readability to be able to
open a file and understand what it does, not to have to use some
decoder because it uses characters other people don't understand.
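
To make the thought experiment above concrete, here is a small sketch that rewrites a string the way it would have to appear if every constant, English included, were spelled as escapes. EscapeEverythingDemo and toEscapes are hypothetical names for illustration only.

```java
final class EscapeEverythingDemo {
    // Rewrites every UTF-16 code unit of the input as a Java-style escape.
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toEscapes("the"));
    }
}
```

The English stopword "the" comes out as \u0074\u0068\u0065, which is what a reader would face for every word in a fully escaped stopword list.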


Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

Posted by Chris Hostetter <ho...@fucit.org>.

: -1. These files should be readable, for maintaining, debugging and
: knowing what's going on.

Readability is my main concern ... i don't know (and frequently can't
tell) the difference between a lot of non-ASCII characters -- and i'm
guessing i'm not alone.  when it's spelled out explicitly using the
character name or escape code, there is no ambiguity about what character
was intended, or whether it got screwed up by some tool along the way (e.g.
the svn server, an svn client, the patch command, a text editor, an IDE,
ant's "fixcrlf" task, etc...)

: it's the 21st century, we can use Unicode

even if we're ok with using Unicode literals in the source, it would still
be nice in cases like these to have the long name of the character in
question in the comment right next to it so there's no ambiguity about
what was intended.

-Hoss


Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Apr 7, 2011 at 4:27 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> replying to dev...
>
> : in eclipse you need to set your project's character encoding to UTF-8.
>        ...
> : > Some language-specific classes like GermanLightStemmer have invalid
> : > character
> : > compiler errors for code like:
> : >      switch(s[i]) {
> : >        case 'Ã¤':
> : >        case 'Ã ':
> : >        case 'Ã¡':
> : > in Eclipse with JDK 1.6
>
> ...i seem to remember something similar coming up in the past, and I
> thought we decided we should use Java Unicode character escapes instead of
> literal UTF-8 characters in the source to minimize the number of headaches
> (and make it more self-documenting *exactly* what character we were using).
>
>

-1. These files should be readable, for maintaining, debugging and
knowing what's going on.

it's the 21st century, we can use Unicode


RE: character escapes in source? ... was: Re: Eclipse: Invalid character constant

Posted by Steven A Rowe <sa...@syr.edu>.

+1

I took an all-of-the-above approach, including the Unicode character description, for the ASCIIFoldingFilter-based stuff.  E.g. from the mapping file <http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/mapping-FoldToASCII.txt?view=markup>:

	# Ä [LATIN CAPITAL LETTER A WITH DIAERESIS]
	"\u00C4" => "A"

Steve
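
Entries in the style of the mapping-file excerpt above could be generated rather than typed by hand. The following is a sketch under stated assumptions: MappingLineDemo and entry are hypothetical helpers, not part of Lucene or Solr, and the four-digit escape form only covers code points in the Basic Multilingual Plane.

```java
final class MappingLineDemo {
    // Builds a two-line entry: a comment holding the raw character and its
    // Unicode name, followed by the escaped mapping rule.
    // Assumes codePoint is in the Basic Multilingual Plane.
    static String entry(int codePoint, String replacement) {
        String raw = new String(Character.toChars(codePoint));
        return String.format("# %s [%s]\n\"\\u%04X\" => \"%s\"",
                raw, Character.getName(codePoint), codePoint, replacement);
    }

    public static void main(String[] args) {
        System.out.println(entry(0x00C4, "A"));
    }
}
```

Taking the name from Character.getName keeps the comment and the escape from drifting apart, since both are derived from the same code point.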

> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
> Sent: Thursday, April 07, 2011 4:28 PM
> To: Lucene Dev
> Subject: character escapes in source? ... was: Re: Eclipse: Invalid
> character constant
> 
> 
> replying to dev...
> 
> : in eclipse you need to set your project's character encoding to UTF-8.
> 	...
> : > Some language-specific classes like GermanLightStemmer have invalid
> : > character
> : > compiler errors for code like:
> : >      switch(s[i]) {
> : >        case 'Ã¤':
> : >        case 'Ã ':
> : >        case 'Ã¡':
> : > in Eclipse with JDK 1.6
> 
> ...i seem to remember something similar coming up in the past, and I
> thought we decided we should use Java Unicode character escapes instead of
> literal UTF-8 characters in the source to minimize the number of headaches
> (and make it more self-documenting *exactly* what character we were using).
> 
> should we revisit this?
> 
> 
> -Hoss