Posted to dev@lucene.apache.org by Stefan Wachter <St...@gmx.de> on 2004/11/26 11:42:18 UTC

encoding of german analyzer source files

Hi all,

in the 1.4.2 distribution the source files of the German analyzer 
classes are encoded in UTF-8. (CHANGES.txt reports that Otis Gospodnetic 
changed the file encoding of these files to UTF-8. I guess that their 
original encoding was ISO-8859-1.) With UTF-8 encoding these source 
files look rather strange when viewed in an "ISO-8859-1" development 
environment because they contain German umlauts and the "sharp s". In 
addition, they cannot be compiled directly in such an environment. (The 
Lucene build.xml sets the Java compiler encoding to "utf-8", which makes 
things work.)

In order to make the source of the German analyzer class platform 
independent, I propose to use the corresponding Java unicode escapes where 
the special characters are used.
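For illustration, the proposal would look something like this (a hypothetical sketch, not the actual GermanAnalyzer source): the file itself contains only ASCII, but the compiled strings still hold the real characters.

```java
// Hypothetical sketch of Stefan's proposal: the source file stays pure
// ASCII, yet the string constant contains the real German characters
// once compiled, regardless of the encoding the compiler assumes.
public class UmlautEscapes {
    // \u00e4 = a-umlaut, \u00f6 = o-umlaut, \u00fc = u-umlaut, \u00df = sharp s
    public static final String GERMAN_SPECIALS = "\u00e4\u00f6\u00fc\u00df";

    public static void main(String[] args) {
        // 4 characters, no matter which source-file encoding the compiler uses
        System.out.println(GERMAN_SPECIALS.length());
    }
}
```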

Best regards,
--Stefan


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: encoding of german analyzer source files

Posted by Daniel Naber <da...@t-online.de>.
On Friday 26 November 2004 23:40, Murray Altheim wrote:

> In grepping through the source I noted nine instances of a lowercase
> use of "<!doctype", which isn't valid. This should probably be
> registered as a bug. Kinda makes me wonder what's generating that,
> because when I run javadoc on my own stuff this doesn't happen.

This (package.html) isn't generated, and it doesn't appear in the HTML 
generated by javadoc, as javadoc only uses the parts between <body> and 
</body>.

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: encoding of german analyzer source files

Posted by Murray Altheim <m....@open.ac.uk>.
I wrote:
[...]
> You can sniff the first few bytes (which is what is recommended
> in the XML 1.0 spec, you can see how they do it there), but making
> such an assumption may lead to program failure if the assumption
> is incorrect.
> 
>    Extensible Markup Language (XML) 1.0 (Third Edition)
>    Appendix F Autodetection of Character Encodings
>    http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
> 
> The suggestions there are pretty usable for files that have nothing
> to do with XML.

I neglected to mention that the XML method relies on the file
starting with "<?xml". In the case of source files for
the Lucene project, the beginning of a file is likely one of
the following:

    "package..."
    "..."     (some form of whitespace)
    "/*"      (the beginning of an Apache License)
    "<html"   (beginning of HTML file)
    "<!DOCTYPE" (beginning of HTML file)

It wouldn't be too hard to write a sniffer for this. I think almost
all of the Lucene source starts with "package", and if not, it
certainly could.
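A minimal sketch of such a sniffer (hypothetical class and method names), classifying a file by the first characters of its prefix and assuming those leading bytes are in an ASCII-compatible encoding:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the sniffer described above: look at the first
// non-whitespace characters of a file's prefix and guess its kind.
// Assumes the prefix bytes are ASCII-compatible (true for UTF-8,
// ISO-8859-1, etc., but not for UTF-16).
public class StartSniffer {
    public static String classify(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.US_ASCII).trim();
        if (head.startsWith("package") || head.startsWith("/*")) return "java";
        if (head.startsWith("<html")) return "html";
        // case-insensitive match, so lowercase "<!doctype" is caught too
        if (head.regionMatches(true, 0, "<!DOCTYPE", 0, 9)) return "html";
        return "unknown";
    }
}
```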

In grepping through the source I noted nine instances of a lowercase
use of "<!doctype", which isn't valid. This should probably be registered
as a bug. Kinda makes me wonder what's generating that, because when
I run javadoc on my own stuff this doesn't happen.

org/apache/lucene/util/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/index/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/store/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/queryParser/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/search/spans/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/search/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/document/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/standard/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
org/apache/lucene/analysis/package.html:<!doctype html public "-//w3c//dtd html 4.0 transitional//en">

Murray

......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .

   [International Committee of the Red Cross director] Kraehenbuhl
   pointed out that complying with international humanitarian law
   was "an obligation, not an option", for all sides of the conflict.
   "If these rules or any other applicable rules of international
   humanitarian law are violated, the persons responsible must be
   held accountable for their actions," he said. -- BBC News
   http://news.bbc.co.uk/1/hi/world/middle_east/4027163.stm

  "In my judgment, this new paradigm [the War on Terror] renders
   obsolete Geneva's strict limitations on questioning of enemy
   prisoners and renders quaint some of its provisions [...]
   Your determination [that the Geneva Conventions] does not apply
   would create a reasonable basis in law that [the War Crimes Act]
   does not apply, which would provide a solid defense to any future
   prosecution." -- Alberto Gonzalez, appointed US Attorney General,
   and likely Supreme Court nominee, in a memo to George W. Bush
   http://www.adamhodges.com/WORLD/docs/gonzales_memo.pdf



Re: encoding of german analyzer source files

Posted by Murray Altheim <m....@open.ac.uk>.
Andi Vajda wrote:
>>I can tell the NetBeans IDE the encoding of every single source file. But the 
>>problem is that I might not know what the correct encoding is. In the case of 
>>Lucene it is quite clear because it is mentioned in the build.xml file. But 
>>what is the situation if someone sends you a stemmer class, for example for 
>>Swahili, and you do not know in which encoding the author wrote the source? 
>>Then you can try lots of encodings until the java compiler is satisfied 
>>with it. And even then you might not be sure that you used the right 
>>encoding.
> 
>>Therefore it would be great if all Java programmers would agree on the same 
>>encoding of source files (let it be UTF-8, ISO-8859-1 or something really
> 
> Actually, the reason for the change to UTF-8 was that for Lucene to compile on 
> Windows with gcj (MinGW), the encoding had better be UTF-8, because iconv 
> facilities are typically absent there. Therefore, it would be safe to assume that 
> Swahili stemmer source is also encoded in UTF-8.
> 
> Andi..

Andi,

It may seem pretty safe to assume from practice, but from the Java
programmer's point of view, it's still not safe. It's perfectly possible
that the Swahili file is in UTF-8 or UTF-16, little-endian or big-
endian, or perhaps some other encoding we don't even know about.
A minor point I was trying to make is that absent some external
mechanism, there's really *no way* to know the encoding of a file.
You can sniff the first few bytes (which is what is recommended
in the XML 1.0 spec, you can see how they do it there), but making
such an assumption may lead to program failure if the assumption
is incorrect.

   Extensible Markup Language (XML) 1.0 (Third Edition)
   Appendix F Autodetection of Character Encodings
   http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing

The suggestions there are pretty usable for files that have nothing
to do with XML.
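The BOM part of Appendix F's scheme can be sketched in a few lines (hypothetical class and method names); note that when no byte order mark is present, the encoding genuinely cannot be determined from the bytes alone, which is exactly the point above:

```java
// Sketch of the XML Appendix F style of detection: inspect the first
// bytes for a byte order mark. Without a BOM (or external metadata),
// the encoding cannot be known for certain.
public class BomSniffer {
    public static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return "unknown (no BOM)";
    }
}
```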

I don't know how many people on this list are familiar with
O'Reilly's "CJKV Information Processing" (with the puffer fish on
the cover), which opened up my eyes to a new world. After reading
it I got a terrible fright and couldn't sleep for weeks.

   "CJKV Information Processing: Chinese, Japanese, Korean
      & Vietnamese Computing", by Ken Lunde, O'Reilly Publishing.
   http://www.oreilly.com/catalog/cjkvinfo/index.html
   http://www.amazon.com/exec/obidos/tg/detail/-/1565922247/002-2766986-0676059?v=glance&vi=reviews

Murray



Re: encoding of german analyzer source files

Posted by Andi Vajda <an...@osafoundation.org>.
> I can tell the NetBeans IDE the encoding of every single source file. But the 
> problem is that I might not know what the correct encoding is. In the case of 
> Lucene it is quite clear because it is mentioned in the build.xml file. But 
> what is the situation if someone sends you a stemmer class, for example for 
> Swahili, and you do not know in which encoding the author wrote the source? 
> Then you can try lots of encodings until the java compiler is satisfied 
> with it. And even then you might not be sure that you used the right 
> encoding.

> Therefore it would be great if all Java programmers would agree on the same 
> encoding of source files (let it be UTF-8, ISO-8859-1 or something really

Actually, the reason for the change to UTF-8 was that for Lucene to compile on 
Windows with gcj (MinGW), the encoding had better be UTF-8, because iconv 
facilities are typically absent there. Therefore, it would be safe to assume that 
Swahili stemmer source is also encoded in UTF-8.

Andi..




Re: encoding of german analyzer source files

Posted by Stefan Wachter <St...@gmx.de>.
Murray Altheim wrote:

> Stefan Wachter wrote:
>
>> Hi Daniel,
>>
>> I am using NetBeans 3.6, which certainly is unicode aware. Yet 
>> NetBeans does not seem to detect automatically that the source files 
>> of Lucene are UTF-8 encoded. I guess that it uses the platform-specific 
>> default encoding, which is ISO-8859-1 on my Linux operating 
>> system.
>
>
> In Linux you can set the default encoding at platform level,
> at user level, and for individual applications. You're not forced
> to stay within ISO-8859-1. Think about it this way: if that were
> the case, how could a multi-user system like Linux support only
> one encoding per machine? This sounds more like a NetBeans problem
> than an OS problem. I don't use NetBeans, but there must be a way to
> indicate the encoding beyond your particular user settings.
> Otherwise, English programmers couldn't develop non-English programs,
> which is hard to believe.

I can tell the NetBeans IDE the encoding of every single source file. 
But the problem is that I might not know what the correct encoding is. 
In the case of Lucene it is quite clear because it is mentioned in the 
build.xml file. But what is the situation if someone sends you a stemmer 
class, for example for Swahili, and you do not know in which encoding the 
author wrote the source? Then you can try lots of encodings until the 
java compiler is satisfied with it. And even then you might not be 
sure that you used the right encoding.

>
>> I think what Java lacks is a means to indicate the encoding of source 
>> files (e.g. <?java encoding="ISO-8859-1"?> in an XMLish way). The 
>> encoding has to be fed into the system from the outside. What else 
>> could be the reason for having an encoding switch on the java 
>> compiler? Therefore I think it is best for Java source files to 
>> be plain ASCII.
>
>
> Java has quite a lot of localization features built into the
> language. Yes, the encoding has to be specified, just as one
> would have to tell any processor how to decode any given set
> of bytes. Java itself is Unicode aware for anything dealing
> with characters. For dealing with byte streams the encoding
> has to be specified. Here's a good article on the subject:
>
>    http://www.jorendorff.com/articles/unicode/java.html
>
> As for crippling files by forcing them into plain ASCII, why
> would we want to step back 20 years in computer science? It's
> been a long-fought battle to get to where we are now, and the
> desires of a few people to be able to look at a file in ASCII
> are far outweighed by the rest of the world, whose languages
> don't fit into that straitjacket. As was mentioned, it would
> make the code a great deal harder to both read and manage.
>
> I remember looking at a desktop publishing application
> developed at StoneHand in 1996 that had Arabic, Gujarati,
> Japanese, Chinese, English, and Hebrew on the screen at the
> same time and thinking damn! pretty impressive! We now have
> that kind of thing in our browsers and think little of it.
> I'd hate to step back to pre-1996 again.
>
> We should all be using Unicode-aware tools. It's what the rest
> of the world is doing, even in the Anglocentric US. For an
> international project like Lucene, there's no good reason to
> step back in time to ASCII. There are many programmers using
> the Lucene source code that have no problem with Unicode, and
> it would not be in their interest to be suddenly reading
> numeric character entities rather than normally readable text.
>
> Murray

Of course I also like all the unicode awareness of Java. In fact I 
wrote a Java-XML data binding including an XML parser (cf. www.jbind.org) 
that benefited greatly from this awareness. In XML, there is a clearly 
defined mechanism for determining the file encoding (looking at 
the first 4 bytes). In Java, however, there is no such mechanism. If I 
get some sources from somewhere and I want to compile them, then I must 
know their encoding. If there are different encodings for different 
sources in a project, then I have to be careful to call the compiler 
several times with changing encoding switches.

Therefore it would be great if all Java programmers agreed on the same 
encoding for source files (let it be UTF-8, ISO-8859-1 or something 
really exotic). This has nothing to do with the display - it is just the 
file encoding. Of course this is not realistic. So why not just use 
ASCII encoding, supplemented by \u escapes? Of course you 
objected that the sources are then less readable. But programming books 
teach me to factor the text parts out of program code.
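Factoring the text parts out could look something like this (class name and keys are hypothetical): the source stays ASCII-only, with the non-ASCII strings isolated in one bundle class via \u escapes.

```java
import java.util.ListResourceBundle;

// Hypothetical sketch of "factoring the text parts out of program code":
// all non-ASCII strings live in a single resource bundle, written with
// Unicode escapes so the file itself remains plain ASCII.
public class GermanWords extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"street", "Stra\u00dfe"},  // "Strasse", with the sharp s escaped
            {"flowers", "Blumen"},
        };
    }
}
```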

--Stefan






Re: encoding of german analyzer source files

Posted by Murray Altheim <m....@open.ac.uk>.
Stefan Wachter wrote:
> Hi Daniel,
> 
> I am using NetBeans 3.6, which certainly is unicode aware. Yet NetBeans 
> does not seem to detect automatically that the source files of Lucene 
> are UTF-8 encoded. I guess that it uses the platform-specific default 
> encoding, which is ISO-8859-1 on my Linux operating system.

In Linux you can set the default encoding at platform level,
at user level, and for individual applications. You're not forced
to stay within ISO-8859-1. Think about it this way: if that were
the case, how could a multi-user system like Linux support only
one encoding per machine? This sounds more like a NetBeans problem
than an OS problem. I don't use NetBeans, but there must be a way to
indicate the encoding beyond your particular user settings.
Otherwise, English programmers couldn't develop non-English programs,
which is hard to believe.

> I think what Java lacks is a means to indicate the encoding of source 
> files (e.g. <?java encoding="ISO-8859-1"?> in an XMLish way). The 
> encoding has to be fed into the system from the outside. What else could 
> be the reason for having an encoding switch on the java compiler? 
> Therefore I think it is best for Java source files to be plain ASCII.

Java has quite a lot of localization features built into the
language. Yes, the encoding has to be specified, just as one
would have to tell any processor how to decode any given set
of bytes. Java itself is Unicode aware for anything dealing
with characters. For dealing with byte streams the encoding
has to be specified. Here's a good article on the subject:

    http://www.jorendorff.com/articles/unicode/java.html
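The "byte streams need an explicit encoding" point can be shown directly (hypothetical helper): Java never guesses, and the same two bytes decode differently under different charsets.

```java
import java.nio.charset.StandardCharsets;

// Illustrates the point above: Java converts bytes to characters only
// once a charset is named explicitly; nothing is ever guessed.
// (Hypothetical helper class for illustration.)
public class ExplicitDecoding {
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static String decodeLatin1(byte[] bytes) {
        // The same bytes, read as ISO-8859-1, yield different characters.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

For example, the two-byte sequence 0xC3 0xA4 is one character ("a" with umlaut) in UTF-8 but two characters in ISO-8859-1.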

As for crippling files by forcing them into plain ASCII, why
would we want to step back 20 years in computer science? It's
been a long-fought battle to get to where we are now, and the
desires of a few people to be able to look at a file in ASCII
are far outweighed by the rest of the world, whose languages
don't fit into that straitjacket. As was mentioned, it would
make the code a great deal harder to both read and manage.

I remember looking at a desktop publishing application
developed at StoneHand in 1996 that had Arabic, Gujarati,
Japanese, Chinese, English, and Hebrew on the screen at the
same time and thinking damn! pretty impressive! We now have
that kind of thing in our browsers and think little of it.
I'd hate to step back to pre-1996 again.

We should all be using Unicode-aware tools. It's what the rest
of the world is doing, even in the Anglocentric US. For an
international project like Lucene, there's no good reason to
step back in time to ASCII. There are many programmers using
the Lucene source code that have no problem with Unicode, and
it would not be in their interest to be suddenly reading
numeric character entities rather than normally readable text.

Murray



Re: encoding of german analyzer source files

Posted by Stefan Wachter <St...@gmx.de>.
Hi Daniel,

I am using NetBeans 3.6, which certainly is unicode aware. Yet NetBeans 
does not seem to detect automatically that the source files of Lucene 
are UTF-8 encoded. I guess that it uses the platform-specific default 
encoding, which is ISO-8859-1 on my Linux operating system.
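The platform default in question can be inspected directly; presumably this is what NetBeans falls back to when no encoding is configured (a minimal check, not NetBeans-specific):

```java
import java.nio.charset.Charset;

// Prints the JVM's platform default charset, derived from the OS locale
// unless overridden on the command line with -Dfile.encoding=...
// This is the value tools fall back to when no encoding is specified.
public class DefaultEncoding {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset());
    }
}
```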

I think what Java lacks is a means to indicate the encoding of source 
files (e.g. <?java encoding="ISO-8859-1"?> in an XMLish way). The 
encoding has to be fed into the system from the outside. What else could 
be the reason for having an encoding switch on the java compiler? 
Therefore I think it is best for Java source files to be plain ASCII.

Cheers,
--Stefan

Daniel Naber wrote:

>On Friday 26 November 2004 11:42, Stefan Wachter wrote:
>
>>With UTF-8 encoding these source
>>files look rather strange when viewed in an "ISO-8859-1" development
>>environment because they contain German umlauts and the "sharp s".
>
>Your editor / IDE needs to be unicode aware and you have to set it up 
>accordingly. That and the fact that build.xml specifies the encoding 
>explicitly should make everything work, no matter what your default encoding 
>is.
>
>>In order to make the source of the German analyzer class platform
>>independent I propose to use the corresponding Java unicode escapes where
>>the special characters are used.
>
>That makes the source more difficult to read.
>
>Regards
> Daniel




Re: encoding of german analyzer source files

Posted by Daniel Naber <da...@t-online.de>.
On Friday 26 November 2004 11:42, Stefan Wachter wrote:

> With UTF-8 encoding these source
> files look rather strange when viewed in an "ISO-8859-1" development
> environment because they contain German umlauts and the "sharp s".

Your editor / IDE needs to be unicode aware and you have to set it up 
accordingly. That and the fact that build.xml specifies the encoding 
explicitly should make everything work, no matter what your default encoding 
is.

> In order to make the source of the German analyzer class platform
> independent I propose to use the corresponding Java unicode escapes where
> the special characters are used.

That makes the source more difficult to read.

Regards
 Daniel
