Posted to server-dev@james.apache.org by "Noel J. Bergman" <no...@devtech.com> on 2005/05/31 23:24:02 UTC
Encoding issue in BayesianAnalyzer?
Can someone please take a look at BayesianAnalyzer.java? I keep getting:
- || ch == ''
+ || ch == 'i??'
when doing a diff. What is the encoded character supposed to be?
--- Noel
---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org
Re: Encoding issue in BayesianAnalyzer?
Posted by Stefano Bagnara <ap...@bago.org>.
> Yes, that's it. And I am finding that something in our build
> process is corrupting it. I just did:
>
> $ rm src/java/org/apache/james/util/BayesianAnalyzer.java
> $ svn up
> Restored 'src/java/org/apache/james/util/BayesianAnalyzer.java'
> At revision 179287.
> $ svn diff
> $ ./build.sh clean dist-lite
> $ svn diff
> Index: src/java/org/apache/james/util/BayesianAnalyzer.java
> ===================================================================
> --- src/java/org/apache/james/util/BayesianAnalyzer.java (revision 179287)
> +++ src/java/org/apache/james/util/BayesianAnalyzer.java (working copy)
> @@ -471,7 +471,7 @@
> if (Character.isLetter(ch)
> || ch == '-'
> || ch == '$'
> - || ch == ''
> + || ch == '�'
> || ch == '!'
> || ch == '\''
> ) {
>
> Now, we run <fixcrlf> during the build process. Could Ant be
> corrupting the file?
>
> > It is probably the euro character (€, Unicode \u20AC, I think).
> > http://www.fileformat.info/info/unicode/char/20ac/index.htm
>
> Perhaps the safest thing is to hex encode the character.
The build script does not corrupt my file, but my default charset supports "€" in a non-Unicode encoding.
I agree that you can safely replace the line with "|| ch == '\u20AC'".
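A minimal sketch of that replacement, assuming the test is pulled out into a helper. The names TokenChars, EURO and isTokenChar are hypothetical and are not taken from BayesianAnalyzer.java; only the list of characters comes from the excerpt quoted in this thread. The point is that the escaped literal keeps the source file pure ASCII, so a tool that rewrites the file in a different charset has nothing to corrupt.

```java
final class TokenChars {
    /** Euro sign written as a Unicode escape, so this source file contains only ASCII bytes. */
    static final char EURO = '\u20AC';

    /** Hypothetical helper mirroring the character test quoted in this thread. */
    static boolean isTokenChar(char ch) {
        return Character.isLetter(ch)
                || ch == '-'
                || ch == '$'
                || ch == EURO
                || ch == '!'
                || ch == '\'';
    }

    public static void main(String[] args) {
        System.out.println(isTokenChar(EURO)); // the escaped literal is accepted
        System.out.println(isTokenChar(':'));  // a colon is not in the list
    }
}
```

The compiler translates the escape before it reads the character literal, so the comparison behaves the same as one written with a literal euro sign in a correctly decoded file.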
Stefano
Re: Encoding issue in BayesianAnalyzer?
Posted by Soren Hilmer <so...@tietoenator.com>.
Back to the issue. It is very likely to be fixcrlf causing the problem.
I used to run it on some test mails to ensure they ended in CRLF, but as you
have experienced, it changes the charset as well :-(
Luckily, you can add an encoding attribute to fixcrlf, as in this example:
<fixcrlf srcdir="${src}" includes="**/*.java" eol="lf" tab="remove"
tablength="4" encoding="ISO-8859-1"/>
I don't know if ISO-8859-1 is the correct one to use; maybe ISO-8859-15 is better.
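To illustrate why the choice of encoding matters, here is a rough sketch of what happens when a tool reads a file and writes it back using one charset. It assumes the euro sign is stored as the single byte 0x80 (its windows-1252 code), which this thread does not confirm, and the class and method names are hypothetical. It does not call Ant; it only models a decode followed by an encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTrip {
    /** Decode the bytes with cs and encode them again, as a read-then-rewrite tool would. */
    static byte[] roundTrip(byte[] input, Charset cs) {
        return new String(input, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Assumption: the quoted euro literal is stored as quote, 0x80, quote.
        byte[] original = { '\'', (byte) 0x80, '\'' };

        byte[] viaLatin1 = roundTrip(original, StandardCharsets.ISO_8859_1);
        byte[] viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte to exactly one char, so the bytes survive unchanged.
        System.out.println(Arrays.equals(original, viaLatin1));
        // A lone 0x80 is malformed UTF-8; it is decoded as U+FFFD and written back as three bytes.
        System.out.println(Arrays.equals(original, viaUtf8));
        System.out.println(viaUtf8.length);
    }
}
```

Note that ISO-8859-1 has no euro sign at all, while ISO-8859-15 places it at byte 0xA4. A single-byte charset that maps every byte value can still pass unknown bytes through unchanged, which is why the attribute can stop the corruption even when it does not name the file's true encoding.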
--Søren
On Wednesday 01 June 2005 00:55, Noel J. Bergman wrote:
> Stefano Bagnara wrote:
> > > IIRC, because it was intended to keep Windows users from
> > > putting incorrectly formatted files into source control, and
> > > to remove tabs.
> > >
> > > At least the latter should be addressed now by SVN.
> >
> > AFAIK the first is addressed by SVN.
>
> <<sigh>> That's actually what I meant. The tabs issue is not addressed by
> SVN.
>
> --- Noel
--
Søren Hilmer, M.Sc.
R&D manager Phone: +45 72 30 64 00
TietoEnator IT+ A/S Fax: +45 72 30 64 40
Ved Lunden 12 Direct: +45 72 30 64 57
DK-8230 Åbyhøj Email: soren.hilmer <at> tietoenator.com
RE: Encoding issue in BayesianAnalyzer?
Posted by "Noel J. Bergman" <no...@devtech.com>.
Stefano Bagnara wrote:
> > IIRC, because it was intended to keep Windows users from
> > putting incorrectly formatted files into source control, and
> > to remove tabs.
> >
> > At least the latter should be addressed now by SVN.
> AFAIK the first is addressed by SVN.
<<sigh>> That's actually what I meant. The tabs issue is not addressed by
SVN.
--- Noel
Re: Encoding issue in BayesianAnalyzer?
Posted by Stefano Bagnara <ap...@bago.org>.
> IIRC, because it was intended to keep Windows users from
> putting incorrectly formatted files into source control, and
> to remove tabs.
>
> At least the latter should be addressed now by SVN.
AFAIK the first is addressed by SVN. You can set per-file properties to tell
svn how to handle line endings: "svn:eol-style = native" (DOS newlines on DOS
clients, Unix newlines on Unix clients), always DOS (svn:eol-style = CRLF), or
always Unix (svn:eol-style = LF).
You can also record the executable bit (svn:executable).
Stefano
RE: Encoding issue in BayesianAnalyzer?
Posted by "Noel J. Bergman" <no...@devtech.com>.
Stefano Bagnara wrote:
> Why do we run <fixcrlf> on the original source code and not directly in the
> working copy, as we do with <replace>?
IIRC, because it was intended to keep Windows users from putting incorrectly
formatted files into source control, and to remove tabs.
At least the latter should be addressed now by SVN.
--- Noel
Re: Encoding issue in BayesianAnalyzer?
Posted by Stefano Bagnara <ap...@bago.org>.
> Now, we run <fixcrlf> during the build process. Could Ant be
> corrupting the file?
Why do we run <fixcrlf> on the original source code and not directly in the
working copy, as we do with <replace>?
Stefano
RE: Encoding issue in BayesianAnalyzer?
Posted by "Noel J. Bergman" <no...@devtech.com>.
Stefano Bagnara wrote:
> I think the code you are pointing to is
Yes, that's it. And I am finding that something in our build process is corrupting it. I just did:
$ rm src/java/org/apache/james/util/BayesianAnalyzer.java
$ svn up
Restored 'src/java/org/apache/james/util/BayesianAnalyzer.java'
At revision 179287.
$ svn diff
$ ./build.sh clean dist-lite
$ svn diff
Index: src/java/org/apache/james/util/BayesianAnalyzer.java
===================================================================
--- src/java/org/apache/james/util/BayesianAnalyzer.java (revision 179287)
+++ src/java/org/apache/james/util/BayesianAnalyzer.java (working copy)
@@ -471,7 +471,7 @@
if (Character.isLetter(ch)
|| ch == '-'
|| ch == '$'
- || ch == ''
+ || ch == '�'
|| ch == '!'
|| ch == '\''
) {
Now, we run <fixcrlf> during the build process. Could Ant be corrupting the file?
> It is probably the euro character (€, Unicode \u20AC, I think).
> http://www.fileformat.info/info/unicode/char/20ac/index.htm
Perhaps the safest thing is to hex encode the character.
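As a rough sketch of what hex encoding could look like in practice, the helper below rewrites every non-ASCII character in a line of Java source as a Unicode escape. The class and method names are hypothetical and nothing like this is part of the James build; it only illustrates the idea.

```java
final class UnicodeEscaper {
    /** Replace each char above 0x7F with a backslash-u escape of four hex digits. */
    static String escapeNonAscii(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch < 0x80) {
                out.append(ch); // plain ASCII is copied through unchanged
            } else {
                out.append(String.format("\\u%04X", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The euro sign is written as an escape here so this example itself stays ASCII.
        System.out.println(escapeNonAscii("|| ch == '\u20AC'"));
    }
}
```

Working char by char means a character outside the Basic Multilingual Plane comes out as two escapes, one per surrogate, which is the form Java source expects.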
--- Noel
Re: Encoding issue in BayesianAnalyzer?
Posted by Stefano Bagnara <ap...@bago.org>.
> Can someone please take a look at BayesianAnalyzer.java? I
> keep getting:
>
> - || ch == ''
> + || ch == 'i??'
>
> when doing a diff. What is the encoded character supposed to be?
I think the code you are pointing to is:
    if (ch == ':') {
        String tokenString = token.toString() + ':';
        if (tokenString.equals("From:")
            || tokenString.equals("Return-Path:")
            || tokenString.equals("Subject:")
            || tokenString.equals("To:")
            ) {
            return tokenString;
        }
    }
    if (Character.isLetter(ch)
        || ch == '-'
        || ch == '$'
        || ch == '€'
        || ch == '!'
        || ch == '\''
        ) {
It is probably the euro character (€, Unicode \u20AC, I think).
http://www.fileformat.info/info/unicode/char/20ac/index.htm
Stefano