Posted to server-dev@james.apache.org by "Noel J. Bergman" <no...@devtech.com> on 2005/05/31 23:24:02 UTC

Encoding issue in BayesianAnalyzer?

Can someone please take a look at BayesianAnalyzer.java?  I keep getting:

-            || ch == ''
+            || ch == 'i??'

when doing a diff.  What is the encoded character supposed to be?

	--- Noel

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org


Re: Encoding issue in BayesianAnalyzer?

Posted by Stefano Bagnara <ap...@bago.org>.
> Yes, that's it.  And I am finding that something in our build 
> process is corrupting it.  I just did:
> 
>  $ rm src/java/org/apache/james/util/BayesianAnalyzer.java
>  $ svn up
>  Restored 'src/java/org/apache/james/util/BayesianAnalyzer.java'
>  At revision 179287.
>  $ svn diff
>  $ ./build.sh clean dist-lite
>  $ svn diff
>  Index: src/java/org/apache/james/util/BayesianAnalyzer.java
>  ===================================================================
>  --- src/java/org/apache/james/util/BayesianAnalyzer.java        (revision 179287)
>  +++ src/java/org/apache/james/util/BayesianAnalyzer.java        (working copy)
>  @@ -471,7 +471,7 @@
>               if (Character.isLetter(ch)
>               || ch == '-'
>               || ch == '$'
>  -            || ch == ''
>  +            || ch == '�'
>               || ch == '!'
>               || ch == '\''
>               ) {
> 
> Now, we run <fixcrlf> during the build process.  Could Ant
> be corrupting the file?
> 
> > It probably is the EURO character (€, unicode \u20AC i think).
> > http://www.fileformat.info/info/unicode/char/20ac/index.htm
> 
> Perhaps the safest thing is to hex encode the character.

The build script does not corrupt my file, but my default charset supports "€" in a non-Unicode encoding.
I agree that you can safely replace the line with "|| ch == '\u20AC'".
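
A minimal sketch of what that replacement could look like, assuming the condition is pulled out into a helper. The class name EuroEscapeSketch and the method isTokenChar are hypothetical, not taken from BayesianAnalyzer.java; only the list of characters comes from the quoted code. Because the escape is plain ASCII, the source file no longer depends on the charset used to read or rewrite it.

```java
final class EuroEscapeSketch {

    // Hypothetical helper mirroring the condition quoted in this thread.
    static boolean isTokenChar(char ch) {
        return Character.isLetter(ch)
            || ch == '-'
            || ch == '$'
            || ch == '\u20AC' // euro sign, written as an ASCII-only Unicode escape
            || ch == '!'
            || ch == '\'';
    }

    public static void main(String[] args) {
        char euro = (char) 0x20AC;
        System.out.println(isTokenChar(euro)); // prints "true"
        System.out.println(isTokenChar('#'));  // prints "false"
    }
}
```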

Stefano




Re: Encoding issue in BayesianAnalyzer?

Posted by Soren Hilmer <so...@tietoenator.com>.
Back to the issue. It is very likely to be fixcrlf causing the problem.
I used to run it on some test mails to ensure they ended in CRLF, but as you 
have experienced, it changes the charset as well :-(

Luckily, you can add an encoding attribute to fixcrlf, as in this example:
<fixcrlf srcdir="${src}" includes="**/*.java" eol="lf" tab="remove" 
tablength="4" encoding="ISO-8859-1"/>

I don't know whether ISO-8859-1 is the correct one to use; maybe ISO-8859-15 is better.
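
To illustrate why the charset matters when a tool reads and rewrites a file, here is a small sketch of one possible mechanism. It assumes the euro sign was stored on disk as the single byte 0x80 (its windows-1252 encoding) and that the build machine's default charset was UTF-8; neither assumption is confirmed in this thread. The class CharsetRoundTripSketch and the method roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetRoundTripSketch {

    // Decode the bytes and encode them again with the same charset,
    // roughly what a tool does when it reads and rewrites a text file.
    static byte[] roundTrip(byte[] input, Charset charset) {
        String decoded = new String(input, charset);
        return decoded.getBytes(charset);
    }

    public static void main(String[] args) {
        // Assumed on-disk form of the euro sign (windows-1252).
        byte[] euro = { (byte) 0x80 };

        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        byte[] viaLatin1 = roundTrip(euro, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(euro, viaLatin1)); // prints "true"

        // 0x80 alone is malformed UTF-8, so it decodes to the replacement
        // character and is written back as three different bytes.
        byte[] viaUtf8 = roundTrip(euro, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(euro, viaUtf8));   // prints "false"
        System.out.println(viaUtf8.length);                 // prints "3"
    }
}
```

Under these assumptions, a single-byte charset such as ISO-8859-1 leaves the bytes untouched even if it is not the charset the file was written in, while a UTF-8 round trip replaces the byte with the "�" seen in the diff.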

--Søren

On Wednesday 01 June 2005 00:55, Noel J. Bergman wrote:
> Stefano Bagnara wrote:
> > > IIRC, because it was intended to keep Windows users from
> > > putting incorrectly formatted files into source control, and
> > > to remove tabs.
> > >
> > > At least the latter should be addressed now by SVN.
> >
> > AFAIK the first is addressed by SVN.
>
> <<sigh>> That's actually what I meant.  The tabs issue is not addressed by
> SVN.
>
> 	--- Noel

-- 
Søren Hilmer, M.Sc.
R&D manager             Phone:  +45 72 30 64 00
TietoEnator IT+ A/S     Fax:    +45 72 30 64 40
Ved Lunden 12           Direct: +45 72 30 64 57
DK-8230 Åbyhøj          Email:  soren.hilmer <at> tietoenator.com



RE: Encoding issue in BayesianAnalyzer?

Posted by "Noel J. Bergman" <no...@devtech.com>.
Stefano Bagnara wrote:

> > IIRC, because it was intended to keep Windows users from
> > putting incorrectly formatted files into source control, and
> > to remove tabs.
> >
> > At least the latter should be addressed now by SVN.

> AFAIK the first is addressed by SVN.

<<sigh>> That's actually what I meant.  The tabs issue is not addressed by
SVN.

	--- Noel




Re: Encoding issue in BayesianAnalyzer?

Posted by Stefano Bagnara <ap...@bago.org>.
> IIRC, because it was intended to keep Windows users from 
> putting incorrectly formatted files into source control, and 
> to remove tabs.
> 
> At least the latter should be addressed now by SVN.

AFAIK the first is addressed by SVN. You can set per-file properties to tell
svn which line endings to use: "svn:eol-style = native"
(DOS newlines on DOS clients, Unix newlines on Unix clients), or specify that
files are always DOS (svn:eol-style = CRLF) or always Unix (svn:eol-style = LF).
You can also record the executable bit (svn:executable).

Stefano




RE: Encoding issue in BayesianAnalyzer?

Posted by "Noel J. Bergman" <no...@devtech.com>.
Stefano Bagnara wrote:

> Why do we run <fixcrlf> on the original source code and not directly in the
> working copy, as we do with <replace>?

IIRC, because it was intended to keep Windows users from putting incorrectly
formatted files into source control, and to remove tabs.

At least the latter should be addressed now by SVN.

	--- Noel




Re: Encoding issue in BayesianAnalyzer?

Posted by Stefano Bagnara <ap...@bago.org>.
> Now, we run <fixcrlf> during the build process.  Could Ant
> be corrupting the file?

Why do we run <fixcrlf> on the original source code and not directly in the
working copy, as we do with <replace>?

Stefano




RE: Encoding issue in BayesianAnalyzer?

Posted by "Noel J. Bergman" <no...@devtech.com>.
Stefano Bagnara wrote:

> I think the code you are pointing to is

Yes, that's it.  And I am finding that something in our build process is corrupting it.  I just did:

 $ rm src/java/org/apache/james/util/BayesianAnalyzer.java 
 $ svn up
 Restored 'src/java/org/apache/james/util/BayesianAnalyzer.java'
 At revision 179287.
 $ svn diff
 $ ./build.sh clean dist-lite
 $ svn diff
 Index: src/java/org/apache/james/util/BayesianAnalyzer.java
 ===================================================================
 --- src/java/org/apache/james/util/BayesianAnalyzer.java        (revision 179287)
 +++ src/java/org/apache/james/util/BayesianAnalyzer.java        (working copy)
 @@ -471,7 +471,7 @@
              if (Character.isLetter(ch)
              || ch == '-'
              || ch == '$'
 -            || ch == ''
 +            || ch == '�'
              || ch == '!'
              || ch == '\''
              ) {

Now, we run <fixcrlf> during the build process.  Could Ant be corrupting the file?

> It probably is the EURO character (€, unicode \u20AC i think).
> http://www.fileformat.info/info/unicode/char/20ac/index.htm

Perhaps the safest thing is to hex encode the character.
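
As a rough sketch of how such an escape could be worked out for any character, the snippet below formats a char as a Java Unicode escape. The class UnicodeEscapeSketch and the method toJavaEscape are hypothetical names used only for illustration; the format string uses a doubled backslash so the compiler does not treat it as an escape itself.

```java
final class UnicodeEscapeSketch {

    // Format a UTF-16 code unit as a backslash, the letter u,
    // and four uppercase hex digits.
    static String toJavaEscape(char ch) {
        return String.format("\\u%04X", (int) ch);
    }

    public static void main(String[] args) {
        // 0x20AC is the euro sign; the cast keeps this file ASCII-only.
        char euro = (char) 0x20AC;
        String escape = toJavaEscape(euro);
        System.out.println(escape.length());  // prints "6"
        System.out.println(escape.endsWith("20AC")); // prints "true"
    }
}
```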

	--- Noel




Re: Encoding issue in BayesianAnalyzer?

Posted by Stefano Bagnara <ap...@bago.org>.
> Can someone please take a look at BayesianAnalyzer.java?  I 
> keep getting:
> 
> -            || ch == ''
> +            || ch == 'i??'
> 
> when doing a diff.  What is the encoded character supposed to be?

I think the code you are pointing to is:

if (ch == ':') {
    String tokenString = token.toString() + ':';
    if (tokenString.equals("From:")
    || tokenString.equals("Return-Path:")
    || tokenString.equals("Subject:")
    || tokenString.equals("To:")
    ) {
        return tokenString;
    }
}

if (Character.isLetter(ch)
|| ch == '-'
|| ch == '$'
|| ch == '€'
|| ch == '!'
|| ch == '\''
) {


It probably is the EURO character (€, unicode \u20AC i think).
http://www.fileformat.info/info/unicode/char/20ac/index.htm

Stefano

