Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/04/06 12:08:15 UTC

[Bug 3244] New: Converting messages to a single encoding

http://bugzilla.spamassassin.org/show_bug.cgi?id=3244

           Summary: Converting messages to a single encoding
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: spamassassin
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: kaede.news@online.ru


I believe that converting messages to a single encoding can improve Bayes
accuracy for some languages and would also allow writing rules for national
languages.  For example, there are 3 common encodings for Russian: koi8-r,
windows-1251 and utf-8.  The situation is the same for many other languages.
Without converting messages to a single defined encoding, the Bayes database
will be populated with the same tokens in different encodings, some tokens will
probably be malformed (due to the tr/// operator in Bayes.pm), and it is
impossible to write rules matching phrases in languages other than English.
Here's a simple patch which converts messages to Perl's internal encoding
(utf-8).  It's not complete yet (I suspect there are many places in
SpamAssassin which may break on Unicode data), but it seems to work in most
cases.
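
To illustrate the duplicate-token problem described above, here is a minimal
sketch in Python (not the Perl patch itself, and not SpamAssassin code).  The
sample word, the ENCODINGS tuple and the to_unicode helper are hypothetical
names chosen for the example.  It encodes one Russian word in the three
encodings mentioned, then shows that decoding each variant with its declared
charset collapses them into a single token.

```python
# Illustration only: one Russian word in the three encodings named above.
WORD = "привет"  # hypothetical sample token

# Python codec names for koi8-r, windows-1251 and utf-8.
ENCODINGS = ("koi8_r", "cp1251", "utf_8")


def to_unicode(raw: bytes, charset: str) -> str:
    """Decode raw message bytes using the declared charset.

    Undecodable bytes are replaced instead of raising an error; this is an
    assumption about how one might handle mail with a wrong charset label.
    """
    return raw.decode(charset, errors="replace")


# Without conversion: the same word yields a different byte string per encoding.
raw_tokens = {enc: WORD.encode(enc) for enc in ENCODINGS}
distinct_raw = set(raw_tokens.values())

# With conversion: every variant decodes to the same Unicode token.
normalized = {to_unicode(raw, enc) for enc, raw in raw_tokens.items()}

print(len(distinct_raw))  # number of distinct byte-level tokens
print(len(normalized))    # number of distinct tokens after decoding
```

A tokenizer that works on the raw bytes would see three unrelated tokens here,
while one that works on the decoded text would see only one.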



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.