Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2011/06/23 19:26:55 UTC

[Bug 6625] New: Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

             Bug #: 6625
           Summary: Bayes SQL schema treats bayes_token.token as char
                    instead of binary, fails chset checks
           Product: Spamassassin
           Version: 3.3.2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Documentation
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: Mark.Martinec@ijs.si
    Classification: Unclassified


Panagiotis Christias wrote on the SA user ML on 2011-06-21

> I faced the same problem today. In my case, MySQL was configured to
> use utf8 by default:
> 
>   # my.cnf
>   [client]
>   default-character-set=utf8
>   [mysqld]
>   character-set-server=utf8
>   collation-server=utf8_unicode_ci
>   init_connect='set collation_connection = utf8_unicode_ci;'
> 
> After commenting out the utf8 definitions and reverting back
> to latin1 "sa-learn --restore" worked fine.

As it turns out this is not the same problem as Bug 6624,
but an entirely independent one.


Lawrence writes on 2011-06-22:

> Ignore my last suggestion of starting from scratch. Try commenting out 
> these lines (or similar ones) if present in /etc/my.cnf and restarting 
> MySQL before attempting again
> 
> default-character-set=utf8
> character-set-server=utf8
> collation-server=utf8_unicode_ci
> init_connect='set collation_connection = utf8_unicode_ci;'



Benny Pedersen posted his SQL schema, which fixes the underlying
problem instead of masking it:

> CREATE TABLE IF NOT EXISTS `bayes_token` (
>    `id` int(11) NOT NULL DEFAULT '0',
>    `token` binary(5) NOT NULL,
> ...
> );


I commented:

> Yes, the binary or varbinary is the key to a solution here.
> Mucking with utf-8 vs latin-1 is just covering but not solving
> the most glaring problem here, namely that a token must not be
> associated with any character set, as it does not obey any
> such rules, nor should it be treated case-insensitively
> (as char is, which is possibly a reason for more than two
> record changes as reported by Dave). Will take a closer look...


So in summary: as the bayes_token.token field receives just
plain octets (binary data, not ASCII or any other characters),
it must not be associated with any character set in SQL.
Declaring the column as char or varchar may make the SQL server
check the data for compliance with the chosen charset, and it
implies a collation, which with the usual defaults means
case-insensitive matching (another hidden bug here). As it
happens, MySQL is stricter with UTF-8 checks but rather lax with
Latin-1 checks, so the suggested workaround (avoiding UTF-8) is
a poor one: it happens to work most of the time with current
versions of MySQL but may break at any time (e.g. control chars
are not valid Latin-1 characters).
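
The case-sensitivity point can be checked directly from a MySQL
client. The following is only a minimal sketch, assuming a server
and connection whose collation is case-insensitive (a *_ci
collation such as utf8_unicode_ci from the my.cnf quoted above).
The table name token_demo and the sample values are hypothetical
and are not part of the SpamAssassin schema:

```sql
-- Character strings are compared according to the collation in effect,
-- so under a case-insensitive (_ci) collation these two count as equal.
SELECT 'aBcDe' = 'ABCDE' AS char_equal;

-- Binary strings are compared octet by octet, so the same values differ.
SELECT CAST('aBcDe' AS BINARY) = CAST('ABCDE' AS BINARY) AS binary_equal;

-- Hypothetical scratch table: a binary(5) key keeps both rows apart,
-- whereas a char(5) key under a _ci collation would treat the second
-- value as a duplicate of the first.
CREATE TABLE token_demo (
  token binary(5) NOT NULL,
  PRIMARY KEY (token)
);
INSERT INTO token_demo (token) VALUES ('aBcDe'), ('ABCDE');
DROP TABLE token_demo;
```

With a char column, two distinct tokens whose octets happen to
differ only in letter case would land on the same key, which is
the hidden bug mentioned above.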

Btw, the bayes_pg.sql schema for PostgreSQL already has this fix!

Attached is a trivial but essential fix for bayes_mysql.sql.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6625] Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.4.0


[Bug 6625] Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

--- Comment #1 from Mark Martinec <Ma...@ijs.si> 2011-06-23 17:30:52 UTC ---
trunk:
  Bug 6625: Bayes SQL schema treats bayes_token.token as char
    instead of binary, fails chset checks
Sending sql/bayes_mysql.sql
Committed revision 1139007.


[Bug 6625] Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

--- Comment #2 from Mark Martinec <Ma...@ijs.si> 2011-06-23 17:31:50 UTC ---
Created attachment 4925
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=4925
Change a data type of bayes_token.token to binary


[Bug 6625] Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #3 from Mark Martinec <Ma...@ijs.si> 2011-09-21 00:31:52 UTC ---
closing, fixed for 3.4
