Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2017/07/16 13:11:57 UTC

[Bug 5185] Bayesian learning uses different message checksums during exiscan_acl and later sa_learn

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5185

Ian Turner <ve...@vectro.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vectro@vectro.org

--- Comment #29 from Ian Turner <ve...@vectro.org> ---
This is not resolved in 3.4.0. There is still a problem with the newline
removal.

The problem is that we take the first half of the body before removing
newlines. Thus the portion of the body we keep can differ depending on the
line endings.

Consider these two messages:
"ABCDEFG\r\n\r\n"
"ABCDEFG\n\n"
They are the same except for newlines but yield different results under the
current algorithm.
"ABCDEFG\r\n\r\n" -> "ABCDE"
"ABCDEFG\n\n" -> "ABCD"
A similar problem could occur when taking the first kilobyte of the message.
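
To make the example above concrete, here is a minimal Python sketch of the
truncate-then-strip order. It is only a simplified model of the behavior
described in this comment, not the actual SpamAssassin code; the function name
half_then_strip is hypothetical.

```python
def half_then_strip(body: bytes) -> bytes:
    """Hypothetical model: keep the first half of the raw body,
    then remove CR and LF bytes from that half."""
    half = body[: len(body) // 2]  # the cutoff is computed on the raw bytes
    return half.replace(b"\r", b"").replace(b"\n", b"")


crlf = b"ABCDEFG\r\n\r\n"  # 11 bytes, so the first 5 bytes are kept
lf = b"ABCDEFG\n\n"        # 9 bytes, so the first 4 bytes are kept

print(half_then_strip(crlf))  # b'ABCDE'
print(half_then_strip(lf))    # b'ABCD'
```

The two inputs differ only in line endings, yet the cutoff lands at a different
character because the line-ending bytes are counted when the length is
measured.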

The issue only presents itself for certain patterns of messages and newlines,
obviously.

Removing the newlines before truncating the message works, but I'm sure there's
a more efficient fix available. Perhaps scan the message until 1 KB of
non-newline characters has been collected?
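
A rough Python sketch of that idea follows, assuming that "newline" means CR
and LF bytes and that the limit is 1024 bytes. The function name
first_non_newline_bytes and its limit parameter are hypothetical and are not
taken from the SpamAssassin source.

```python
def first_non_newline_bytes(body: bytes, limit: int = 1024) -> bytes:
    """Hypothetical helper: collect up to `limit` bytes of `body`,
    skipping CR and LF, and stop scanning once the limit is reached."""
    out = bytearray()
    for byte in body:
        if len(out) >= limit:
            break  # enough non-newline bytes collected; skip the rest
        if byte in b"\r\n":
            continue  # line-ending bytes never count toward the limit
        out.append(byte)
    return bytes(out)


# Both line-ending variants now give the same prefix.
print(first_non_newline_bytes(b"ABCDEFG\r\n\r\n", limit=4))  # b'ABCD'
print(first_non_newline_bytes(b"ABCDEFG\n\n", limit=4))      # b'ABCD'
```

This gives the same result as stripping the whole message first and then
truncating, without copying the full body. The "half of the body" case would
additionally need the count of non-newline bytes before the cutoff can be
chosen.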

Happy to report this as a new bug if you think this one is too stale.
