Posted to users@spamassassin.apache.org by David B Funk <db...@engineering.uiowa.edu> on 2016/05/13 17:44:40 UTC

Bayes duplicate message detection algorithm?

What algorithm does Bayes use to detect that it has already 'seen' a given 
message?

When I receive a bolus (say 40~60) of 'phish' messages from a compromised
Hotmail/gmail/yahoo account, they are mostly the same: the body and many
headers match, and only the recipients, Message-ID, Date, and a few Received
headers differ. If I feed all of them to Bayes, it learns only about 10% of
them; the other 90% are ignored as 'already seen'.

So how does Bayes decide that it has 'already seen' a given message when
it actually hasn't (it has only seen one that is -almost- identical)?

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: Bayes duplicate message detection algorithm?

Posted by RW <rw...@googlemail.com>.
On Fri, 13 May 2016 12:44:40 -0500 (CDT)
David B Funk wrote:

> What algorithm does Bayes use to detect that it has already 'seen' a
> given message?
> 
> When I receive a bolus (say 40~60) of 'phish' messages from a
> compromised Hotmail/gmail/yahoo account which are mostly the same
> (body, many headers same, only recipients, Message-ID, Date, and a
> few Received headers are different) if I feed all of them to Bayes,
> it will learn only about 10% of them, the other 90% will be ignored
> as 'already seen'.
> 
> So how does Bayes decide that it has 'already seen' a given message
> when it actually hasn't (it has already seen one that is -almost-
> identical).

It's a hash of part of the body and the date header.
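
As a rough illustration of that idea only (not the actual SpamAssassin code),
here is a minimal Python sketch that builds a 'seen' ID from the Date header
plus a leading portion of the body. The function name seen_id, the 1024-byte
cut-off, the NUL separator, and the choice of SHA-1 are all assumptions made
for the example; the real module may hash different fields or lengths.

```python
import hashlib
from email import message_from_string

# Hypothetical cut-off for "part of the body"; not taken from SpamAssassin.
PREFIX_BYTES = 1024


def seen_id(raw_message: str, prefix_bytes: int = PREFIX_BYTES) -> str:
    """Return a hex digest built from the Date header and a body prefix."""
    text = raw_message.replace("\r\n", "\n")

    # Only the Date header contributes; Message-ID, To, etc. are ignored.
    date = message_from_string(text).get("Date", "None")

    # Everything after the first blank line is treated as the body.
    _, _, body = text.partition("\n\n")
    body_prefix = body.encode("utf-8", "replace")[:prefix_bytes]

    digest = hashlib.sha1(date.encode("utf-8", "replace") + b"\0" + body_prefix)
    return digest.hexdigest()
```

Under these assumptions, two messages that share the same Date header and the
same hashed portion of the body get the same ID, even when their Message-ID
and recipients differ, so the second one would be treated as 'already seen'.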