Posted to users@spamassassin.apache.org by David B Funk <db...@engineering.uiowa.edu> on 2016/05/13 17:44:40 UTC
Bayes duplicate message detection algorithm?
What algorithm does Bayes use to detect that it has already 'seen' a given
message?
When I receive a bolus (say 40 to 60) of 'phish' messages from a compromised
Hotmail/Gmail/Yahoo account that are mostly the same (same body and many
headers; only the recipients, Message-ID, Date, and a few Received headers
differ) and feed all of them to Bayes, it learns only about 10% of them. The
other 90% are ignored as 'already seen'.
So how does Bayes decide that it has 'already seen' a given message when
it actually hasn't (it has only seen one that is almost identical)?
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Bayes duplicate message detection algorithm?
Posted by RW <rw...@googlemail.com>.
On Fri, 13 May 2016 12:44:40 -0500 (CDT)
David B Funk wrote:
> What algorithm does Bayes use to detect that it has already 'seen' a
> given message?
>
> When I receive a bolus (say 40 to 60) of 'phish' messages from a
> compromised Hotmail/Gmail/Yahoo account that are mostly the same
> (same body and many headers; only the recipients, Message-ID, Date,
> and a few Received headers differ) and feed all of them to Bayes, it
> learns only about 10% of them. The other 90% are ignored as
> 'already seen'.
>
> So how does Bayes decide that it has 'already seen' a given message
> when it actually hasn't (it has only seen one that is almost
> identical)?
It's a hash of the Date header and part of the body, so messages that
share both get the same ID and are treated as 'already seen'.
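
To illustrate why near-identical messages can collide under that kind of
scheme, here is a minimal Python sketch. It is not SpamAssassin's code: the
function name dedup_id, the choice of SHA-1, the NUL separator, and the
1024-byte body prefix are all assumptions made for the example. Only the idea
(hash the Date header plus part of the body) comes from the answer above.

```python
import hashlib
from email import message_from_string

# Assumed cut-off for "part of the body"; the real value may differ.
PREFIX_BYTES = 1024


def dedup_id(raw_message: str) -> str:
    """Return a hypothetical 'already seen' ID for a raw RFC 822 message."""
    msg = message_from_string(raw_message)
    date = msg.get("Date") or "None"  # placeholder when Date is missing
    # Everything after the first blank line is treated as the body.
    _, _, body = raw_message.partition("\n\n")
    head = body.encode("utf-8")[:PREFIX_BYTES]
    # Recipients, Message-ID and Received headers never enter the hash.
    return hashlib.sha1(date.encode("utf-8") + b"\0" + head).hexdigest()


TEMPLATE = (
    "From: victim@example.com\n"
    "To: {rcpt}\n"
    "Message-ID: <{mid}@example.com>\n"
    "Date: {date}\n"
    "Subject: Account notice\n"
    "\n"
    "Please verify your account.\n"
)

same_second = "Fri, 13 May 2016 12:00:00 -0500"
a = TEMPLATE.format(rcpt="a@example.org", mid="1", date=same_second)
b = TEMPLATE.format(rcpt="b@example.org", mid="2", date=same_second)
c = TEMPLATE.format(rcpt="c@example.org", mid="3",
                    date="Fri, 13 May 2016 12:00:01 -0500")

print(dedup_id(a) == dedup_id(b))  # True: same Date and body prefix
print(dedup_id(a) == dedup_id(c))  # False: Date differs by one second
```

Under these assumptions, copies of a message sent within the same second
share an ID even though their recipients and Message-IDs differ, which would
be consistent with a burst of phish being mostly skipped as already seen.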