Posted to dev@spamassassin.apache.org by Mark Martinec <Ma...@ijs.si> on 2009/02/10 20:31:39 UTC

Which Message-ID is supposed to go into Bayes 'seen' database?

I was under the impression that the bayes 'seen' database keeps
the original Message-ID header field values, and only resorts to
a generated value when no Message-ID is available in a message.

Yet the bayes 'seen' database only ever receives generated msgids,
never the original ones, even when they are available.
Is this intended behaviour (in SA 3.3)?

The sub get_msgid returns a list of msgids, typically exactly
two elements: the generated one and the original, e.g.:
  992d4f36a10d093b467251c336d5c297cfbc3a65@sa_generated
  8d2c019dbff4$dc1e4df8$a51fab64@apr.com

Then M::S::Plugin::Bayes::_learn_trapped picks out
only the first one and ignores the original Message-ID:

  $msgid = $msgid[0];
  ...
  $self->{store}->seen_put ($msgid, ($isspam ? 's' : 'h'));

Bug or feature?

  Mark

Re: Which Message-ID is supposed to go into Bayes 'seen' database?

Posted by Theo Van Dinter <fe...@apache.org>.
I'd have to go look at the mail archives, assuming we discussed it
in email and not just IRC, but I seem to recall it had to do with
mails coming in with the same Message-ID and sa-learn seeing them as the
same message, so the later ones were skipped and their tokens never got
learned.  Since we already generated ids for some mails, it was easy to
make that the default with some backward compatibility.

Digging through the code + svn logs a bit:

------------------------------------------------------------------------
r6733 | felicity | 2004-02-18 18:26:01 -0500 (Wed, 18 Feb 2004) | 1 line

bug 3055: spammers are using the same message id to get around bayes being
able to learn different messages.  make the hash message-id the default now,
but be backwards compatible with the seen db.
------------------------------------------------------------------------
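
To illustrate the behaviour that log entry describes, here is a minimal
sketch in Python. It is not SpamAssassin's Perl code. The hashing recipe
(Date, first Received, body), the function names and the dict standing in
for the 'seen' store are all assumptions made for the example, and it
leaves out relearning a message as the opposite class.

```python
import hashlib


def generated_msgid(date_header, first_received, body):
    """Hypothetical recipe: hash a couple of headers plus the body."""
    h = hashlib.sha1()
    for part in (date_header, first_received, body):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent fields cannot run together
    return h.hexdigest() + "@sa_generated"


def get_msgids(msg):
    """Return the generated id first, then the original Message-ID if any."""
    ids = [generated_msgid(msg.get("Date", ""), msg.get("Received", ""), msg["body"])]
    original = msg.get("Message-ID")
    if original:
        ids.append(original.strip("<>"))
    return ids


def learn(seen, msg, isspam):
    """Record msg in the 'seen' dict; return False if it was already learned."""
    flag = "s" if isspam else "h"
    ids = get_msgids(msg)
    # Look up every candidate id, so entries stored under the original
    # Message-ID by an older version are still recognised.
    if any(seen.get(i) == flag for i in ids):
        return False
    # Store only under the first (generated) id.
    seen[ids[0]] = flag
    return True


if __name__ == "__main__":
    seen = {}
    first = {
        "Message-ID": "<same@example.com>",
        "Date": "Tue, 10 Feb 2009 20:31:39 +0000",
        "Received": "from a.example by b.example",
        "body": "first spam body",
    }
    second = dict(first, body="second, different spam body")
    print(learn(seen, first, True))
    print(learn(seen, second, True))
    print(learn(seen, first, True))
```

In this sketch two messages that share a Message-ID but differ in content
get different generated ids, so both can be learned, while feeding the
same message twice is still detected.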

On Wed, Feb 11, 2009 at 10:17:44AM +0000, Justin Mason wrote:
> On Tue, Feb 10, 2009 at 19:37, Michael Parker <pa...@pobox.com> wrote:
> >
> > On Feb 10, 2009, at 1:31 PM, Mark Martinec wrote:
> >>
> >> Bug or feature?
> >
> > Feature.  Theo can talk more to this but I believe we wanted to standardize
> > on a generated id instead of using the header value since headers are easily
> > forged/duplicated even though the message wasn't the same.
> 
> yeah.  we should probably have added a comment to this effect I guess ;)
> 
> Also, Message-IDs are occasionally omitted; it's a SHOULD rather than
> a MUST in RFC 822.  this is bad practice, but it happens.  in that case
> we had to generate an ID anyway.
> 
> --j.


Re: Which Message-ID is supposed to go into Bayes 'seen' database?

Posted by Justin Mason <jm...@jmason.org>.
On Tue, Feb 10, 2009 at 19:37, Michael Parker <pa...@pobox.com> wrote:
>
> On Feb 10, 2009, at 1:31 PM, Mark Martinec wrote:
>>
>> Bug or feature?
>
> Feature.  Theo can talk more to this but I believe we wanted to standardize
> on a generated id instead of using the header value since headers are easily
> forged/duplicated even though the message wasn't the same.

yeah.  we should probably have added a comment to this effect I guess ;)

Also, Message-IDs are occasionally omitted; it's a SHOULD rather than
a MUST in RFC 822.  this is bad practice, but it happens.  in that case
we had to generate an ID anyway.

--j.

Re: Which Message-ID is supposed to go into Bayes 'seen' database?

Posted by Michael Parker <pa...@pobox.com>.
On Feb 10, 2009, at 1:31 PM, Mark Martinec wrote:
>
> Bug or feature?

Feature.  Theo can talk more to this but I believe we wanted to  
standardize on a generated id instead of using the header value since  
headers are easily forged/duplicated even though the message wasn't  
the same.

Michael