Posted to users@spamassassin.apache.org by ha...@t-online.de on 2006/12/04 07:20:48 UTC

Re: New Rule: OE_MULTIPART_RELATED

>> 
>> Hello list,
>> 
>> For your consideration:
>> 
>> header __MULTIPART_RELATED Content-Type =~ /multipart\/related/
>> 
>> meta OE_MULTIPART_RELATED (__OE_MUA && __MULTIPART_RELATED)
>> describe OE_MULTIPART_RELATED Possible image spam forged as from MS Outlook
>> 
>> The false positive rate on my corpus is 0.1%. I can't tell you about the false 
>> negative rate since I don't keep my spam (only my ham).
>> 
>> This rule works very well on the pump-and-dump image spam that has been 
>> escaping my spamassassin installation for the last few months. Although 
>> Outlook Express is capable of generating messages with multipart/related MIME 
>> type, it only does that if the user creates an HTML message with inline 
>> images. This happens occasionally but rarely (hence the 0.1%). I expect the 
>> perceptron might give this rule a score of perhaps +0.5, which is not enough 
>> to catch the pump-and-dump image spam by itself, but works well in 
>> conjunction with Mail::SpamAssassin::Plugin::ImageInfo.
>> 
>> Thoughts on this rule?
>> 
>> --Ian Turner
>> 
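
For readers who want to try the logic outside SpamAssassin, here is a minimal Python sketch of what the proposed meta rule checks. The Content-Type pattern is copied from the rule above. The definition of __OE_MUA is not shown in this thread, so the X-Mailer pattern below is only an assumed stand-in for it, and the function name is made up for illustration.

```python
import re
from email import message_from_string

# Same pattern as the __MULTIPART_RELATED sub-rule in the proposal.
MULTIPART_RELATED = re.compile(r"multipart/related")

# ASSUMPTION: a stand-in for __OE_MUA, whose real definition is not
# quoted in this thread. It only looks for the product name in X-Mailer.
OE_MUA_GUESS = re.compile(r"Outlook Express")


def oe_multipart_related(raw_message: str) -> bool:
    """Return True when both sub-conditions of the meta rule match."""
    msg = message_from_string(raw_message)
    content_type = msg.get("Content-Type", "")
    mailer = msg.get("X-Mailer", "")
    return bool(MULTIPART_RELATED.search(content_type)) and bool(
        OE_MUA_GUESS.search(mailer)
    )
```

As the reply below points out, a genuine Outlook Express message with inline images or stationery satisfies both conditions too, so a hit is only a hint and not proof of forgery.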

Hi Ian,

This would trap mail that uses Outlook "stationery".

I don't really like it, but I do get it in wanted mail.
Generally, I believe that rules scoring valid uses of mail (cid addressing, MIME types) should
be avoided, unless you want to block, e.g., all mail with images or all mail sent from Outlook.

Rather, try to find a subtle difference between the way real Outlook builds the message and the
way the spammers do it, one that would really reveal that the message is not from Outlook.

Wolfgang Hamann


Re: New Rule: OE_MULTIPART_RELATED

Posted by Ian Turner <ve...@vectro.org>.
Followup on my earlier message...

On Monday 04 December 2006 11:11, Ian Turner wrote:
> Yup. All of the FPs in my corpus are outlook messages with inline images.
> But it turns out that some of those are also spam; the actual FP rate is

The actual FP rate, after removing the spam that had been misfiled as ham 
(i.e., after corpus cleaning), is 4 messages in 4773, or 0.08%.
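
The arithmetic behind that figure can be checked with a short sketch. The two counts are the ones given above; the variable names are only illustrative.

```python
# Counts taken from the message: 4 false positives in a corpus of 4773.
false_positives = 4
corpus_size = 4773

# False positive rate as a percentage of the corpus.
fp_rate_percent = 100 * false_positives / corpus_size

print(f"{fp_rate_percent:.2f}%")  # prints 0.08%
```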

> That's what I'm trying to do, but this particular spammer seems to have
> been very careful (or really used outlook to generate the message) -- it
> seems to match exactly, at least at the MIME and RFC822 layers. I'm looking
> into HTML now.

A careful review of HTML messages from this class of spam and HTML messages 
from my corpus reveals nothing distinctive about the spam; the message 
template was almost certainly generated using Outlook Express itself. The 
rule I've already suggested (OE_MULTIPART_RELATED) is the most distinctive 
aspect I can find, barring any analysis of the image itself (which I leave to 
the ImageInfo or OCR plugins).

Cheers,

--Ian

Re: New Rule: OE_MULTIPART_RELATED

Posted by Ian Turner <ve...@vectro.org>.
On Monday 04 December 2006 16:19, John D. Hardin wrote:
> On Mon, 4 Dec 2006, Ian Turner wrote:
> > When used in combination with, say, DC_GIF_UNO_LARGO,
> > RCVD_IN_NJABL_DUL, and RCVD_IN_BL_SPAMCOP_NET, this rule can help
> > make a more solid prediction.
>
> The perceptron doesn't create meta rules, does it?

Nope, although you can always create them and see what score it gives them. 
But what I actually meant when I said "in combination" was not meta rules, 
but simply the sum-of-scores rule aggregation that spamassassin already does. 
Each of the rules may provide the suggestion of spam, but most rules are not 
scored high enough to mark an e-mail as spam on their own -- several rules 
must match in order to make a "spam" decision.
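
A minimal sketch of that sum-of-scores idea, in Python rather than in SpamAssassin itself. The +0.5 for OE_MULTIPART_RELATED is the guess from the original proposal. The scores for the other three rules are hypothetical values chosen for illustration, not the real shipped scores, and the 5.0 threshold assumes the default required score.

```python
def classify(hit_scores, threshold=5.0):
    """Sum the scores of all rules that hit and compare with the threshold.

    hit_scores maps rule name -> score for every rule that matched.
    threshold assumes the default required score of 5.0.
    """
    total = sum(hit_scores.values())
    return total, total >= threshold


# HYPOTHETICAL scores for illustration only; real scores differ.
other_rules = {
    "DC_GIF_UNO_LARGO": 1.8,
    "RCVD_IN_NJABL_DUL": 1.4,
    "RCVD_IN_BL_SPAMCOP_NET": 1.5,
}

# Without the new rule the message stays under the threshold.
total_without, spam_without = classify(other_rules)

# With the new rule at the guessed +0.5, the same message crosses it.
with_new_rule = dict(other_rules, OE_MULTIPART_RELATED=0.5)
total_with, spam_with = classify(with_new_rule)

print(spam_without, spam_with)
```

No single rule here scores high enough to decide the outcome alone; the small extra score only matters because it is added to the others.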

Cheers,

--Ian Turner

Re: New Rule: OE_MULTIPART_RELATED

Posted by "John D. Hardin" <jh...@impsec.org>.
On Mon, 4 Dec 2006, Ian Turner wrote:

> When used in combination with, say, DC_GIF_UNO_LARGO,
> RCVD_IN_NJABL_DUL, and RCVD_IN_BL_SPAMCOP_NET, this rule can help
> make a more solid prediction.

The perceptron doesn't create meta rules, does it?

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79


Re: New Rule: OE_MULTIPART_RELATED

Posted by Ian Turner <ve...@vectro.org>.
On Monday 04 December 2006 01:20, hamann.w@t-online.de wrote:
> this would trap mail using outlook "stationery".
> I dont really like it, but I get it in wanted mail.

Yup. All of the FPs in my corpus are outlook messages with inline images. But 
it turns out that some of those are also spam; the actual FP rate is 

> Generally I believe that rules scoring valid use of mail (cid addressing,
> mime types) should be avoided

Actually, I disagree -- we already have lots of rules that match valid use of 
mail, such as CHARSET_FARAWAY, DOMAIN_RATIO, NO_REAL_NAME, TO_EMPTY, and 
nearly all of the SUBJ_ rules.

A spamassassin rule need not stand alone; it still has predictive power when 
used in combination with other rules, as long as it shows a statistically 
significant difference in spam/ham hit-rates. We use the perceptron to figure 
out exactly /how much/ predictive power it has.

When used in combination with, say, DC_GIF_UNO_LARGO, RCVD_IN_NJABL_DUL, and 
RCVD_IN_BL_SPAMCOP_NET, this rule can help make a more solid prediction.

> Rather try to find a subtle difference in the way real outlook builds the
> message and the spammers do it, that would really reveal it is not from
> outlook

That's what I'm trying to do, but this particular spammer seems to have been 
very careful (or really used outlook to generate the message) -- it seems to 
match exactly, at least at the MIME and RFC822 layers. I'm looking into HTML 
now.

Cheers,

--Ian