Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/08/10 01:14:41 UTC

Re: Unrecognized encodings make text rules painfully slow and give FP

Mark -- can you mail a *real* sample?  private mail would be fine.

--j.

Mark Martinec writes:
> I recently noticed a couple of cases where SA (3.1.4 or earlier)
> would take over a minute (instead of a few seconds) to check a 500 kB
> message. Investigation revealed that these cases have one thing in common:
> these were all message/partial chunks of a longish transfer of some
> document or other data. Moreover, most of these cases were hitting
> random sets of SARE or baseline rules, yielding false positives.
> 
> In case someone would suggest that Content-Type: message/partial
> should be banned outright - well, it is a policy decision, and
> if allowed, should not bring SA to its knees on a 0.5 MB message.
> 
> Here is one example where a command-line 'spamassassin -t -D' would
> run for 68 seconds. Timestamping each debug line produces the
> following top-10 lines - sorted by elapsed time, first column
> is time in seconds for this line to appear after a previous one:
> 
> 1.935 dbg: rules: ran body rule SARE_RMML_Stock1 ======> got hit: "0TC"
> 2.204 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
> 3.695 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0il"
> 3.976 dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "i"
> 4.021 dbg: rules: running raw-body-text per-line regexp tests; score ... 
> 6.397 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " Sjx"
> 8.225 dbg: bayes: tok_get_all: token count: 37175
> 8.254 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
> 9.682 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
> 11.999 dbg: rules: running body-text per-line regexp tests; score so far=2.501
> 
> and another example:
> 
> 2.396 dbg: rules: ran body rule DISGUISE_PORN_MUNDANE ======> got hit: "b0y"
> 2.424 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
> 2.627 dbg: bayes: tok_get_all: token count: 36631
> 3.421 dbg: rules: running body-text per-line regexp tests; score so far=0.203
> 3.826 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0Il"
> 4.181 dbg: rules: running raw-body-text per-line regexp tests; score ... 
> 4.265 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " S8X"
> 8.113 dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "XoNOgX"
> 9.308 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
> 9.945 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
> 
> I know some of these are SARE rulesets, but some are baseline rules
> or bayes token parsing.
> 
> Here is a relevant section/sample of one of these messages:
> 
> MIME-Version: 1.0
> Content-Type: message/partial;
>         total=22;
>         id="01C6BB9C.7D698F00@zogica";
>         number=21
> X-Priority: 3
> X-MSMail-Priority: Normal
> X-Mailer: Microsoft Outlook Express 6.00.2900.2869
> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869
> 
> f6idzxqa608aID8+YhwNSQwBpIrboHA0/zPfOP26mB6eONz70Xl12DwGVnAPemaaKaJyQk5ZKUwg
> VC0sGYHLd543cICNa1piu8YgRJR0EaEK7GNVXvFSriat5dZwj7PNzQuOTO030bra7tBjROxbrVYR
> XFStjnugVkyH27zqrvUdUsHYnLaVLdUuAxWH51QDV9/kc6vtIURcdUbthPszq12lj7Lt7rMAtVX7
> 
> 
> So the problem is that these base64-encoded lines in a message/partial
> chunk are treated as obfuscated text, which is very slow, and produces
> almost random hits on various rules. It also places some burden on
> the SQL server (bayes: tok_get_all: token count: 37175).
> 
> 
> A somewhat similar case, which also hits various obfuscation rules
> because UU-encoded data is mistaken for plain text, is mail
> with attachments produced by Microsoft Office Outlook when the user
> has the following setting chosen:
> 
>   Tools -> Options -> Mail Format -> Internet format: plain text options:
>     (YES) Encode attachments in UUENCODE format
>           when sending a plain text message
> 
> It would be nice if such encodings were recognized, so that rules
> that expect plain text could at least be prevented from running
> and/or producing false hits.
> 
>   Mark
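
The recognition Mark asks for could be approximated with a simple heuristic. The sketch below is only an illustration of the idea, not how SpamAssassin handles message bodies: the function names, thresholds, and regular expressions are all assumptions made for this example. It flags a body whose non-empty lines are almost all long runs of base64 alphabet characters (as in the message/partial chunk quoted above), and separately looks for a uuencode "begin <mode> <name>" ... "end" block.

```python
import re

# Hypothetical helpers for illustration; none of these names come from SpamAssassin.
_B64_LINE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")
_UU_BEGIN = re.compile(r"^begin [0-7]{3,4} \S+")


def looks_like_base64(lines, min_lines=3, min_len=60, threshold=0.9):
    """Return True if nearly all non-empty lines look like base64 data.

    The thresholds are arbitrary guesses, chosen so that ordinary prose
    (which contains spaces and punctuation) does not match.
    """
    candidates = [ln.strip() for ln in lines if ln.strip()]
    if len(candidates) < min_lines:
        return False
    hits = sum(
        1
        for ln in candidates
        if len(ln) >= min_len and len(ln) % 4 == 0 and _B64_LINE.match(ln)
    )
    return hits / len(candidates) >= threshold


def contains_uuencode(lines):
    """Return True if a 'begin <mode> <name>' line is later followed by 'end'."""
    inside = False
    for ln in lines:
        ln = ln.rstrip("\r\n")
        if not inside:
            inside = bool(_UU_BEGIN.match(ln))
        elif ln == "end":
            return True
    return False
```

A filter could run checks like these once per message and skip plain-text body rules when either returns True. Whether that is safe against spammers who deliberately imitate encoded data is a separate policy question that this sketch does not address.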

Re: Unrecognized encodings make text rules painfully slow and give FP

Posted by Mark Martinec <Ma...@ijs.si>.
Justin,

> Mark -- can you mail a *real* sample?  private mail would be fine.

It is coming your way (private) in a minute or two.
It is a real sample, the only change I made is to
replace the From and To header, so that it looks like
a mail from me to you, subject: promotional video ...

Watch out, it takes 80 seconds of CPU-intensive processing
on our host, using SA 3.1.4 + sa-update + basic SARE rules.

Thanks for looking into it.

  Mark
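
For anyone wanting to reproduce the per-line timings shown in the original report, here is a minimal sketch of one way to timestamp debug output and list the lines that took longest to appear. It is not the script Mark used; the function names are made up for this example, and it assumes the debug lines arrive on standard input one at a time (pipe buffering can blur the measured gaps).

```python
import sys
import time
from typing import Iterable, Iterator, List, Tuple


def stamp_lines(stream: Iterable[str]) -> Iterator[Tuple[float, str]]:
    """Yield (arrival_time, line) for each line as it is read from the stream."""
    for line in stream:
        yield time.monotonic(), line.rstrip("\n")


def slowest_lines(
    stamped: Iterable[Tuple[float, str]], top: int = 10
) -> List[Tuple[float, str]]:
    """Return the `top` lines with the largest gap since the previous line.

    The result is sorted from the largest gap down. The first line has no
    predecessor, so it is never reported.
    """
    gaps = []
    previous = None
    for when, line in stamped:
        if previous is not None:
            gaps.append((when - previous, line))
        previous = when
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


def main() -> None:
    """Read debug lines from stdin and print the ten slowest, smallest gap first."""
    for gap, line in reversed(slowest_lines(stamp_lines(sys.stdin))):
        print(f"{gap:.3f} {line}")
```

With a call to main() added at the bottom of the file, this could be fed from something like "spamassassin -t -D < message 2>&1", assuming the debug output is written to standard error. The file name for the script and the message are up to the reader.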