Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/03/06 13:40:54 UTC

Re: Passive Fingerprinting to feed filters

hi Chris --

the best way to feed extra info to Bayes is to add a header, containing
space-separated tokens, e.g.

    X-Syn-Print: win=64240 mss=1460 sackOK

SpamAssassin rules can easily match against a header like this, Bayes
will use it automatically during classification, and it is saved
persistently in the output message for later training.
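
As a rough illustration of turning a captured SYN into such a header, here is a minimal Python sketch. It assumes tcpdump-style lines like the ones quoted further down; the function names are made up for this example, and the choices to drop "nop" padding and the per-connection timestamp values are assumptions, not anything SpamAssassin requires.

```python
import re

# Matches the window size and the TCP option list in a tcpdump-style SYN line,
# e.g. "... win 64240 <mss 1460,nop,nop,sackOK> (DF) ..."
_SYN_RE = re.compile(r"win (\d+) <([^>]*)>")


def syn_tokens(tcpdump_line):
    """Return space-free tokens for Bayes, or [] if no SYN options are found."""
    m = _SYN_RE.search(tcpdump_line)
    if m is None:
        return []
    tokens = ["win=" + m.group(1)]
    for opt in m.group(2).split(","):
        parts = opt.split()
        if not parts or parts[0] == "nop":
            continue  # assumption: padding carries no useful signal
        name = parts[0]
        if name == "timestamp":
            # assumption: timestamp values differ per connection, so keep
            # only the fact that the option was present
            tokens.append(name)
        elif len(parts) > 1:
            tokens.append(name + "=" + parts[1])
        else:
            tokens.append(name)
    return tokens


def syn_print_header(tcpdump_line):
    """Format the tokens as a single header line (no trailing newline)."""
    return "X-Syn-Print: " + " ".join(syn_tokens(tcpdump_line))
```

Because each token is a single word, a Bayes tokenizer that splits on whitespace would see "win=64240" and "mss=1460" as separate features.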

You can either add the header upfront *before* the message is passed
to SpamAssassin, or use a SpamAssassin plugin and the put_metadata()
API, similar to how the RelayCountry plugin does it --
lib/Mail/SpamAssassin/Plugin/RelayCountry.pm in the SpamAssassin source
distribution.
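
For the first option (adding the header upfront), a minimal Python sketch of prepending a header to a raw message before it is handed to SpamAssassin might look like the following. The function name is hypothetical, and the sketch assumes the raw message begins directly with its header block (no mbox "From " line).

```python
def prepend_header(raw_message, name, value):
    """Return raw_message (bytes) with 'name: value' added as the first header."""
    # Refuse line breaks so a crafted value cannot inject extra headers.
    if any(c in value for c in "\r\n") or any(c in name for c in "\r\n:"):
        raise ValueError("header name/value must not contain line breaks")
    # Reuse the message's own line-ending convention.
    eol = b"\r\n" if b"\r\n" in raw_message else b"\n"
    return name.encode("ascii") + b": " + value.encode("ascii") + eol + raw_message
```

The result can then be passed to SpamAssassin in whatever way the mail setup already does; the original message bytes are left untouched after the new first line.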

--j.

Chris writes:
> Hi,
> 
> I've written a milter with the extra facility to gather the TCP SYN and
> the SYN's initial ACK from MTA connections to port 25 with a view to
> feeding this extra info into an anti-spam learning system.
> 
> From initial observation, I think it has an outstanding chance of
> great success - for example - I observed that 100% of the 4000 emails
> I received this weekend with the following TCP flags and sizings:-
>     win 64240 <mss 1460,nop,nop,sackOK>
> were spam.  "Common sense" tells us that some things are never
> supposed to be sending emails (eg: hacked routers), other things are
> very unlikely to be running a legitimate MTA (eg: a Windows XP PC) and
> others are highly likely to be legit (eg: a RedHat Enterprise server)
> 
> Is anyone interested in connecting this passive info-feed up to
> spamassassin?
> 
> I presently have two bits of perl code - the passive connection cache
> daemon, and the milter that - upon connection - gets the SYN/ACK from
> the daemon.  (My daemon also records the final FIN as well, although
> this would only be useful to sites that filter mail after having
> accepted it already).  I also have a growing "database" of passive
> data that can be fed to a learning system, with each email already
> having been classified as "definite spam", "very very unlikely to be
> spam", and "unknown, but very likely to be spam"
> 
> Here are some examples:
> 
> 1. Definite spam:  (caught by "Brightmail", sent not by our customers)
> 
> 03/06 01:37.27 30855 pm PaSv refused-spam:Yes <> => xxxx@xxxxx.com HOST=smtp01.bluespree.com [67.91.146.227:54790] HELO=smtp01.BlueSpree.com TLS=//// Cache(67.91.146.227.54790)=ASYN:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 62: 67.91.146.227.54790 > xx.xx.xx.xx.25: S [tcp sum ok] 928359215:928359215(0) win 16384 <mss 1460,nop,nop,sackOK> (DF) (ttl 115, id 51201, len 48)        ACK:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 60: 67.91.146.227.54790 > xx.xx.xx.xx.25: . [tcp sum ok] 928359216:928359216(0) ack 587049612 win 17520 (DF) (ttl 115, id 51209, len 40)
> 03/06 01:38.59 30855 pm PaSv refused-spam:Blocked Sender <xx...@xxxxxxxx.ch> => xxxxxxxx@xxx.xx.cn HOST=[201.67.216.25] [201.67.216.25:3287] HELO=bluemail.ch TLS=//// Cache(201.67.216.25.3287)=ASYN:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 62: 201.67.216.25.3287 > xx.xx.xx.xx.25: S [tcp sum ok] 3241967323:3241967323(0) win 65535 <mss 1452,nop,nop,sackOK> (DF) (ttl 116, id 21335, len 48) ACK:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 60: 201.67.216.25.3287 > xx.xx.xx.xx.25: . [tcp sum ok] 3241967324:3241967324(0) ack 686135845 win 65535 (DF) (ttl 116, id 21355, len 40)
> 03/06 01:39.20 30855 pm PaSv refused-spam:Blocked Sender <xx...@xxxxxxxxxxx.de> => xxxxxx@xxx.xxx.au HOST=142-217-112-233.telebecinternet.net [142.217.112.233:57953] HELO=knuddlteddy.de TLS=//// Cache(142.217.112.233.57953)=ASYN:0:14:f6:dc:b0:c0 0:8:2:a0:0:da 0800 62: 142.217.112.233.57953 > xx.xx.xx.xx.25: S [tcp sum ok] 902385560:902385560(0) win 64240 <mss 1460,nop,nop,sackOK> (DF) (ttl 119, id 44063, len 48)   ACK:0:14:f6:dc:b0:c0 0:8:2:a0:0:da 0800 60: 142.217.112.233.57953 > xx.xx.xx.xx.25: . [tcp sum ok] 902385561:902385561(0) ack 696775812 win 64240 (DF) (ttl 119, id 44069, len 40)
> 03/06 01:39.08 30855 pm PaSv refused-spam:Yes <xx...@xxx.com> => xxxxxxxx@xxxxxxx.com HOST=[87.69.103.68] [87.69.103.68:4811] HELO=xx.xx.xx.xx TLS=//// Cache(87.69.103.68.4811)=ASYN:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 62: 87.69.103.68.4811 > xx.xx.xx.xx.25: S [tcp sum ok] 472097603:472097603(0) win 65535 <mss 1360,nop,nop,sackOK> (DF) (ttl 110, id 53309, len 48)     ACK:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 60: 87.69.103.68.4811 > xx.xx.xx.xx.25: . [tcp sum ok] 472097604:472097604(0) ack 695762229 win 65535 (DF) (ttl 110, id 53341, len 40)
> 
> 2. Very probably spam:  (not caught by "Brightmail", but also not a
>                          customer with permission to use our mail
>                          system)
> 
> 03/06 01:40.46 30855 pm PaSv noncust:nonspam xxxxxx@xxxxxxxxx.com => xxxxx@xxxxxxxxx.com HOST=maildana.danareksa.com [202.158.10.99:19584] HELO=mail-gw.danareksa.com TLS=//// Cache(202.158.10.99.19584)=SSYN:0:14:f6:dc:b0:c0 0:8:2:a0:0:da 0800 78: 202.158.10.99.19584 > xx.xx.xx.xx.25: S [tcp sum ok] 212145248:212145248(0) win 16384 <mss 1460,nop,nop,sackOK,nop,wscale 0,nop,nop,timestamp 1508937671 0> (DF) (ttl 115, id 1987, len 64)
> 03/06 01:46.21 30855 pm PaSv noncust:nonspam xxxx_xxxxxxx@xx.com => xxxxxxx@xxxxxx-xxxx.com HOST=mlnyb902er.ml.com [199.43.54.100:56981] HELO=mlnyb902er.ml.com TLS=//// Cache(199.43.54.100.56981)=SSYN:0:14:f6:dc:b0:c0 0:8:2:a0:0:da 0800 74: 199.43.54.100.56981 > xx.xx.xx.xx.25: S [tcp sum ok] 672545219:672545219(0) win 5840 <mss 1460,sackOK,timestamp 1375514422 0,nop,wscale 2> (DF) (ttl 50, id 38645, len 60)
> 
> 3. Very unlikely to be spam:  (Our customers, or, DSN receipts
>                                relating to email we originated
>                                earlier, neither of which was caught by
>                                Brightmail)
> 
> 03/06 01:46.52 30855 pm PaSv cust:nonspam xxxxxxxxxxx@xxxxxxxxxxxxxx.com => xxxxxxxx@xxxxx-xxxxxxxx.com HOST=list.opisnet.com [198.6.95.10:13985] HELO=ucgsmtpfw1.ucg.com TLS=//// Cache(198.6.95.10.13985)=ASYN:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 62: 198.6.95.10.13985 > xx.xx.xx.xx.25: S [tcp sum ok] 4066464733:4066464733(0) win 16384 <mss 1460,nop,nop,sackOK> (DF) (ttl 116, id 5604, len 48)      ACK:0:17:cb:19:f2:b9 0:8:2:a0:0:da 0800 60: 198.6.95.10.13985 > xx.xx.xx.xx.25: . [tcp sum ok] 4066464734:4066464734(0) ack 1175379631 win 17520 (DF) (ttl 116, id 5617, len 40)
> 03/06 01:47.42 30855 pm PaSv cust:nonspam xxxxxxxx@xxxxxxx.net => xxxxxx.xxxxxxxx@xxxx.org HOST=sccrmhc15.comcast.net [204.127.200.85:46523] HELO=sccrmhc15.comcast.net TLS=//// Cache(204.127.200.85.46523)=SSYN:0:14:f6:dc:b0:c0 0:8:2:a0:0:da 0800 78: 204.127.200.85.46523 > xx.xx.xx.xx.25: S [tcp sum ok] 2091700940:2091700940(0) win 32850 <nop,wscale 1,nop,nop,timestamp 709867016 0,nop,nop,sackOK,mss 1460> (DF) (ttl 53, id 6350, len 64)
> 03/06 01:47.53 30855 pm PaSv cust:nonspam xxxxxxxxxxxxxxx@xxxxxxx.net => xxxxxxxxxxxxxxx@xxxxxxx.net HOST=rwcrmhc15.comcast.net [204.127.192.85:36708] HELO=rwcrmhc15.comcast.net TLS=//// Cache(204.127.192.85.36708)=SSYN:0:14:f6:dc:b0:c0 0:8:2:a0:0:da 0800 78: 204.127.192.85.36708 > xx.xx.xx.xx.25: S [tcp sum ok] 2924519936:2924519936(0) win 32850 <nop,wscale 1,nop,nop,timestamp 407323383 0,nop,nop,sackOK,mss 1460> (DF) (ttl 55, id 39701, len 64)
> 
> 
> Hopefully nothing wrapped those long lines above!!
> 
> The reason I'm posting this here is because I know Email, and I know
> TCP/IP, but I don't know neural-networks or bayesian tech...
> 
> Let me know if you want a copy of the code or data to work on.  It's
> in perl, and runs on any Linux machine (or Unix with some small
> mods)
> 
> (-; Chris.

Re: Passive Fingerprinting to feed filters

Posted by Andy Fiddaman <cl...@fiddaman.net>.
On Thu, 8 Mar 2007, Michael Monnerie wrote:

; On Tuesday, 6 March 2007 13:40 Justin Mason wrote:
; > X-Syn-Print: win=64240 mss=1460 sackOK
;
; Sounds very interesting, I'm willing to test it. Headers are always
; nice, making it easily readable.

I somehow missed the original post about this but I'm very interested too.
I already pass p0f (http://lcamtuf.coredump.cx/p0f.shtml) metrics to SA
as headers so that Bayes can use them.

Hop count is a metric which seems a good indicator as well. p0f guesses at
the distance of the remote host based on the TTL in the packet and the
guessed source OS. I put the hop count range (e.g. 10-15) into the header
for Bayes to use.
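
To make the idea concrete, here is a minimal Python sketch of a TTL-based hop estimate. It is not p0f's actual algorithm: instead of using the guessed OS, it assumes the sender started from the nearest common initial TTL (32, 64, 128 or 255), and the bucket width of 5 and the "10-15" label style are likewise assumptions chosen for this example.

```python
# Assumption: senders start from one of these common initial TTL values.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)


def estimate_hops(observed_ttl):
    """Guess the hop count as (nearest common initial TTL) - (observed TTL)."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial - observed_ttl
    raise AssertionError("unreachable: 255 is the maximum TTL")


def hop_bucket(observed_ttl, width=5):
    """Return a range label such that lower <= hops < lower + width."""
    hops = estimate_hops(observed_ttl)
    lower = (hops // width) * width
    return "%d-%d" % (lower, lower + width)
```

Bucketing keeps the number of distinct tokens small, so Bayes sees enough messages per token to learn from them.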

Some very rough stats based on the last 60 days' messages (p0f_dist is the
hop count reported by p0f)....

mysql> select avg(spam_score) from log where p0f_dist >= 10;
+-----------------+
| avg(spam_score) |
+-----------------+
|         7.00293 |
+-----------------+
1 row in set (6.47 sec)

mysql> select avg(spam_score) from log where p0f_dist > 2 and p0f_dist < 10;
+-----------------+
| avg(spam_score) |
+-----------------+
|         3.73930 |
+-----------------+
1 row in set (6.28 sec)

A.

Re: Passive Fingerprinting to feed filters

Posted by Michael Monnerie <mi...@it-management.at>.
On Tuesday, 6 March 2007 13:40 Justin Mason wrote:
> X-Syn-Print: win=64240 mss=1460 sackOK

Sounds very interesting, I'm willing to test it. Headers are always
nice, making it easily readable.

mfg zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:        "curl -s http://zmi.at/zmi4.asc | gpg --import"
// Fingerprint: EA39 8918 EDFF 0A68 ACFB  11B7 BA2D 060F 1C6F E6B0
// Keyserver: www.keyserver.net                   Key-ID: 1C6FE6B0