Posted to dev@spamassassin.apache.org by Marc Perkel <ma...@perkel.com> on 2005/02/14 18:50:24 UTC

Spam and Ham have different headers - bayesian tricks

I'm continuing my experiments with a second bayesian filter - using 
spamprobe and controlling the tokens myself - and using SA to score the 
output.

So - I noticed that spam and ham often have different header fields. 
Some headers only show up in ham - and some headers only show up in 
spam. So I tokenized the header names themselves, fed just those names 
in as data, and got some really good results.

So - I don't know if SA is doing this, but tokenizing the header names 
(excluding the common ones that every message has) is very effective.
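
A minimal sketch of the idea, not the actual script used here: it reads one raw message, takes the header field names only, drops a stop list of common ones, and emits one token per remaining name. The function name header_name_tokens, the COMMON_HEADERS stop list and the sample message are all assumptions made up for illustration; the hdr_ prefix just mirrors the token style in the listings in this thread. How the tokens get handed to the bayesian filter is left out.

```python
from email import message_from_string

# Hypothetical stop list: header names assumed to appear on nearly every
# message, so they carry no ham/spam signal. Tune it against your own mail.
COMMON_HEADERS = {
    "received", "from", "to", "subject", "date", "message-id",
    "mime-version", "content-type", "content-transfer-encoding",
    "return-path",
}

def header_name_tokens(raw_message, prefix="hdr_"):
    """Return one token per distinct, non-common header name, sorted."""
    msg = message_from_string(raw_message)
    # Header names are case-insensitive, so fold them to lower case.
    names = {name.lower() for name in msg.keys()}
    return sorted(prefix + name for name in names - COMMON_HEADERS)

# Made-up sample message, for illustration only.
sample = (
    "Received: from mail.example.org by mx.example.com\n"
    "From: someone@example.org\n"
    "To: dev@example.com\n"
    "Subject: test\n"
    "Mail-Followup-To: dev@example.com\n"
    "X-Enigmail-Version: 0.90.0.0\n"
    "\n"
    "body text\n"
)

# Header values and the body are ignored; only the names become tokens.
print(" ".join(header_name_tokens(sample)))
```

The header values are deliberately thrown away, so the filter only learns which fields are present on a message.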

-- 
Marc Perkel - marc@perkel.com

Spam Filter: http://www.junkemailfilter.com
    My Blog: http://marc.perkel.com
My Religion: http://www.churchofreality.org
~ "If it's real - we believe in it!" ~



Re: Spam and Ham have different headers - bayesian tricks

Posted by Marc Perkel <ma...@perkel.com>.
Examples:

Ham Headers (columns: spam probability, ham count, spam count, flags, token):

0.0000018     786       0  0x00000395  hdr_article
0.0000019     731       0  0x00000395  hdr_x-yahoo-profile
0.0000026     535       0  0x00000395  hdr_x-virus-checked
0.0000027     518       0  0x00000395  hdr_x-asf-spam-status
0.0000048     289       0  0x00000395  hdr_x-egroups-approved-by
0.0000052     267       0  0x00000395  hdr_x-elnk-trace
0.0000054     259       0  0x00000395  hdr_x-authentication-info
0.0000058     243       0  0x00000395  hdr_mail-followup-to
0.0000070     199       0  0x00000395  hdr_x-x-sender
0.0000076     184       0  0x00000395  hdr_resent-to
0.0000085     164       0  0x00000395  hdr_x-egroups-edited-by
0.0000086     163       0  0x00000395  hdr_x-content-filtered-by
0.0000088     159       0  0x00000395  hdr_x-list-host
0.0000100     140       0  0x00000395  hdr_x-enigmail-supports
0.0000100     140       0  0x00000395  hdr_x-enigmail-version
0.0000104     134       0  0x00000395  hdr_x-sequence
0.0000109     128       0  0x00000395  hdr_x-lyris-message-id
0.0000111     126       0  0x00000395  hdr_x-precedence
0.0000124     113       0  0x00000395  hdr_x-sasl-enc
0.0000132     106       0  0x00000395  hdr_x-ms-embedded-report
0.0000133     105       0  0x00000395  hdr_x-bugzilla-reason
0.0000135     104       0  0x00000395  hdr_x-envelope-from
0.0000140     100       0  0x00000395  hdr_x-plug
0.0000144      97       0  0x00000395  hdr_x-mailing-list-name
0.0000144      97       0  0x00000395  hdr_x-old-spam-check-by
0.0000144      97       0  0x00000395  hdr_x-old-spam-status
0.0000146      96       0  0x00000395  hdr_x-contentstamp
0.0000146      96       0  0x00000395  hdr_x-untd-originstamp
0.0000171      82       0  0x00000395  hdr_x-listprocessor-version
0.0000175      80       0  0x00000395  hdr_x-authenticated-sender
0.0000203      69       0  0x00000395  hdr_x-pmx-version
0.0000215      65       0  0x00000395  hdr_x-perlmx-spam
0.0000219      64       0  0x00000395  hdr_x-greylist
0.0000229      61       0  0x00000395  hdr_x-to
0.0000229      61       0  0x00000395  hdr_x-yahoo-newman-id
0.0000237      59       0  0x00000395  hdr_x-yahoo-alertid
0.0000237      59       0  0x00000395  hdr_x-yahoo-alerts-beta
0.0000237      59       0  0x00000395  hdr_x-yahoo-returnbounces
0.0000246      57       0  0x00000395  hdr_x-gmane-mailscanner
0.0000246      57       0  0x00000395  hdr_x-mimedefang-filter
0.0000264      53       0  0x00000394  hdr_x-ebay-mailtracker
0.0000264      53       0  0x00000394  hdr_x-yahoo-newman-expires
0.0000269      52       0  0x00000395  hdr_x-mail-info
0.0000274      51       0  0x00000395  hdr_x-juno-line-breaks
0.0000274      51       0  0x00000395  hdr_x-pop-user
0.0000280      50       0  0x00000394  hdr_x-coriate
0.0000280      50       0  0x00000395  hdr_x-domain
0.0000280      50       0  0x00000395  hdr_x-key
0.0000280      50       0  0x00000395  hdr_x-message-type
0.0000280      50       0  0x00000395  hdr_x-schema
0.0000286      49       0  0x00000395  hdr_x-listserver
0.0000292      48       0  0x00000395  hdr_x-dsncontext
0.0000311      45       0  0x00000395  hdr_envelope-sender
0.0000311      45       0  0x00000395  hdr_x-db
0.0000311      45       0  0x00000395  hdr_x-parse
0.0000318      44       0  0x00000395  hdr_x-cam-antivirus
0.0000318      44       0  0x00000395  hdr_x-cam-scannerinfo
0.0000318      44       0  0x00000395  hdr_x-cam-spamdetails
0.0000326      43       0  0x00000394  hdr_x-bigfish
0.0000341      41       0  0x00000395  hdr_seal-send-time
0.0000350      40       0  0x00000394  hdr_x-warning
0.0000359      39       0  0x00000395  hdr_x-amazon-corporate-relay
0.0000359      39       0  0x00000395  hdr_x-amazon-track
0.0000368      38       0  0x00000395  hdr_x-authenticated
0.0000368      38       0  0x00000395  hdr_x-converted-to-plain-text
0.0000378      37       0  0x00000395  hdr_x-msg-ref
0.0000378      37       0  0x00000395  hdr_x-starscan-version
0.0000378      37       0  0x00000395  hdr_x-viruschecked
0.0000378      37       0  0x00000395  hdr_x-wss-id
0.0000389      36       0  0x00000395  hdr_jobid
0.0000389      36       0  0x00000395  hdr_mailid
0.0000389      36       0  0x00000395  hdr_x-env-sender
0.0000400      35       0  0x00000395  hdr_x-operating-system
0.0000400      35       0  0x00000395  hdr_x-webtv-signature
0.0000400      35       0  0x00000395  hdr_x-y-gmx-trusted
0.0000412      34       0  0x00000395  hdr_x-mailscanner-to
0.0000412      34       0  0x00000395  hdr_x-subscription_info
0.0000424      33       0  0x00000394  hdr_x-evi-mailscanner
0.0000424      33       0  0x00000394  hdr_x-evi-mailscanner-information
0.0000424      33       0  0x00000394  hdr_x-evi-mailscanner-spamcheck
0.0000424      33       0  0x00000395  hdr_restrict
0.0000452      31       0  0x00000395  hdr_x-gmane-nntp-posting-host
0.0000452      31       0  0x00000395  hdr_x-note
0.0000467      30       0  0x00000395  hdr_x-server-uuid
0.0000483      29       0  0x00000395  hdr_x-fid
0.0000483      29       0  0x00000395  hdr_x-mail-handler
0.0000518      27       0  0x00000395  hdr_x-archived-at
0.0000538      26       0  0x00000394  hdr_x-disclaimer
0.0000538      26       0  0x00000395  hdr_x-reply-to
0.0000560      25       0  0x00000394  hdr_x-compuserve-customer
0.0000560      25       0  0x00000394  hdr_x-punge
0.0000560      25       0  0x00000394  hdr_x-sbi
0.0000560      25       0  0x00000394  hdr_x-terminate
0.0000560      25       0  0x00000394  hdr_x-treme
0.0000560      25       0  0x00000395  hdr_x-juno-att
0.0000560      25       0  0x00000395  hdr_x-juno-refparts
0.0000560      25       0  0x00000395  hdr_x-pgp-key
0.0000560      25       0  0x00000395  hdr_x-ufl-scanned-by
0.0000560      25       0  0x00000395  hdr_x-ufl-spam-status
0.0000583      24       0  0x00000394  hdr_x-emailedto
0.0000583      24       0  0x00000394  hdr_x-userid
0.0000583      24       0  0x00000395  hdr_x-frameusers
0.0000609      23       0  0x00000394  hdr_x-newsreader
0.0000609      23       0  0x00000394  hdr_x-ntf-cell_id
0.0000609      23       0  0x00000394  hdr_x-ntf-mime
0.0000609      23       0  0x00000394  hdr_x-ntf-unique_key
0.0000636      22       0  0x00000394  hdr_error
0.0000636      22       0  0x00000394  hdr_usage
0.0000636      22       0  0x00000395  hdr_x-egroups-from
0.0000636      22       0  0x00000395  hdr_x-ks
0.0000636      22       0  0x00000395  hdr_x-mailman-id
0.0000667      21       0  0x00000394  hdr_x-imail-spam-valhelo
0.0000667      21       0  0x00000395  hdr_x-originating-server
0.0000667      21       0  0x00000395  hdr_x-pair-authenticated
0.0000667      21       0  0x00000395  hdr_x-sympa-to
0.0000667      21       0  0x00000395  hdr_x-validation-by
0.0000700      20       0  0x00000394  hdr_x-cron-env
0.0000700      20       0  0x00000394  hdr_x-srs-rewrite
0.0000700      20       0  0x00000394  hdr_x-unsub
0.0000778      18       0  0x00000394  hdr_x-broadcast-flag
0.0000778      18       0  0x00000394  hdr_x-consensus-at-lawyerpoint
0.0000778      18       0  0x00000394  hdr_x-cruelty-to-analog
0.0000778      18       0  0x00000394  hdr_x-modulation
0.0000778      18       0  0x00000395  hdr_x-declude-sender
0.0000778      18       0  0x00000395  hdr_x-newsserver
0.0000778      18       0  0x00000395  hdr_x-smtpserver
0.0000778      18       0  0x00000395  hdr_x-whitelisted
0.0000823      17       0  0x00000394  hdr_x-clips-url
0.0000823      17       0  0x00000394  hdr_x-list-id
0.0000823      17       0  0x00000394  hdr_x-mailscanner-spamcheck
0.0000823      17       0  0x00000394  hdr_x-original-sender
0.0000823      17       0  0x00000394  hdr_x-pstn-levels
0.0000823      17       0  0x00000395  hdr_emacs
0.0000823      17       0  0x00000395  hdr_x-copyright
0.0000875      16       0  0x00000394  hdr_x-gpg-key-fingerprint
0.0000875      16       0  0x00000394  hdr_x-uwash-spam
0.0000875      16       0  0x00000395  hdr_x-me-uuid
0.0000875      16       0  0x00000395  hdr_x-ob-received
0.0000875      16       0  0x00000395  hdr_x-unsubscribe
0.0000875      16       0  0x00000395  hdr_x-usanet-msgid
0.0000875      16       0  0x00000395  hdr_x-usanet-source
0.0000933      15       0  0x00000394  hdr_x-newsgroups
0.0000933      15       0  0x00000394  hdr_x-quris
0.0000933      15       0  0x00000394  hdr_x-spamscore
0.0000933      15       0  0x00000395  hdr_x-mho-user
0.0000933      15       0  0x00000395  hdr_x-report-abuse-to


Spam Headers:

0.9997579       0      10  0x00000394  hdr_elistexpress-info
0.9997799       0      11  0x00000394  hdr_x-mdrcpt-to
0.9997982       0      12  0x00000394  hdr_error-to
0.9997982       0      12  0x00000395  hdr_content-alias
0.9998270       0      14  0x00000395  hdr_x-gmx-antivirus
0.9998386       0      15  0x00000394  hdr_x-mailnum
0.9998576       0      17  0x00000394  hdr_x-contact
0.9998576       0      17  0x00000394  hdr_x-titankey-e_id
0.9998655       0      18  0x00000394  hdr_x-gfol
0.9998725       0      19  0x00000394  hdr_x-library
0.9998725       0      19  0x00000395  hdr_x-desist
0.9998725       0      19  0x00000395  hdr_x-satatus
0.9998725       0      19  0x00000395  hdr_x_uid
0.9998725       0      19  0x00000395  hdr_x-yurttell
0.9998991       0      24  0x00000394  hdr_x-header-companydbusername
0.9998991       0      24  0x00000394  hdr_x-header-masterid
0.9998991       0      24  0x00000394  hdr_x-header-versions
0.9998991       0      24  0x00000395  hdr_x-kaspersky-antivirus
0.9998991       0      24  0x00000395  hdr_x-subscriber
0.9999069       0      26  0x00000395  hdr_x-campidz
0.9999165       0      29  0x00000395  hdr_language
0.9999243       0      32  0x00000394  hdr_x-astrocenter-type
0.9999243       0      32  0x00000394  hdr_x-astrocenter-uid
0.9999327       0      36  0x00000395  hdr_x-rav-antivirus
0.9999327       0      36  0x00000395  hdr_x-rocket-spam
0.9999363       0      38  0x00000394  hdr_x-spam-forward
0.9999395       0      40  0x00000395  hdr_authentication-results
0.9999395       0      40  0x00000395  hdr_x-yahoo-forwarded
0.9999534       0      52  0x00000394  hdr_x-mozilla-draft-info
0.9999534       0      52  0x00000394  hdr_x-mozilla-status
0.9999560       0      55  0x00000395  hdr_x-cid
0.9999560       0      55  0x00000395  hdr_x-yahoofilteredbulk
0.9999644       0      68  0x00000395  hdr_x-disembark
0.9999644       0      68  0x00000395  hdr_x-nthart
0.9999649       0      69  0x00000395  hdr_x-clienthost
0.9999649       0      69  0x00000395  hdr_x-ip
0.9999664       0      72  0x00000395  hdr_x-mailingid
0.9999693       0      79  0x00000395  hdr_x-unsent
0.9999839       0     150  0x00000395  hdr_original-recipient
0.9999916       0     290  0x00000395  hdr_x-message-info
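
For anyone who wants to work with listings like these, here is a rough sketch that parses the lines and splits the tokens into ham-only and spam-only groups. It assumes the five whitespace-separated columns are spam probability, ham count, spam count, flags and token, which fits the zero counts in the two listings. The function names parse_dump_line and one_sided are made up for illustration, and the three sample lines are copied from the listings.

```python
def parse_dump_line(line):
    """Split one listing line into (probability, ham, spam, flags, token).

    Assumed column order: spam probability, ham count, spam count,
    hex flags, token.
    """
    prob, ham, spam, flags, token = line.split()
    return float(prob), int(ham), int(spam), int(flags, 16), token

def one_sided(lines):
    """Return (ham_only, spam_only) token lists from listing lines."""
    ham_only, spam_only = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between listings
        _, ham, spam, _, token = parse_dump_line(line)
        if spam == 0 and ham > 0:
            ham_only.append(token)
        elif ham == 0 and spam > 0:
            spam_only.append(token)
    return ham_only, spam_only

# Three lines copied from the listings in this thread.
listing = [
    "0.0000018     786       0  0x00000395  hdr_article",
    "0.0000058     243       0  0x00000395  hdr_mail-followup-to",
    "0.9999916       0     290  0x00000395  hdr_x-message-info",
]

ham_only, spam_only = one_sided(listing)
print("ham only: ", ham_only)
print("spam only:", spam_only)
```

Tokens seen in both ham and spam land in neither list, so this only picks out the one-sided headers like the ones shown in this thread.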