Posted to users@spamassassin.apache.org by OliverScott <ol...@fhsinternet.com> on 2008/02/03 16:14:37 UTC

Script to generate whitelist based on outgoing email

Not sure if this will be of any use to anyone else, or if it can be made to
work with anything other than Exim, but here is the first draft of a script
to generate a whitelist based on outgoing email! I have had it running on a
server (for the last 2 months) handling 20,000 emails a week for a variety
of end users and as yet it hasn't caused any problems, and has helped to
reduce the chances of false positives...

I got the idea as a lot of desktop antispam solutions will automatically add
the addresses of people you send email to, to a whitelist. Usually this
feature is called something like AutoWhiteList (not to be confused with the
SpamAssassin AWL which does something else entirely).

The following script (which I hope comes through successfully) looks through
the last 4 weeks of Exim mail logs and can be used to generate a SpamAssassin
rule file to down-score incoming emails (or as part of a shortcircuit rule).
I admit to having very little knowledge of Linux utilities and scripts,
having only started messing with them a few months ago, so I am sure someone
with better skills than mine will have a good laugh at what I have done, but
the idea is there and though the code is not elegant it does work!

I would appreciate any suggestions or comments you have :D


########## The Script - out.sh ##########

# Script to create a SpamAssassin ruleset to down-score emails from
# addresses which have previously had email SENT to them.
# This is designed to work with Exim logs and will need to be customised to
# fit your system!

# This script looks at the current mail log and the ones from the previous
# four weeks and is designed to be run once per day (probably at night).
# NOTE: Email addresses which have repeatedly been sent to over this period
# are given a better score than ones which appear in only one log file.

# This script is in no way optimised or designed for use on a production
# mail server - it is very much a proof of concept!

# Version 0.1 Alpha - Updated 09-12-2007 (D-M-Y)

# Bugs / ToDo's:
# Currently if a log file does not include any outgoing email then the
# generated rule will match EVERY incoming email. Make sure you don't
# schedule it directly after a log-rotate!

# Usage:
# ./out.sh > out.cf


# The process:
# AWK the current email log for lines which relate to outgoing email sent by
# local users
# Sort it alphabetically
# Remove any duplicates
# NOTE: the next few steps can probably be done with one command if you have
# been using TR and SED for more than the 10 minutes I have!
# Remove line breaks - replace them with commas
# Remove the final comma
# Replace the commas with |
# Escape the .'s using SED
# Escape the @'s using SED
# Create the text of a SpamAssassin rule which matches any email addresses
# that have been sent to in the mail log file
# Remove line breaks created by AWK


awk '/T=remote_smtp/ && /[Cc]="250 [Oo][Kk]/ && !/F=<>/ {print $5}' /var/log/exim/mainlog \
  | sort | uniq | tr "\r\n" "," | sed '$s/,$//' | tr "," "|" \
  | sed 's/[.]/\\./g' | sed 's/[@]/\\@/g' \
  | awk 'BEGIN {print "header __MAIL_SENT_TO_0 FROM =~ /("} {print $0} END {print ")/i\n"}' \
  | tr -d "\r\n"
echo
echo describe __MAIL_SENT_TO_0 From address which had been sent to during the last week

echo

awk '/T=remote_smtp/ && /[Cc]="250 [Oo][Kk]/ && !/F=<>/ {print $5}' /var/log/exim/mainlog.1 \
  | sort | uniq | tr "\r\n" "," | sed '$s/,$//' | tr "," "|" \
  | sed 's/[.]/\\./g' | sed 's/[@]/\\@/g' \
  | awk 'BEGIN {print "header __MAIL_SENT_TO_1 FROM =~ /("} {print $0} END {print ")/i\n"}' \
  | tr -d "\r\n"
echo
echo describe __MAIL_SENT_TO_1 From address which had been sent to one week ago

echo

awk '/T=remote_smtp/ && /[Cc]="250 [Oo][Kk]/ && !/F=<>/ {print $5}' /var/log/exim/mainlog.2 \
  | sort | uniq | tr "\r\n" "," | sed '$s/,$//' | tr "," "|" \
  | sed 's/[.]/\\./g' | sed 's/[@]/\\@/g' \
  | awk 'BEGIN {print "header __MAIL_SENT_TO_2 FROM =~ /("} {print $0} END {print ")/i\n"}' \
  | tr -d "\r\n"
echo
echo describe __MAIL_SENT_TO_2 From address which had been sent to two weeks ago

echo

awk '/T=remote_smtp/ && /[Cc]="250 [Oo][Kk]/ && !/F=<>/ {print $5}' /var/log/exim/mainlog.3 \
  | sort | uniq | tr "\r\n" "," | sed '$s/,$//' | tr "," "|" \
  | sed 's/[.]/\\./g' | sed 's/[@]/\\@/g' \
  | awk 'BEGIN {print "header __MAIL_SENT_TO_3 FROM =~ /("} {print $0} END {print ")/i\n"}' \
  | tr -d "\r\n"
echo
echo describe __MAIL_SENT_TO_3 From address which had been sent to three weeks ago

echo

awk '/T=remote_smtp/ && /[Cc]="250 [Oo][Kk]/ && !/F=<>/ {print $5}' /var/log/exim/mainlog.4 \
  | sort | uniq | tr "\r\n" "," | sed '$s/,$//' | tr "," "|" \
  | sed 's/[.]/\\./g' | sed 's/[@]/\\@/g' \
  | awk 'BEGIN {print "header __MAIL_SENT_TO_4 FROM =~ /("} {print $0} END {print ")/i\n"}' \
  | tr -d "\r\n"
echo
echo describe __MAIL_SENT_TO_4 From address which had been sent to four weeks ago

echo
echo

echo meta MAIL_SENT_TO \(\(__MAIL_SENT_TO_0 + __MAIL_SENT_TO_1 + \
__MAIL_SENT_TO_2 + __MAIL_SENT_TO_3 + __MAIL_SENT_TO_4\) \> 0\)
echo describe MAIL_SENT_TO From an address which had been sent to in one of \
the weeks from the last month
echo tflags MAIL_SENT_TO nice
echo score MAIL_SENT_TO -1.000

echo

echo meta MAIL_SENT_TO_REGULAR \(\(__MAIL_SENT_TO_0 + __MAIL_SENT_TO_1 + \
__MAIL_SENT_TO_2 + __MAIL_SENT_TO_3 + __MAIL_SENT_TO_4\) \> 1\)
echo describe MAIL_SENT_TO_REGULAR From an address which had been sent to in \
more than one week from the last month
echo tflags MAIL_SENT_TO_REGULAR nice
echo score MAIL_SENT_TO_REGULAR -1.500

########### End Of Script ##########
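
The empty-log bug listed under "Bugs / ToDo's" and the five near-identical
pipelines could both be handled with a small function. What follows is only a
sketch under assumptions that are not part of out.sh: the function names
emit_rule and generate are made up here, From:addr with an anchored pattern is
swapped in for the unanchored FROM match, every character other than letters,
digits, "_" and "-" is escaped (not just "." and "@"), and (?!) is used as a
pattern that can never match.

```shell
#!/bin/sh
# Sketch only, not a drop-in replacement for out.sh.
# emit_rule and generate are hypothetical names; log paths are assumptions.

# emit_rule N LOGFILE
# Print one sub-rule matching every address that LOGFILE shows mail was sent to.
emit_rule() {
    addrs=""
    if [ -r "$2" ]; then
        addrs=$(awk '/T=remote_smtp/ && /[Cc]="250 [Oo][Kk]/ && !/F=<>/ {print $5}' "$2" |
            sort -u |
            sed 's/[^[:alnum:]_-]/\\&/g' |
            paste -s -d '|' -)
    fi
    if [ -n "$addrs" ]; then
        printf 'header __MAIL_SENT_TO_%s From:addr =~ /^(?:%s)$/i\n' "$1" "$addrs"
    else
        # No outgoing mail found: emit a pattern that can never match,
        # instead of an empty alternation that would match everything.
        printf 'header __MAIL_SENT_TO_%s From:addr =~ /(?!)/\n' "$1"
    fi
}

# generate BASE
# Emit sub-rules for BASE, BASE.1 ... BASE.4 (current log plus four rotations).
generate() {
    emit_rule 0 "$1"
    for n in 1 2 3 4; do
        emit_rule "$n" "$1.$n"
    done
}
```

Something like "generate /var/log/exim/mainlog" would then stand in for the
five pipelines, with the meta rules at the end of the script left unchanged
since the sub-rule names are the same.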
-- 
View this message in context: http://www.nabble.com/Script-to-generate-whitelist-based-on-outgoing-email-tp15254287p15254287.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: Script to generate whitelist based on outgoing email

Posted by Jonas Eckerman <jo...@frukt.org>.

OliverScott wrote:
> Not sure if this will be of any use to anyone else, or if it can be made to
> work with anything other than Exim, but here is the first draft of a script
> to generate a whitelist based on outgoing email!

That seems like it could create a pretty huge ruleset for sites 
with many users (and for sites hosting mailing lists), but should 
be good for sites where the ruleset won't grow too big.

> of end users and as yet it hasn't caused any problems, and has helped to
> reduce the chances of false positives...

We do something similar (but use a SQL database and a plugin 
instead). My experience is that it usually isn't important in 
order to avoid false positives. On a few occasions it has been 
though (and anything that avoids FPs without creating too many 
FNs is good IMO), and it might make a difference for the Bayes 
auto-learner.


Our plugin looks up sender and recipient addresses as well as 
References, In-Reply-To and Subject in order to categorize 
incoming mail as "reply",  "probable reply" and "possible reply" 
to outgoing mail.
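
As a rough illustration of that kind of categorization (not the actual
plugin, whose logic is not shown in this message), the sketch below uses two
flat files in place of the SQL database: one listing addresses that local
users have sent mail to, and one listing the Message-IDs of outgoing mail.
The function name, the file layout, and the mapping from signals to the
three categories are all assumptions, and the References and Subject checks
are left out.

```shell
# classify_reply FROM_ADDR IN_REPLY_TO RCPTS_FILE MSGIDS_FILE
# Hypothetical helper: RCPTS_FILE holds one previously-mailed address per line,
# MSGIDS_FILE holds one outgoing Message-ID per line.
classify_reply() {
    known_rcpt=no
    known_id=no
    # Case-insensitive, whole-line, fixed-string match on the sender address.
    if [ -n "$1" ] && grep -qixF -e "$1" "$3"; then
        known_rcpt=yes
    fi
    # Exact match of In-Reply-To against a Message-ID that was sent out.
    if [ -n "$2" ] && grep -qxF -e "$2" "$4"; then
        known_id=yes
    fi
    # Assumed mapping: both signals, only the Message-ID, only the address.
    case "$known_id/$known_rcpt" in
        yes/yes) echo "reply" ;;
        yes/no)  echo "probable reply" ;;
        no/yes)  echo "possible reply" ;;
        *)       echo "unrelated" ;;
    esac
}
```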

Here, the database is populated by our MIMEDefang filter, but it 
could of course be populated from log files instead.

This is obviously more involved and requires more fiddling than 
your straightforward script, so my guess is that our different 
solutions to the same basic idea would appeal to different people.

Anyway, if anyone wants to look at a database-based variant of 
scoring incoming mail based on outgoing mail, it's available at:
<http://whatever.frukt.org/spamassassin.text.shtml>


Regards
/Jonas Eckerman

-- 
Jonas Eckerman, FSDB & Fruktträdet
http://whatever.frukt.org/
http://www.fsdb.org/
http://www.frukt.org/


Re: Script to generate whitelist based on INCOMING email????

Posted by phuong hanu <ph...@gmail.com>.

http://old.nabble.com/file/p31192159/db.rar db.rar 

In fact, I have an email db (see the attachment). Now I want to generate a
db that stores each domain and its legitimate IP addresses (this is my
whitelist).

I think there are 2 ways to do that:

1. Contact the domain name owners and ask for the IP addresses of each
domain, but I think that is an impossible mission.

2. Build the db from the info extracted from email headers. This is my
question.

Do you get my point?


Re: Script to generate whitelist based on INCOMING email????

Posted by Martin Gregorie <ma...@gregorie.org>.
On Sat, 2011-03-19 at 07:37 -0700, phuong hanu wrote:
> I just have difficulty working out how to create my list (whitelist) from
> the email db.
> 
> You know, we must have our own whitelist before using some technique
> (plugin, service) to prevent spam based on that list.
> 
Before we can help we need to know exactly what you are trying to do and
how you are storing mail in the 'email db'.

This is not clear from what you've just written. For starters, a
whitelist is *NOT* used to 'prevent spam'. A whitelist is used to
prevent mail from members of the whitelist from being treated as spam
even if that is what it is.


Martin




Re: Script to generate whitelist based on INCOMING email????

Posted by phuong hanu <ph...@gmail.com>.
I just have difficulty working out how to create my list (whitelist) from
the email db.

You know, we must have our own whitelist before using some technique
(plugin, service) to prevent spam based on that list.

So I really need help from those of you who have experience.


Re: Script to generate whitelist based on INCOMING email????

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2011-03-17 at 23:21 -0700, phuong hanu wrote:
> Actually, the problem is not with the MySQL command. I just want
> suggestions about a script that can extract info from the email headers in
> my email db to create a list (whitelist) for future use.
> 
IMO doing what you are asking about is asking for trouble. The only
auto-whitelist I'd trust would be built from the recipients of outgoing
mail.


Martin



Re: Script to generate whitelist based on INCOMING email????

Posted by phuong hanu <ph...@gmail.com>.
Actually, the problem is not with the MySQL command. I just want
suggestions about a script that can extract info from the email headers in
my email db to create a list (whitelist) for future use.

--> whitelist process. I'm working on the plugin, but that is separate from
the process of generating the db for my whitelist.

Before testing the plugin, I have to have my own whitelist.



Re: Script to generate whitelist based on INCOMING email????

Posted by Bowie Bailey <Bo...@BUC.com>.
On 3/17/2011 6:01 AM, phuong hanu wrote:
> 	
> Hi,
>
> I've just read your post on Nabble.
>
> I am sending this message to ask about a problem that I have to solve now.
> I have a database of email on my Linux virtual machine. The table includes
> fields such as ID, Spam, Data, Time, Sender_add, sender_ip,
> sender_domain,....
>
> I am doing a project on automatic whitelisting, so the data preprocessing
> is very important. My problem is that I still don't know how to generate a
> database for my whitelist from that database, because one domain can have
> many IP addresses. My job is to group them all (by a script maybe).
>
> For example: gmail.com: 38.98.127.148, 74.125.46.29, 74.125.46.30, .....
>
> From those IP-domain pairs, I have to find a threshold to figure out which
> IPs are used for sending spam. The threshold can be "3 days" (for example),
> because spammers will use an IP to spread spam only for such a short time.
> After removing the illegal IPs, we have the final whitelist to apply in the
> email system.
>
> So all I care about are sender_ip and sender_domain. When I use a MySQL
> command to list the rows in the table, the result is more than 46,000
> rows >.< (SELECT sender_ip, sender_domain FROM emailsl;)
> ---> I cannot do it manually by looking at each line and noting down on
> paper "what domain" has "what IP".
>
> That is why I am asking you for a method to solve this pre-problem. This
> data preprocessing step is very important because it creates the DB for my
> whitelist in any email system. After that, I'll create a plugin for
> SpamAssassin to whitelist automatically based on the list that I
> preprocessed.
>
> What I have: email db, Linux virtual machine, MySQL
>
> What I want: build a db that shows the pairs of sender domain and legal IP
> (crossing out domain/illegal-IP pairs based on the threshold)
>
> Hope you see my point and can help me with that.

That's a very open-ended question and, other than the comment about
creating a plugin, is mostly off-topic here.  Questions related to
pulling data from your DB would probably be better asked in a MySQL
forum.  Once you get a bit further and are trying to integrate with SA,
we can help with that.

In either case, try to ask simple, specific questions.  You will get far
more responses that way.
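
For the grouping step in the question (collecting all IPs seen for each
domain), a minimal sketch in shell might look like the following. It assumes
the rows have been exported as tab-separated "sender_ip<TAB>sender_domain"
lines; the function name group_ips is made up for illustration, and the
threshold logic for discarding IPs is not covered.

```shell
# group_ips: read "sender_ip<TAB>sender_domain" rows on stdin and print one
# "domain: ip1, ip2, ..." line per domain, in order of first appearance.
# Hypothetical helper; the input layout is an assumption.
group_ips() {
    awk -F '\t' '
        NF >= 2 {
            d = tolower($2)                 # treat Gmail.com and gmail.com alike
            if (seen[d, $1]++) next         # skip repeated domain/IP pairs
            if (!(d in ips)) { order[++n] = d; ips[d] = $1 }
            else ips[d] = ips[d] ", " $1
        }
        END { for (i = 1; i <= n; i++) printf "%s: %s\n", order[i], ips[order[i]] }'
}
```

The mysql client prints tab-separated rows when run with --batch and
--skip-column-names, so the output of the SELECT quoted above could be piped
straight into group_ips.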

-- 
Bowie

Script to generate whitelist based on INCOMING email????

Posted by phuong hanu <ph...@gmail.com>.
	
Hi,

I've just read your post on Nabble.

I am sending this message to ask about a problem that I have to solve now.
I have a database of email on my Linux virtual machine. The table includes
fields such as ID, Spam, Data, Time, Sender_add, sender_ip,
sender_domain,....

I am doing a project on automatic whitelisting, so the data preprocessing
is very important. My problem is that I still don't know how to generate a
database for my whitelist from that database, because one domain can have
many IP addresses. My job is to group them all (by a script maybe).

For example: gmail.com: 38.98.127.148, 74.125.46.29, 74.125.46.30, .....

From those IP-domain pairs, I have to find a threshold to figure out which
IPs are used for sending spam. The threshold can be "3 days" (for example),
because spammers will use an IP to spread spam only for such a short time.
After removing the illegal IPs, we have the final whitelist to apply in the
email system.

So all I care about are sender_ip and sender_domain. When I use a MySQL
command to list the rows in the table, the result is more than 46,000
rows >.< (SELECT sender_ip, sender_domain FROM emailsl;)
---> I cannot do it manually by looking at each line and noting down on
paper "what domain" has "what IP".

That is why I am asking you for a method to solve this pre-problem. This
data preprocessing step is very important because it creates the DB for my
whitelist in any email system. After that, I'll create a plugin for
SpamAssassin to whitelist automatically based on the list that I
preprocessed.

What I have: email db, Linux virtual machine, MySQL

What I want: build a db that shows the pairs of sender domain and legal IP
(crossing out domain/illegal-IP pairs based on the threshold)

Hope you see my point and can help me with that.


Re: Script to generate whitelist based on outgoing email

Posted by Mark Martinec <Ma...@ijs.si>.
On Sunday 03 February 2008 16:14:37 OliverScott wrote:
> Not sure if this will be of any use to anyone else, or if it can be made to
> work with anything other than Exim, but here is the first draft of a script
> to generate a whitelist based on outgoing email! I have had it running on a
> server (for the last 2 months) handling 20,000 emails a week for a variety
> of end users and as yet it hasn't caused any problems, and has helped to
> reduce the chances of false positives...
>
> I got the idea as a lot of desktop antispam solutions will automatically
> add the addresses of people you send email to, to a whitelist. Usually this
> feature is called something like AutoWhiteList (not to be confused with the
> SpamAssassin AWL which does something else entirely).
>
> The following script (which I hope comes through successfully) looks through
> the last 4 weeks of Exim mail logs and can be used to generate a
> SpamAssassin rule file to down-score incoming emails ...

For some more ideas - with amavisd-new, this feature is called 'pen pals',
and uses an exponential decay since the last matching message
for calculating a bonus score. It is described in the release notes
from when it was first introduced (2.4.2, June 27, 2006):
  http://www.ijs.si/software/amavisd/release-notes.txt
search for:
  new feature: "pen pals soft-whitelisting" lowers spam score of received
  replies (or followup correspondence) to a message previously sent by a
  local user to this address; ...

With a later version (2.5.0), matching of In-Reply-To and References to a
Message-ID header field was added, which facilitates passing of replies
to mailing list postings. The pen pals soft-whitelisting is a very useful
feature to reduce the number of false positives.
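
To illustrate what an exponentially decaying bonus looks like, here is a
minimal sketch. It is not taken from amavisd-new: the half-life form of the
decay, the function name penpal_bonus, and the example values (a maximum
bonus of 8 points and a 7-day half-life) are assumptions chosen for
illustration.

```shell
# penpal_bonus MAX_BONUS HALFLIFE_SECONDS AGE_SECONDS
# Hypothetical helper: print the bonus left AGE_SECONDS after the last
# matching outgoing message, halving once per HALFLIFE_SECONDS.
penpal_bonus() {
    awk -v max="$1" -v half="$2" -v age="$3" \
        'BEGIN { printf "%.3f\n", max * exp(-log(2) * age / half) }'
}

# Example with assumed values: 8 points maximum, 7-day half-life.
penpal_bonus 8 604800 0         # no time elapsed: the full bonus
penpal_bonus 8 604800 604800    # one half-life later: half the bonus
```

The older the last outgoing message to that address, the smaller the bonus,
so a correspondent from months ago gets almost no benefit.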

  Mark