Posted to server-user@james.apache.org by Marc Chamberlin <ma...@marcchamberlin.com> on 2008/09/20 09:11:23 UTC
Using the Bayesian Analysis mailet
I was recently advised here on this group to enable the Bayesian
Analysis mailet on the James server in order to help control some spam
that is getting created and sent to some of my maillists that I sponsor.
I have tried to understand and follow the documentation on the James
wiki site but so far have not been able to get it set up and running properly.
A couple of points in particular have me confused -
1. on the wiki page it says -
"It is a good idea to activate SMTP AUTH and replace thisdomain.com with
a domain not listed as a server in <servernames> in config.xml: this way
only authenticated users can feed the corpus. An example of addresses to
use could be "[MAILTO] ham@bayes.feeder" and "[MAILTO] spam@bayes.feeder". "
My server is already set up to use SMTP AUTH. I have a single domain
name that I have purchased from Network Solutions, let's call it
mydomain.com. I have listed mydomain.com in the <servernames> section of
the James config.xml file. I want this server to service both internal
and external users. So what exactly is this suggestion asking me to do?
Do I need to purchase another domain name in order to run this Bayesian
Analysis mailet? That does not make sense to me...
I presume (guess) that I can use a more fully qualified address, as I did
below, which seems, at first glance, to have worked. Here I preceded my domain
with the name of the machine on which I am running the James server.
<mailet match="RecipientIs=spam@myhostservername.mydomain.com"
        class="BayesianAnalysisFeeder">
  <repositoryPath> db://maildb </repositoryPath>
  <feedType>spam</feedType>
  <maxSize>200000</maxSize>
</mailet>
2. I am not sure I fully understand the concept of having both a spam
and a ham feedback to the Bayesian Analyzer. Spam I can understand, that
is used to teach the analyzer what is spam. But why have a ham feedback?
Do the users have to teach the analyzer what is good email also??? That
seems like an extraordinary burden to place on them.
3. That last question is related to this next one because I haven't got
this working yet. So I don't yet know what to expect fully when I do get
it working. I pretty much set up my config.xml with as little change as
possible. I simply uncommented the two bayesian analysis feeder mailets
and modified the RecipientIs parameters as described above. Then I
uncommented the four bayesian analysis mailets and left them as is. The
log files showed that a bunch of tables got created in the mysql
database OK. Next I went to my Junk folder in my email client and fed
a few pieces of already collected spam back to the bayesian analysis
feeder address as mail attachments, just to get it started. James seems
quite happy to accept them and nothing bounced so I figured it was
working. I left the mail server running to see how it would behave and
the trouble is James ate EVERYTHING that came in from outside senders,
but internal users could send email to each other OK. So we seem to have
lost a whole lot of email today and I had to turn the Bayesian analyzer off.
So what did I do wrong? Doesn't seem to have worked too well 'out of the
box'! The documentation seems to be unclear on some of this or just
plain missing, the only thing I could find was on the wiki pages.
Nothing in the main documentation. Another question - What does James do
with the email that it filters out with the Bayesian analyzer? I looked
in the spam folder in the mail database and nothing was there. Nor did
the postmaster address receive anything? Do these emails simply go to
/dev/null? Is there some kind of summary email sent to the users so that
they can verify/retrieve email if necessary? (perhaps that is the
purpose of the ham feedback? I am guessing.....)
Hopefully someone will help walk me out of these woods, I am kinda
lost.. Thanks in advance...
Marc...
---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org
Re: Using the Bayesian Analysis mailet
Posted by David Legg <da...@searchevent.co.uk>.
Hi Marc,
> Occasionally I do find an email that was not really spam and I have to
> do two things with it, send it back to James's nospam mailet, and
> forward it on to the user who should have gotten it. This latter step
> is a bit of a pain because I usually have to clean up the email (or
> forward it as is)..
>
> Is there an easy way to send the original version of an email to the
> user which was accidentally marked as spam? I hope I don't have to
> keep double checking all this spam, I get thousands daily... How do
> you and others handle the spam that the Bayesian filter catches?
> Suggestions welcomed, this could get rather tedious for me...
It will get better quite quickly until you reach the 'maintenance'
level. At that point your workload will diminish to the point where you
only have to mark single digits of emails as spam a day. My spam
database has analysed over 16,000 examples of spam in the last 3 years
and I still get flurries where a new flavour of spam slips through. I
guess there are only so many ways to spell that word beginning with 'V'!
After a while I stopped forwarding false positives completely. This
sounds bad, but like you, I didn't fancy manually forwarding messages to
users for the rest of my life. One thing which I found helps
enormously is to set up the whitelist manager in James. This ensures
that when your users send a message to someone their address is added to
a 'whitelist'. If an email is received where the from address matches
someone in the whitelist then it is let through without spam checking.
Like the Bayesian filter your users can manually add or remove addresses
from the whitelist by sending special emails to the whitelist manager
email address.
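
The whitelist behaviour described above can be modelled in a few lines.
This is only an illustrative Python sketch, not the James implementation:
the class name Whitelist and its methods are hypothetical, and keeping a
separate list per local user is an assumption on my part.

```python
from collections import defaultdict


class Whitelist:
    """Toy model of an automatic whitelist (hypothetical, not James code)."""

    def __init__(self):
        # local user -> set of remote addresses that user has written to
        self._entries = defaultdict(set)

    def record_outgoing(self, local_user, recipient):
        """Called when a local user sends mail: remember the recipient."""
        self._entries[local_user.lower()].add(recipient.lower())

    def remove(self, local_user, address):
        """Manual removal, e.g. triggered by a command email."""
        self._entries[local_user.lower()].discard(address.lower())

    def skips_spam_check(self, sender, local_user):
        """True if mail from sender to local_user bypasses the filter."""
        known = self._entries.get(local_user.lower(), set())
        return sender.lower() in known
```

The point of the design is that users build the list simply by writing to
people, so most legitimate correspondents never reach the Bayesian filter.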
Another alternative you could try until your filter is working well
enough is to simply raise the spam score threshold to a really high
figure like 95%. This will make the filter more lenient and reduce the
false positives at the expense of letting more spam through. Be aware
though that the spam score tends to fluctuate wildly. You will tend to
find that most scores are either very very small or very very large and
only relatively few will have a score between 1% and 99%.
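
To see why the scores pile up at the two extremes, here is a small Python
sketch. It assumes the filter combines per-token probabilities the way
Paul Graham's "A Plan for Spam" does; the thread does not say how James
computes its score, so treat the formula and the function name combine as
assumptions used for illustration only.

```python
from math import prod  # Python 3.8+


def combine(token_probs):
    """Combine per-token spam probabilities, naive-Bayes style.

    Each value is the estimated probability that a message containing
    that token is spam.
    """
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Fifteen mildly spammy tokens already give a score extremely close to 1.
print(combine([0.9] * 15))
# Fifteen mildly hammy tokens give a score extremely close to 0.
print(combine([0.1] * 15))
# Only evenly balanced evidence lands near the middle.
print(combine([0.9, 0.1, 0.5]))
```

Because the evidence is multiplied, a handful of tokens leaning the same
way is enough to saturate the result, which is why a threshold of 50% and
one of 95% often behave almost the same in practice.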
If you used my config.xml settings you should find that all emails
passing through your server have a header like this: -
X-MessageIsSpamProbability: 3.455076140932531E-22
You can use this to satisfy your curiosity about any email's score.
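
If you want to check that header from a script rather than by eye, a
minimal Python sketch using the standard email module could look like
this. The function name spam_probability is hypothetical; the header name
is the one set by headerName in my config.xml.

```python
from email import message_from_string


def spam_probability(raw_message, header="X-MessageIsSpamProbability"):
    """Return the score from the header as a float, or None if unusable."""
    msg = message_from_string(raw_message)
    value = msg.get(header)
    if value is None:
        return None
    try:
        # float() accepts scientific notation such as 3.455076140932531E-22
        return float(value)
    except ValueError:
        return None


raw = (
    "X-MessageIsSpamProbability: 3.455076140932531E-22\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
score = spam_probability(raw)
print(score is not None and score > 0.50)
```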
Another thing I like to do sometimes to gauge how effectively the spam
filter is working is to search the daily mailet log file and count the
number of lines containing '%;'. This is because every analysis result
is recorded like the following: -
29/03/08 17:59:41 INFO James.Mailet: BayesianAnalysis:
X-MessageIsSpamProbability: 100%; From: extrmtao@goline.ca;
Recipient(s): [someone@localhost]
So a command like: -
egrep '100%;' mailet-2008-09-27-17-14.log | wc -l
will give you an indication of how many emails have been rejected that
day. If you do that over several months it gives you an idea how fast
spam levels are rising!
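
For a slightly finer breakdown than a single count, the same log can be
tallied with a short Python script. This is a sketch under assumptions:
it expects the "X-MessageIsSpamProbability: N%;" fragment to sit on one
line of the log, as in the sample above, and the function name tally and
the 50% default are mine.

```python
import re
from collections import Counter

# Matches e.g. "X-MessageIsSpamProbability: 100%;" and captures the number.
SCORE = re.compile(r"X-MessageIsSpamProbability: (\d+(?:\.\d+)?)%;")


def tally(lines, threshold=50.0):
    """Count analysed messages above and at-or-below the threshold."""
    counts = Counter()
    for line in lines:
        match = SCORE.search(line)
        if match:
            label = "spam" if float(match.group(1)) > threshold else "ham"
            counts[label] += 1
    return counts


sample = [
    "29/03/08 17:59:41 INFO James.Mailet: BayesianAnalysis: "
    "X-MessageIsSpamProbability: 100%; From: extrmtao@goline.ca; "
    "Recipient(s): [someone@localhost]",
    "29/03/08 18:01:02 INFO James.Mailet: BayesianAnalysis: "
    "X-MessageIsSpamProbability: 0%; From: friend@example.org; "
    "Recipient(s): [someone@localhost]",
    "29/03/08 18:01:03 INFO James.Mailet: some unrelated line",
]
print(tally(sample))
```

To run it on a real file, pass the open file object instead of the sample
list, for example tally(open("mailet-2008-09-27-17-14.log")).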
Regards,
David Legg
Re: Using the Bayesian Analysis mailet
Posted by Marc Chamberlin <ma...@marcchamberlin.com>.
Thanks David for all your help! I really appreciated it. I think I have
the Bayesian filter running fine now, and it does seem to be getting
better at sorting out spam from the good stuff. Especially as I send it
feedback on what it misses as spam and what it mistakenly thinks is
spam. I have set up James to send to me, wearing my "postmaster" hat,
everything that the Bayesian filter thinks is spam. Occasionally I do
find an email that was not really spam and I have to do two things with
it, send it back to James's nospam mailet, and forward it on to the user
who should have gotten it. This latter step is a bit of a pain because
I usually have to clean up the email (or forward it as is)..
Is there an easy way to send the original version of an email that was
accidentally marked as spam on to its user? I hope I don't have to keep
double checking all this spam, I get thousands daily... How do you and
others handle the spam that the Bayesian filter catches? Suggestions
welcomed, this could get rather tedious for me...
Marc...
Re: Using the Bayesian Analysis mailet
Posted by David Legg <da...@searchevent.co.uk>.
Hi Marc,
> My server is already set up to use SMTP AUTH. I have a single domain
> name that I have purchased from Network Solutions, lets call it
> mydomain.com. I have listed mydomain.com in the <servernames> section
> of the James config.xml file. I want this server to service both
> internal and external users. So what exactly is this suggestion asking
> me to do? Do I need to purchase another domain name in order to run
> this Bayesian Analysis mailet? That does not make sense to me...
No, there is no need to purchase another domain. The instructions are
suggesting that you choose email addresses which are impossible for
outsiders to use in order to tell the server what is ham and what is
spam. For example, in my server I literally chose 'spam@xxx.yyy' and
'not.spam@xxx.yyy'. Not only are these unlikely to be real domains but
anyone trying to send emails addressed to these addresses will be
challenged for an SMTP password. Thus only authorized users will be
able to train the Bayesian analysis filter.
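
The reasoning can be spelled out as a tiny Python model. It is a
simplification rather than James code: the function name
accepts_feed_mail is hypothetical, and it ignores any trusted-network
exceptions a real server may be configured with.

```python
def accepts_feed_mail(recipient, servernames, authenticated):
    """Would the SMTP server accept mail for this recipient?

    Mail for a domain the server hosts is accepted from anyone. Mail
    for any other domain counts as relaying, which is only allowed
    after a successful SMTP AUTH.
    """
    domain = recipient.rsplit("@", 1)[-1].lower()
    is_local = domain in {name.lower() for name in servernames}
    return is_local or authenticated


servers = ["mydomain.com"]
# An outsider cannot reach a feeder on a domain the server does not host...
print(accepts_feed_mail("spam@xxx.yyy", servers, authenticated=False))
# ...but an authenticated user can.
print(accepts_feed_mail("spam@xxx.yyy", servers, authenticated=True))
# A feeder address on a hosted domain would be open to everyone.
print(accepts_feed_mail("spam@mydomain.com", servers, authenticated=False))
```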
> 2. I am not sure I fully understand the concept of having both a spam
> and a ham feedback to the Bayesian Analyzer. Spam I can understand,
> that is used to teach the analyzer what is spam. But why have a ham
> feedback? Do the users have to teach the analyzer what is good email
> also??? That seems like an extraordinary burden to place on them.
It is true that the Bayesian filter works best with examples of both
spam and not spam (ham). This is the drawback of this technique. The
other drawback is that the system doesn't discriminate between one user's
view of what is spam and another's. The best you can hope for is a
general consensus. To be honest I wouldn't trust the users to keep the
system up to date. I tend to do all the spam control myself. Don't
forget that each individual user can still have their own anti-spam
tools. Your aim is to keep out the bulk of the spam... you won't be
able to completely eradicate it.
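
A short Python sketch shows why examples of both kinds matter. It uses a
simplified per-token estimate in the spirit of Paul Graham's "A Plan for
Spam"; whether James weights things exactly this way is an assumption,
and the function name and the sample counts are made up for illustration.

```python
def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how spammy a token is from the two training corpora."""
    # Fraction of spam / ham training messages that contained the token.
    bad = spam_counts.get(token, 0) / max(n_spam, 1)
    good = ham_counts.get(token, 0) / max(n_ham, 1)
    if bad + good == 0:
        return 0.5  # never seen: treat as neutral
    return bad / (bad + good)


spam_counts = {"meeting": 1, "pills": 9}

# Trained on spam only: even an ordinary word looks like certain spam.
print(token_spam_probability("meeting", spam_counts, {}, 10, 0))

# After feeding ham too, the same word is judged mostly harmless.
ham_counts = {"meeting": 8}
print(token_spam_probability("meeting", spam_counts, ham_counts, 10, 10))
```

With no ham in the corpus there is nothing to weigh the spam counts
against, so every word the filter has seen at all counts as evidence of
spam.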
> ... So what did I do wrong? Doesn't seem to have worked too well 'out
> of the box'!
The filter works by classifying some email and giving it a score. It
doesn't do anything else to it. You have to set up the pipeline to do
something with messages which score too highly as spam. Initially, I
forwarded all failed emails to the postmaster address so that I could
check them manually and forward them to their owner if they were
mis-classified. These days the filter is so good I simply throw away
anything over a 50% threshold.
> Hopefully someone will help walk me out of these woods, I am kinda
> lost.. Thanks in advance...
It looks like you have done the worst bit already. However, a difficult
bit is deciding on how to process email through your pipeline. If it
helps here is a shortened version of my config.xml file showing the
Bayesian analysis settings I use. Where I have not shown parts of the
file I have marked them with '...' characters. Notice the commented out
section which controls whether messages considered as spam are simply
deleted or sent to the postmaster.
Hope this helps.
Regards,
David Legg
------------------------ config.xml ---------------------------------
...
<config>
  ...
  <spoolmanager>
    <threads> 5 </threads>
    <!-- ROOT PROCESSOR -->
    <processor name="root">
      ...
      <!-- "not spam" bayesian analysis feeder. -->
      <mailet match="RecipientIs=not.spam@xxx.yyy"
              class="BayesianAnalysisFeeder">
        <repositoryPath> db://maildb </repositoryPath>
        <feedType>ham</feedType>
        <maxSize>500000</maxSize>
      </mailet>
      <!-- "spam" bayesian analysis feeder. -->
      <mailet match="RecipientIs=spam@xxx.yyy"
              class="BayesianAnalysisFeeder">
        <repositoryPath> db://maildb </repositoryPath>
        <feedType>spam</feedType>
        <maxSize>500000</maxSize>
      </mailet>
      ...
      <!-- Anti-spam processing -->
      <!-- The following two entries avoid double anti-spam analysis -->
      <!-- for forwarded messages. -->
      <!-- Has spam checking already been done? -->
      <mailet match="HasMailAttribute=spamChecked" class="ToProcessor">
        <processor> transport </processor>
      </mailet>
      <!-- Spam checking will not be done twice -->
      <mailet match="All" class="SetMailAttribute">
        <spamChecked>true</spamChecked>
      </mailet>
      <!-- Messages from authenticated senders are never spam -->
      <mailet match="SMTPAuthSuccessful" class="ToProcessor">
        <processor> transport </processor>
      </mailet>
      ...
      <!-- Anti spam bayesian analysis -->
      <mailet match="All" class="BayesianAnalysis"
              onMailetException="ignore">
        <repositoryPath>db://maildb</repositoryPath>
        <maxSize>3000000</maxSize>
        <headerName>X-MessageIsSpamProbability</headerName>
        <ignoreLocalSender>false</ignoreLocalSender>
      </mailet>
      <mailet
          match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.50"
          class="SetMailAttribute" onMatchException="noMatch">
        <isSpam>true</isSpam>
      </mailet>
      <mailet
          match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.50"
          class="SetMimeHeader" onMatchException="noMatch">
        <name>X-MessageIsSpam</name>
        <value>true</value>
      </mailet>
      <mailet
          match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.50"
          class="ToProcessor" onMatchException="noMatch">
        <processor> spam </processor>
        <notice>Spam not accepted</notice>
      </mailet>
      <!-- Send remaining mails to the transport processor for either
           local or remote delivery -->
      <mailet match="All" class="ToProcessor">
        <processor> transport </processor>
      </mailet>
    </processor>
    ...
    <processor name="transport">
      <mailet match="SMTPAuthSuccessful" class="SetMimeHeader">
        <name>X-UserIsAuth</name>
        <value>true</value>
      </mailet>
      ...
    </processor>
    <processor name="spam">
      <mailet match="All" class="Null"/>
      <!-- To notify the postmaster that a message was marked as
           spam, uncomment this matcher/mailet configuration -->
      <!--
      <mailet match="All" class="NotifyPostmaster"/>
      -->
    </processor>
    ...
  </spoolmanager>
  ...
</config>
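
The order of the checks in the root processor above can be read as a
simple decision function. The Python below is only a model of that
order, written for illustration: the function name route is
hypothetical, the feeder mailets and the elided '...' parts are left
out, and a missing or unparsable score is treated as "no match", in
line with onMatchException="noMatch".

```python
def route(spam_checked, smtp_auth, probability, threshold=0.50):
    """Return the processor a message ends up in after the root processor."""
    if spam_checked:
        # Already analysed once (e.g. a forwarded message): skip the filter.
        return "transport"
    if smtp_auth:
        # Messages from authenticated senders are never treated as spam.
        return "transport"
    if probability is not None and probability > threshold:
        # Scored above the threshold: hand over to the spam processor.
        return "spam"
    # Everything else goes on to local or remote delivery.
    return "transport"


print(route(False, False, 0.99))      # unauthenticated, high score
print(route(False, True, 0.99))       # authenticated sender
print(route(False, False, 3.4e-22))   # low score
```

What happens to a message once it reaches the spam processor is a
separate decision, made by the mailets listed in that processor.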