Posted to server-user@james.apache.org by Marc Chamberlin <ma...@marcchamberlin.com> on 2008/09/20 09:11:23 UTC

Using the Bayesian Analysis mailet

I was recently advised here on this group to enable the Bayesian 
Analysis mailet on the James server in order to help control some spam 
that is getting created and sent to some of my maillists that I sponsor. 
I have tried to understand and follow the documentation on the James 
wiki site but so far have not been able to get it set up and running properly. 
A couple of points in particular have me confused -

1. on the wiki page it says -

"It is a good idea to activate SMTP AUTH and replace thisdomain.com with 
a domain not listed as a server in <servernames> in config.xml: this way 
only authenticated users can feed the corpus. An example of addresses to 
use could be "[MAILTO] ham@bayes.feeder" and "[MAILTO] spam@bayes.feeder". "

My server is already set up to use SMTP AUTH.  I have a single domain 
name that I have purchased from Network Solutions, let's call it 
mydomain.com. I have listed mydomain.com in the <servernames> section of 
the James config.xml file. I want this server to service both internal 
and external users. So what exactly is this suggestion asking me to do? 
Do I need to purchase another domain name in order to run this Bayesian 
Analysis mailet? That does not make sense to me...

I presume (guess) that I can use a more fully qualified address, as I did 
below, which seems, at first glance, to have worked. Here I prefixed my 
domain with the name of the machine on which I am running the James server.

<mailet match="RecipientIs=spam@myhostservername.mydomain.com" class="BayesianAnalysisFeeder">
   <repositoryPath> db://maildb </repositoryPath>
   <feedType>spam</feedType>
   <maxSize>200000</maxSize>
</mailet>

2. I am not sure I fully understand the concept of having both a spam 
and a ham feedback to the Bayesian Analyzer. Spam I can understand, that 
is used to teach the analyzer what is spam. But why have a ham feedback? 
Do the users have to teach the analyzer what is good email also??? That 
seems like an extraordinary burden to place on them.

3. That last question is related to this next one because I haven't got 
this working yet. So I don't yet know what to expect fully when I do get 
it working. I pretty much set up my config.xml with as little change as 
possible. I simply uncommented the two  bayesian analysis feeder mailets 
and modified the RecipientIs parameters as described above. Then I 
uncommented the four bayesian analysis mailets and left them as is. The 
log files showed that a bunch of tables got created in the mysql 
database OK.  Next I went to my Junk folder in my email client and fed 
a few pieces of already collected spam back to the bayesian analysis 
feeder address as mail attachments, just to get it started. James seems 
quite happy to accept them and nothing bounced so I figured it was 
working. I left the mail server running to see how it would behave and 
the trouble is James ate EVERYTHING that came in from outside senders, 
but internal users could send email to each other OK. So we seem to have 
lost a whole lot of email today and I had to turn the Bayesian analyzer off.

So what did I do wrong? Doesn't seem to have worked too well 'out of the 
box'! The documentation seems to be unclear on some of this or just 
plain missing, the only thing I could find was on the wiki pages. 
Nothing in the main documentation. Another question - What does James do 
with the email that it filters out with the Bayesian analyzer? I looked 
in the spam folder in the mail database and nothing was there. Nor did 
the postmaster address receive anything? Do these emails simply go to 
/dev/null? Is there some kind of summary email sent to the users so that 
they can verify/retrieve email if necessary? (perhaps that is the 
purpose of the ham feedback? I am guessing.....)

Hopefully someone will help walk me out of these woods, I am kinda 
lost.. Thanks in advance...

   Marc...








---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


Re: Using the Bayesian Analysis mailet

Posted by David Legg <da...@searchevent.co.uk>.
Hi Marc,

> Occasionally I do find an email that was not really spam and I have to 
> do two things with it, send it back to James's nospam mailet, and 
> forward it on to the user who should have gotten it.  This latter step 
> is a bit of a pain because I usually have to clean up the email (or 
> forward it as is)..
>
> Is there an easy way to send the original version of an email to the 
> user which was accidentally marked as spam? I hope I don't have to 
> keep double checking all this spam, I get thousands daily... How do 
> you and others handle the spam that the Bayesian filter catches? 
> Suggestions welcomed, this could get rather tedious for me...

It will get better quite quickly until you reach the 'maintenance' 
level.  At that point your workload will diminish to the point where you 
only have to mark single digits of emails as spam a day.  My spam 
database has analysed over 16,000 examples of spam in the last 3 years 
and I still get flurries where a new flavour of spam slips through.  I 
guess there are only so many ways to spell that word beginning with 'V'!

After a while I stopped forwarding false positives completely.  This 
sounds bad, but like you, I didn't fancy manually forwarding messages to 
users for the rest of my life.  One thing which I found helps 
enormously is to set up the whitelist manager in James.  This ensures 
that when your users send a message to someone, that person's address is 
added to a 'whitelist'.  If an email is received where the from address 
matches someone in the whitelist then it is let through without spam 
checking.  As with the Bayesian filter, your users can manually add or 
remove addresses from the whitelist by sending special emails to the 
whitelist manager email address.
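
As a rough illustration of the whitelist setup described above, the stock 
config.xml ships with commented-out entries along the following lines. This 
is a sketch from memory rather than a tested configuration: the manager 
address whitelist.manager@xxx.yyy is a made-up placeholder, and the element 
names and subject flags should be checked against the config.xml that came 
with your James version.

```xml
<!-- Sketch only: place both entries in the root processor, before any
     mailet that hands authenticated mail to the transport processor. -->

<!-- Adds the recipients of mail from authenticated (local) senders to the
     sender's whitelist, and handles commands mailed to the manager address.
     The address below is a hypothetical placeholder. -->
<mailet match="SMTPAuthSuccessful" class="WhiteListManager" onMailetException="ignore">
   <repositoryPath> db://maildb </repositoryPath>
   <automaticInsert>true</automaticInsert>
   <whitelistManagerAddress>whitelist.manager@xxx.yyy</whitelistManagerAddress>
   <displayFlag>display whitelist</displayFlag>
   <insertFlag>insert whitelist</insertFlag>
   <removeFlag>remove whitelist</removeFlag>
</mailet>

<!-- Mail from a whitelisted sender skips the Bayesian analysis entirely. -->
<mailet match="IsInWhiteList=db://maildb" class="ToProcessor" onMatchException="noMatch">
   <processor> transport </processor>
</mailet>
```

The IsInWhiteList entry has to come before the BayesianAnalysis mailet in 
the root processor, otherwise the whitelist has no effect on scoring.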

Another alternative you could try until your filter is working well 
enough is to simply raise the spam score threshold to a really high 
figure like 95%.  This will make the filter more lenient and reduce the 
false positives at the expense of letting more spam through.  Be aware 
though that the spam score tends to fluctuate wildly.  You will tend to 
find that most scores are either very very small or very very large and 
only relatively few will have a score between 1% and 99%.
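
To make the threshold change concrete: the X-MessageIsSpamProbability header 
appears to hold a probability between 0 and 1 (the config.xml posted in this 
thread compares it against 0.50), so a 95% threshold would presumably be 
written as 0.95. A minimal sketch of one such entry, which assumes the same 
matcher and header name as that config.xml:

```xml
<!-- Sketch only: route a message to the spam processor when its score
     exceeds 0.95 instead of 0.50. Header name and matcher are taken from
     the config.xml in this thread; the 0.95 value is illustrative. -->
<mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.95"
        class="ToProcessor" onMatchException="noMatch">
   <processor> spam </processor>
   <notice>Spam not accepted</notice>
</mailet>
```

If several entries test the same threshold (for example one that sets an 
attribute, one that adds a header and one that routes the message), they 
would all need the same new value to stay consistent.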

If you used my config.xml settings you should find that all emails 
passing through your server have a header like this: -

  X-MessageIsSpamProbability: 3.455076140932531E-22


You can use this to satisfy your curiosity about any email's score.

Another thing I like to do sometimes to gauge how effectively the spam 
filter is working is to search the daily mailet log file and count the 
number of lines containing '%;'.  This is because every analysis result 
is recorded like the following: -

  29/03/08 17:59:41 INFO  James.Mailet: BayesianAnalysis: X-MessageIsSpamProbability: 100%; From: extrmtao@goline.ca; Recipient(s): [someone@localhost]

So a command like: -

  egrep '100%;' mailet-2008-09-27-17-14.log | wc -l

will give you an indication of how many emails have been rejected that 
day.  If you do that over several months it gives you an idea how fast 
spam levels are rising!

Regards,
David Legg



Re: Using the Bayesian Analysis mailet

Posted by Marc Chamberlin <ma...@marcchamberlin.com>.
Thanks David for all your help! I really appreciated it. I think I have 
the Bayesian filter running fine now, and it does seem to be getting 
better at sorting out spam from the good stuff. Especially as I send it 
feedback on what it misses as spam and what it mistakenly thinks is 
spam. I have set up James to send to me, wearing my "postmaster" hat, 
everything that the Bayesian filter thinks is spam. Occasionally I do 
find an email that was not really spam and I have to do two things with 
it, send it back to James's nospam mailet, and forward it on to the user 
who should have gotten it.  This latter step is a bit of a pain because 
I usually have to clean up the email (or forward it as is)..

Is there an easy way to send the original version of an email to the 
user which was accidentally marked as spam? I hope I don't have to keep 
double checking all this spam, I get thousands daily... How do you and 
others handle the spam that the Bayesian filter catches? Suggestions 
welcomed, this could get rather tedious for me...

    Marc...






Re: Using the Bayesian Analysis mailet

Posted by David Legg <da...@searchevent.co.uk>.
Hi Marc,

> My server is already set up to use SMTP AUTH.  I have a single domain 
> name that I have purchased from Network Solutions, lets call it 
> mydomain.com. I have listed mydomain.com in the <servernames> section 
> of the James config.xml file. I want this server to service both 
> internal and external users. So what exactly is this suggestion asking 
> me to do? Do I need to purchase another domain name in order to run 
> this Bayesian Analysis mailet? That does not make sense to me...

No, there is no need to purchase another domain.  The instructions are 
suggesting that you choose email addresses which are impossible for 
outsiders to use in order to tell the server what is ham and what is 
spam.  For example, in my server I literally chose 'spam@xxx.yyy' and 
'not.spam@xxx.yyy'.  Not only are these unlikely to be real domains but 
anyone trying to send emails addressed to these addresses will be 
challenged for an SMTP password.  Thus only authorized users will be 
able to train the Bayesian analysis filter.

> 2. I am not sure I fully understand the concept of having both a spam 
> and a ham feedback to the Bayesian Analyzer. Spam I can understand, 
> that is used to teach the analyzer what is spam. But why have a ham 
> feedback? Do the users have to teach the analyzer what is good email 
> also??? That seems like an extraordinary burden to place on them.

It is true that the Bayesian filter works best with examples of both 
spam and not spam (ham).  This is the drawback of this technique.  The 
other drawback is that the system doesn't discriminate between one user's 
view of what is spam and another's.  The best you can hope for is a 
general consensus.  To be honest I wouldn't trust the users to keep the 
system up to date.  I tend to do all the spam control myself.  Don't 
forget that each individual user can still have their own anti-spam 
tools.  Your aim is to keep out the bulk of the spam... you won't be 
able to completely eradicate it.

> ... So what did I do wrong? Doesn't seem to have worked too well 'out 
> of the box'!

The filter works by classifying some email and giving it a score.  It 
doesn't do anything else to it.  You have to set up the pipeline to do 
something with messages which score too highly as spam.  Initially, I 
forwarded all failed emails to the postmaster address so that I could 
check them manually and forward them to their owner if they were 
mis-classified.  These days the filter is so good I simply throw away 
anything over a 50% threshold.

> Hopefully someone will help walk me out of these woods, I am kinda 
> lost.. Thanks in advance...

It looks like you have done the worst bit already.  However a difficult 
bit is deciding on how to process email through your pipeline.  If it 
helps here is a shortened version of my config.xml file showing the 
Bayesian analysis settings I use.  Where I have not shown parts of the 
file I have marked them with '...' characters.  Notice the commented out 
section which controls whether messages considered as spam are simply 
deleted or sent to the postmaster.

Hope this helps.

Regards,
David Legg

------------------------ config.xml ---------------------------------

...
<config>
...
   <spoolmanager>
      <threads> 5 </threads>

      <!-- ROOT PROCESSOR -->
      <processor name="root">
...
         <!-- "not spam" bayesian analysis feeder. -->
         <mailet match="RecipientIs=not.spam@xxx.yyy" class="BayesianAnalysisFeeder">
            <repositoryPath> db://maildb </repositoryPath>
            <feedType>ham</feedType>
            <maxSize>500000</maxSize>
         </mailet>
    
         <!-- "spam" bayesian analysis feeder. -->
         <mailet match="RecipientIs=spam@xxx.yyy" class="BayesianAnalysisFeeder">
            <repositoryPath> db://maildb </repositoryPath>
            <feedType>spam</feedType>
            <maxSize>500000</maxSize>
         </mailet>
...
         <!-- Anti-spam processing -->
         <!-- The following two entries avoid double anti-spam analysis -->
         <!-- for forwarded messages. -->
         <!-- Has spam checking already been done? -->
         <mailet match="HasMailAttribute=spamChecked" class="ToProcessor">
            <processor> transport </processor>
         </mailet>
         <!-- Spam checking will not be done twice -->
         <mailet match="All" class="SetMailAttribute">
            <spamChecked>true</spamChecked>
         </mailet>

         <!-- Messages from authenticated senders are never spam -->
         <mailet match="SMTPAuthSuccessful" class="ToProcessor">
            <processor> transport </processor>
         </mailet>
...       
         <!-- Anti spam bayesian analysis -->
         <mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
            <repositoryPath>db://maildb</repositoryPath>
            <maxSize>3000000</maxSize>
            <headerName>X-MessageIsSpamProbability</headerName>
            <ignoreLocalSender>false</ignoreLocalSender>
         </mailet>

         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.50"
                 class="SetMailAttribute" onMatchException="noMatch">
            <isSpam>true</isSpam>
         </mailet>

         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.50"
                 class="SetMimeHeader" onMatchException="noMatch">
            <name>X-MessageIsSpam</name>
            <value>true</value>
         </mailet>

         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.50"
                 class="ToProcessor" onMatchException="noMatch">
            <processor> spam </processor>
            <notice>Spam not accepted</notice>
         </mailet>

         <!-- Send remaining mails to the transport processor for either -->
         <!-- local or remote delivery -->
         <mailet match="All" class="ToProcessor">
            <processor> transport </processor>
         </mailet>
      </processor>
...
      <processor name="transport">
         <mailet match="SMTPAuthSuccessful" class="SetMimeHeader">
            <name>X-UserIsAuth</name>
            <value>true</value>
         </mailet>
...
      </processor>

      <processor name="spam">
         <mailet match="All" class="Null"/>
         <!-- To notify the postmaster that a message was marked as spam, -->
         <!-- uncomment this matcher/mailet configuration -->
         <!--
         <mailet match="All" class="NotifyPostmaster"/>
         -->
      </processor>
...
   </spoolmanager>
...
</config>
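
As a hedged sketch of the "send to the postmaster" variant of the spam 
processor above: as far as I understand James processing, mailets in a 
processor run in order and the Null mailet ends processing for a message, so 
a NotifyPostmaster entry would need to sit before Null to see the mail at 
all. The ordering below rests on that assumption, and on NotifyPostmaster 
passing the message on to the next mailet by default; please verify both 
against the mailet documentation for your James version.

```xml
<processor name="spam">
   <!-- Assumption: NotifyPostmaster sends its notice and then lets the
        original message continue to the next mailet. -->
   <mailet match="All" class="NotifyPostmaster"/>

   <!-- Discard the original once the postmaster has been notified. -->
   <mailet match="All" class="Null"/>
</processor>
```

Dropping the Null entry instead would let the flagged message carry on 
through whatever follows in the processor, which may or may not be what you 
want while the filter is still being trained.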

