Posted to users@spamassassin.apache.org by Nicola Piazzi <Ni...@gruppocomet.it> on 2017/01/04 09:58:08 UTC

learn ham

I found it useful to put a little script like this in cron.
Every minute, cron launches the script, which reads the messages of the last minute from the maillog database.
It then looks up each message in the filesystem and learns it as ham.
That way, words that come from our company are classified correctly when someone resends them.
In this example I use the IP of my Exchange server to select the ham, but it could be anything.


#!/bin/bash
# learn.local.ham.sh
# Learns HAM from messages sent from the internal network in the last minute
# Run from cron every minute:
# * * * * * /batch/learn.local.ham.sh

# Variables
Q="/var/spool/MailScanner/quarantine"           # Quarantine folder
L="/usr/bin/sa-learn --ham --no-sync"           # Message learn command


# START

# Ids of the messages received from the Exchange server in the last minute
vsql="SELECT id FROM maillog WHERE clientip = '10.1.1.126' AND timestamp > DATE_SUB(now(), INTERVAL 1 MINUTE);"
m=( $( echo "$vsql" | mysql -N -u root -p<mypwd> -D mailscanner ) )


# Scan array and learn ham
for i in "${m[@]}"; do
  echo "$i"
  # Locate the stored message file whose name is the message id
  ii=$(find "$Q" -type f -name "$i")
  if [ -n "$ii" ] ; then
    echo "$ii"
    # Left unquoted on purpose: $L holds the command plus its options,
    # and $ii may list more than one file
    $L $ii
  fi
done


Nicola Piazzi
CED - Sistemi
COMET s.p.a.
Via Michelino, 105 - 40127 Bologna - Italia
Tel.  +39 051.6079.293
Cell. +39 328.21.73.470
Web: www.gruppocomet.it


Re: R: learn ham

Posted by John Hardin <jh...@impsec.org>.
On Thu, 5 Jan 2017, Nicola Piazzi wrote:

> Each minute it learns the messages of the last minute, so each message is read and learned only once

There is a certain amount of overhead involved in reading the mailbox and 
processing messages even if they have already been learned...

> These are messages sent from the internal network, so it learns that their words are not spam
>
> Internal messages are not spam

...until you get infected by a spambot.

Bayes training should be manually reviewed. Blind training is fragile and 
invites the system to go badly off the rails when for some reason it makes 
a poor decision that is self-reinforcing.



-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Individual liberties are always "loopholes" to absolute authority.
-----------------------------------------------------------------------
  381 days since the first successful real return to launch site (SpaceX)

Re: learn ham

Posted by Shawn Bakhtiar <sh...@hotmail.com>.
> On Jan 5, 2017, at 8:54 AM, Dave Funk <db...@engineering.uiowa.edu> wrote:
> 
> On Thu, 5 Jan 2017, Nicola Piazzi wrote:
> 
>> Each minute it learns the messages of the last minute, so each message is read and learned only once
>> These are messages sent from the internal network, so it learns that their words are not spam
>> 
>> Internal messages are not spam
> 
> Until one of your users gets their account hacked/phished and spammers then use it to abuse your server to send out megabytes of spam.
> (or they may have had an account on Yahoo that used the same password).
> 
> Careless users happen to the best of us. ;(
> 
> John's point is still valid; blind un-vetted automated Bayes learning is asking for trouble.

I would have to agree and reinforce the message here: automated learning of SPAM/HAM is not a good idea. I have users who drop emails THEY HAVE SUBSCRIBED TO (and forgotten about) into their SPAM folder, and I would argue those are NOT SPAM. They actually contain a LOT of industry-standard nomenclature whose tokens, if trained as SPAM, would not necessarily be valid spam indicators.

Think about it: the best machine to tell whether something is SPAM or not is the human machine. Learning, in this regard, is telling SA that emails like this one, which I have specifically identified as SPAM, are the ones it should look out for. It does not, in and of itself, make a judgement call on what is or is not SPAM. You need to do that.

Keep teaching and pretty soon everything is in every pool (there is such a thing as knowing too much, so much so that you are left indecisive and perplexed at even the simplest problem). I think it's far better to have a smaller pool of tokens keyed with precision than a lot of tokens that, frankly, could go either way.





Re: R: learn ham

Posted by Dave Funk <db...@engineering.uiowa.edu>.
On Thu, 5 Jan 2017, Nicola Piazzi wrote:

> Each minute it learns the messages of the last minute, so each message is read and learned only once
> These are messages sent from the internal network, so it learns that their words are not spam
>
> Internal messages are not spam

Until one of your users gets their account hacked/phished and spammers 
then use it to abuse your server to send out megabytes of spam.
(or they may have had an account on Yahoo that used the same password).

Careless users happen to the best of us. ;(

John's point is still valid; blind un-vetted automated Bayes learning is 
asking for trouble.

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: R: learn ham

Posted by Bill Shirley <bi...@philly.polymerindustries.biz>.

On 1/6/2017 6:36 AM, Marc Stürmer wrote:
> On 05.01.2017 at 17:38, Nicola Piazzi wrote:
>
>> Each minute it learns the messages of the last minute, so each message is read and learned only once
>> These are messages sent from the internal network, so it learns that their words are not spam
>>
>> Internal messages are not spam
>
> You'll never know if internal messages are ever spam or not; your script is the best way to
>
> a) poison your bayes database through unsupervised user interaction and
> b) put unnecessary load on your server.
>
> Bayes is just one of the many factors Spamassassin takes into account for computing the spam score.
>
> Usually you would first train your Spamassassin on a good ham and spam corpus with enough messages that after that 
> Spamassassin's autolearn feature is enabled - I guess the threshold is 200 messages for each category before autolearn will 
> start to work.
>
> After that you would normally only train Spamassassin on its errors from time to time, nothing more, nothing less, maybe once
> a week or so, because a well trained and maintained Spamassassin behaves well enough and doesn't need more maintenance than
> this. You can even make it more comfortable by using stuff like the antispam plugin from Dovecot if you want to.
>
> BTW, this line in the script is a security nightmare:
>
> mysql -N -u root -p<mypwd> -D mailscanner
>
> This means that any user on the machine can read the root password by using ps, e.g. "ps max | less",
> while it's running. Not good at all.
>
Set up /root/.my.cnf:
[client]
user=root
password=SuperDuperSecret


Then use the HOME trick (from my crontab):
HOME=/root mysqldump --opt --order-by-primary -f -R -r /home/webmaster/mysql.backups/pkg_phpmyadmin.sql pkg_phpmyadmin

Bill
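
Applied to the script from the first post, the option-file idea might look like the sketch below. It is only an illustration: the function names write_client_cnf and recent_ids are hypothetical and do not come from the thread. Instead of the HOME trick it uses the mysql client's --defaults-extra-file option (which has to be the first option on the command line), so the password never appears in the process list.

```shell
#!/bin/bash
# Sketch only: function names are illustrative assumptions.

# Write a MySQL client option file that only its owner can read.
# The file is recreated under umask 077, so it is mode 600 from the start.
write_client_cnf() {
    local file="$1" user="$2" password="$3"
    (
        umask 077
        rm -f -- "$file"
        printf '[client]\nuser=%s\npassword=%s\n' "$user" "$password" > "$file"
    )
}

# Print the ids of the messages received from one client IP in the last
# minute. Credentials come from the option file, not from the command line.
recent_ids() {
    local cnf="$1" clientip="$2"
    mysql --defaults-extra-file="$cnf" -N -D mailscanner \
        -e "SELECT id FROM maillog WHERE clientip = '$clientip' AND timestamp > DATE_SUB(now(), INTERVAL 1 MINUTE);"
}
```

In practice the option file would be created once, by hand or with write_client_cnf, and the cron script would only call something like: recent_ids /root/.my.cnf 10.1.1.126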


Re: R: learn ham

Posted by Marc Stürmer <ma...@marc-stuermer.de>.
On 05.01.2017 at 17:38, Nicola Piazzi wrote:

> Each minute it learns the messages of the last minute, so each message is read and learned only once
> These are messages sent from the internal network, so it learns that their words are not spam
>
> Internal messages are not spam

You'll never know if internal messages are ever spam or not; your script 
is the best way to

a) poison your bayes database through unsupervised user interaction and
b) put unnecessary load on your server.

Bayes is just one of the many factors Spamassassin takes into account 
for computing the spam score.

Usually you would first train your Spamassassin on a good ham and spam 
corpus with enough messages that after that Spamassassin's autolearn 
feature is enabled - I guess the threshold is 200 messages for each 
category before autolearn will start to work.
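
One way to see how far a Bayes database is from such a threshold is to read the message counters that "sa-learn --dump magic" prints. The sketch below assumes the usual layout of that output, where the third column of the lines ending in nspam and nham holds the counts. The function name bayes_counts_ok is hypothetical, and the default minimum of 200 is taken from the defaults of SpamAssassin's bayes_min_ham_num and bayes_min_spam_num settings, which control when the Bayes classifier starts contributing to the score.

```shell
#!/bin/bash
# Sketch only: bayes_counts_ok is a hypothetical helper name.

# Read "sa-learn --dump magic" output on stdin, print both counters, and
# succeed only if ham and spam counts have both reached the given minimum
# (default 200).
bayes_counts_ok() {
    local min="${1:-200}"
    awk -v min="$min" '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nham=%d nspam=%d\n", nham, nspam
            exit !(nham >= min && nspam >= min)
        }'
}
```

A possible use: sa-learn --dump magic | bayes_counts_ok && echo "enough training data"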

After that you would normally only train Spamassassin on its errors from
time to time, nothing more, nothing less, maybe once a week or so,
because a well trained and maintained Spamassassin behaves well enough
and doesn't need more maintenance than this. You can even make it more
comfortable by using stuff like the antispam plugin from Dovecot if you
want to.

BTW, this line in the script is a security nightmare:

mysql -N -u root -p<mypwd> -D mailscanner

This means that any user on the machine can read the root
password by using ps, e.g. "ps max | less", while it's
running. Not good at all.


R: learn ham

Posted by Nicola Piazzi <Ni...@gruppocomet.it>.
Each minute it learns the messages of the last minute, so each message is read and learned only once
These are messages sent from the internal network, so it learns that their words are not spam

Internal messages are not spam




Re: learn ham

Posted by John Hardin <jh...@impsec.org>.
On Thu, 5 Jan 2017, Marc Stürmer wrote:

> On 2017-01-04 10:58, Nicola Piazzi wrote:
>
>>  I found it useful to put a little script like this in cron.
>>
>>  Every minute, cron launches the script, which reads the messages of the
>>  last minute from the maillog database.
>
> What's the purpose of this script, what's the reasoning behind running this 
> thingie every minute?
>
> What you do is training the Bayes filter every minute. Training a filter is 
> something which should never be done unattended, but always supervised, 
> because if not you will get bad results over time.

The execution of the training program can safely be automated, though I'd 
agree once per minute is a bit excessive. The classification of messages 
into the folders that are trained from is what needs manual supervision.
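
A minimal sketch of that split, with hypothetical names throughout: a person files messages into reviewed ham and spam folders, and cron only runs the training over them. The SA_LEARN variable and the function learn_reviewed are assumptions made for illustration; --ham, --spam, --no-sync and --sync are standard sa-learn options.

```shell
#!/bin/bash
# Sketch only: names are illustrative assumptions.

# Command used for training; overridable so the function can be dry-run.
SA_LEARN="${SA_LEARN:-/usr/bin/sa-learn}"

# Train from two folders that a human has already reviewed, then sync the
# Bayes journal once at the end instead of after every call.
learn_reviewed() {
    local ham_dir="$1" spam_dir="$2"
    if [ -d "$ham_dir" ]; then
        "$SA_LEARN" --ham --no-sync "$ham_dir"
    fi
    if [ -d "$spam_dir" ]; then
        "$SA_LEARN" --spam --no-sync "$spam_dir"
    fi
    "$SA_LEARN" --sync
}
```

Called from a cron job at a relaxed interval (once a day, say), this keeps the execution automated while the decision about what counts as ham or spam stays with whoever fills the two folders.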


Re: learn ham

Posted by Marc Stürmer <ma...@marc-stuermer.de>.
On 2017-01-04 10:58, Nicola Piazzi wrote:

> I found it useful to put a little script like this in cron.
>
> Every minute, cron launches the script, which reads the messages of the
> last minute from the maillog database.

What's the purpose of this script, what's the reasoning behind running 
this thingie every minute?

What you do is training the Bayes filter every minute. Training a filter 
is something which should never be done unattended, but always 
supervised, because if not you will get bad results over time.

There's autolearn in Spamassassin: if it's confident enough that a
message is ham, it will act accordingly. Normally this is enough once you
have trained your initial ham corpus properly.

BTW: saving your root password for MySQL in a cronjob script is a very 
bad idea, too.