Posted to users@spamassassin.apache.org by Nicola Piazzi <Ni...@gruppocomet.it> on 2017/01/04 09:58:08 UTC
learn ham
I find it useful to run a small script like this from cron.
Every minute, cron launches the script, which reads from the maillog database the messages received in the last minute.
It then looks up each message in the filesystem and learns it as ham.
That way the words our company uses are classified correctly when someone sends them again.
In this example I use the IP of my Exchange server to select the ham to learn, but it can be anything.
#!/bin/bash
# learn.local.ham.sh
# Learns HAM from messages sent from the internal network in the last minute
# Put in cron every 1 minute
# * * * * * /batch/learn.local.ham.sh
# Variables
Q="/var/spool/MailScanner/quarantine" # Quarantine folder
L="/usr/bin/sa-learn --ham --no-sync" # Message learn command
# START
vsql="SELECT id FROM maillog WHERE clientip = '10.1.1.126' AND timestamp > DATE_SUB(now(), INTERVAL 1 MINUTE);"
# <mypwd> is a placeholder for the real password
m=( $( echo "$vsql" | mysql -N -u root -p<mypwd> -D mailscanner ) )
# Scan array and learn ham
for i in "${m[@]}"; do
echo "$i"
ii=$(find "$Q" -type f -name "$i")
# Learn only if find located a file for this id
if [ -n "$ii" ] ; then
echo "$ii"
# $L and $ii are left unquoted so the shell splits them into separate arguments
$L $ii
fi
done
Nicola Piazzi
CED - Sistemi
COMET s.p.a.
Via Michelino, 105 - 40127 Bologna - Italia
Tel. +39 051.6079.293
Cell. +39 328.21.73.470
Web: www.gruppocomet.it
Re: R: learn ham
Posted by John Hardin <jh...@impsec.org>.
On Thu, 5 Jan 2017, Nicola Piazzi wrote:
> Each minute it learns the messages of the last minute, so it reads and learns each message only once
There is a certain amount of overhead involved in reading the mailbox and
processing messages even if they have already been learned...
> The messages are the ones sent from inside, so it learns that those words are not spam
>
> Internal messages are not spam
...until you get infected by a spambot.
Bayes training should be manually reviewed. Blind training is fragile and
invites the system to go badly off the rails when for some reason it makes
a poor decision that is self-reinforcing.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Individual liberties are always "loopholes" to absolute authority.
-----------------------------------------------------------------------
381 days since the first successful real return to launch site (SpaceX)
Re: learn ham
Posted by Shawn Bakhtiar <sh...@hotmail.com>.
> On Jan 5, 2017, at 8:54 AM, Dave Funk <db...@engineering.uiowa.edu> wrote:
>
> On Thu, 5 Jan 2017, Nicola Piazzi wrote:
>
>> Each minute it learns the messages of the last minute, so it reads and learns each message only once
>> The messages are the ones sent from inside, so it learns that those words are not spam
>>
>> Internal messages are not spam
>
> Until one of your users gets their account hacked/phished and spammers then use it to abuse your server to send out megabytes of spam.
> (or they may have had an account on Yahoo that used the same password).
>
> Careless users happen to the best of us. ;(
>
> John's point is still valid; blind un-vetted automated Bayes learning is asking for trouble.
I would have to agree and reinforce the message here: automated learning of SPAM/HAM is not a good idea. I have users dropping emails THEY HAVE SUBSCRIBED TO, and forgotten they did so, into their SPAM folder, and I would argue those are NOT SPAM. They actually contain a LOT of industry-standard nomenclature whose tokens would not be valid spam indicators if trained as SPAM.
Think about it: the best machine to tell whether something is SPAM or not is the human machine. Learning in this regard is telling SA, "emails like this one, which I have specifically identified as SPAM, are the ones you should look out for." It does not, in and of itself, make a judgement call on what is or is not SPAM. You need to do that.
Keep teaching and pretty soon everything is in every pool (there is such a thing as knowing too much, so much so that you are left indecisive and perplexed at even the simplest problem). I think it's far better to have a smaller pool of tokens keyed with precision than a lot of tokens that, well, frankly could go either way.
Re: R: learn ham
Posted by Dave Funk <db...@engineering.uiowa.edu>.
On Thu, 5 Jan 2017, Nicola Piazzi wrote:
> Each minute it learns the messages of the last minute, so it reads and learns each message only once
> The messages are the ones sent from inside, so it learns that those words are not spam
>
> Internal messages are not spam
Until one of your users gets their account hacked/phished and spammers
then use it to abuse your server to send out megabytes of spam.
(or they may have had an account on Yahoo that used the same password).
Careless users happen to the best of us. ;(
John's point is still valid; blind un-vetted automated Bayes learning is
asking for trouble.
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: R: learn ham
Posted by Bill Shirley <bi...@philly.polymerindustries.biz>.
On 1/6/2017 6:36 AM, Marc Stürmer wrote:
> Am 05.01.2017 um 17:38 schrieb Nicola Piazzi:
>
>> Each minute it learns the messages of the last minute, so it reads and learns each message only once
>> The messages are the ones sent from inside, so it learns that those words are not spam
>>
>> Internal messages are not spam
>
> You'll never know if internal messages are ever spam or not; your script is the best way to
>
> a) poison your bayes database through unsupervised user interaction and
> b) put unnecessary load on your server.
>
> Bayes is just one of the many factors Spamassassin takes into account for computing the spam score.
>
> Usually you would first train your Spamassassin on a good ham and spam corpus with enough messages that after that
> Spamassassin's autolearn feature is enabled - I guess the threshold is 200 messages for each category before autolearn will
> start to work.
>
> After that you would normally only train Spamassassin on its errors from time to time, nothing more, nothing less, maybe once
> a week or so, because a well-trained and maintained Spamassassin behaves well enough and doesn't need more maintenance than
> this. You can even make it more comfortable by using stuff like the antispam plugin from Dovecot if you want to.
>
> BTW, this line in the script is a security nightmare:
>
> mysql -N -u root -p<mypwd> -D mailscanner
>
> This means that any user on the machine can read the root password simply by using ps, e.g. "ps max | less",
> while it's running. Not good at all.
>
Set up /root/.my.cnf:
[client]
user=root
password=SuperDuperSecret
Then use the HOME trick (from my crontab):
HOME=/root mysqldump --opt --order-by-primary -f -R -r /home/webmaster/mysql.backups/pkg_phpmyadmin.sql pkg_phpmyadmin
Bill
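The same approach could be applied to the query in the original script. The following is a minimal sketch, assuming a /root/.my.cnf like the one above; `fetch_recent_ids`, `MYSQL_BIN` and `CNF_HOME` are names made up for illustration, and the table and column names are taken from the script at the start of the thread.

```shell
#!/bin/sh
# Sketch only: fetch_recent_ids, MYSQL_BIN and CNF_HOME are illustrative
# names, not part of MailScanner or MySQL.
MYSQL_BIN=${MYSQL_BIN:-mysql}   # client binary; overridable for testing
CNF_HOME=${CNF_HOME:-/root}     # directory that holds the .my.cnf file

# Print the ids of messages logged for one client IP in the last minute.
fetch_recent_ids() {
    sql="SELECT id FROM maillog WHERE clientip = '$1' AND timestamp > DATE_SUB(now(), INTERVAL 1 MINUTE);"
    # No -u or -p on the command line: the client reads user and password
    # from the [client] section of $HOME/.my.cnf, so the password never
    # appears in the process list.
    env HOME="$CNF_HOME" "$MYSQL_BIN" -N -D mailscanner -e "$sql"
}
```

Calling `fetch_recent_ids 10.1.1.126` would then stand in for the `echo $vsql | mysql ...` pipeline of the original script.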
Re: R: learn ham
Posted by Marc Stürmer <ma...@marc-stuermer.de>.
Am 05.01.2017 um 17:38 schrieb Nicola Piazzi:
> Each minute it learns the messages of the last minute, so it reads and learns each message only once
> The messages are the ones sent from inside, so it learns that those words are not spam
>
> Internal messages are not spam
You'll never know if internal messages are ever spam or not; your script
is the best way to
a) poison your bayes database through unsupervised user interaction and
b) put unnecessary load on your server.
Bayes is just one of the many factors Spamassassin takes into account
for computing the spam score.
Usually you would first train your Spamassassin on a good ham and spam
corpus with enough messages that after that Spamassassin's autolearn
feature is enabled - I guess the threshold is 200 messages for each
category before autolearn will start to work.
After that you would normally only train Spamassassin on its errors from
time to time, nothing more, nothing less, maybe once a week or so,
because a well-trained and maintained Spamassassin behaves well enough
and doesn't need more maintenance than this. You can even make it more
comfortable by using stuff like the antispam plugin from Dovecot if you
want to.
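Whether the database has seen enough mail yet can be read from the counters that `sa-learn --dump magic` prints. The sketch below is one possible way to check them, not an official tool: `bayes_count` and `bayes_ready` are hypothetical helper names, the parsing assumes the usual dump layout in which the count is the third column of the `nham` and `nspam` lines, and 200 is the documented default of `bayes_min_ham_num` and `bayes_min_spam_num`, the point from which SpamAssassin uses Bayes results at all.

```shell
#!/bin/sh
# Sketch only: bayes_count and bayes_ready are illustrative helpers.
# They work on `sa-learn --dump magic` output, assuming lines such as
#   0.000          0       1234          0  non-token data: nham

# Read dump text on stdin and print the counter named $1 (nham or nspam).
# Prints 0 if the counter is missing.
bayes_count() {
    awk -v key="$1" '$NF == key && /non-token data:/ { n = $3 } END { print n + 0 }'
}

# Succeed when both counters in the dump text $1 reach the minimum $2
# (default 200).
bayes_ready() {
    min=${2:-200}
    ham=$(printf '%s\n' "$1" | bayes_count nham)
    spam=$(printf '%s\n' "$1" | bayes_count nspam)
    [ "$ham" -ge "$min" ] && [ "$spam" -ge "$min" ]
}
```

A possible use: `bayes_ready "$(sa-learn --dump magic)" || echo "Bayes still needs more training"`.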
BTW, this line in the script is a security nightmare:
mysql -N -u root -p<mypwd> -D mailscanner
This means that any user on the machine can read the root
password simply by using ps, e.g. "ps max | less", while it's
running. Not good at all.
R: learn ham
Posted by Nicola Piazzi <Ni...@gruppocomet.it>.
Each minute it learns the messages of the last minute, so it reads and learns each message only once
The messages are the ones sent from inside, so it learns that those words are not spam
Internal messages are not spam
Nicola Piazzi
CED - Sistemi
COMET s.p.a.
Via Michelino, 105 - 40127 Bologna - Italia
Tel. +39 051.6079.293
Cell. +39 328.21.73.470
Web: www.gruppocomet.it
-----Original message-----
From: John Hardin [mailto:jhardin@impsec.org]
Sent: Thursday, 5 January 2017 17:35
To: users@spamassassin.apache.org
Subject: Re: learn ham
On Thu, 5 Jan 2017, Marc Stürmer wrote:
> Am 2017-01-04 10:58, schrieb Nicola Piazzi:
>
>> I find it useful to run a small script like this from cron.
>>
>> Every minute, cron launches the script, which reads from the maillog
>> database the messages received in the last minute.
>
> What's the purpose of this script, what's the reasoning behind running
> this thingie every minute?
>
> What you do is training the Bayes filter every minute. Training a
> filter is something which should never be done unattended, but always
> supervised, because if not you will get bad results over time.
The execution of the training program can safely be automated, though I'd agree once per minute is a bit excessive. The classification of messages into the folders that are trained from is what needs manual supervision.
Re: learn ham
Posted by John Hardin <jh...@impsec.org>.
On Thu, 5 Jan 2017, Marc Stürmer wrote:
> Am 2017-01-04 10:58, schrieb Nicola Piazzi:
>
>> I find it useful to run a small script like this from cron.
>>
>> Every minute, cron launches the script, which reads from the maillog
>> database the messages received in the last minute.
>
> What's the purpose of this script, what's the reasoning behind running this
> thingie every minute?
>
> What you do is training the Bayes filter every minute. Training a filter is
> something which should never be done unattended, but always supervised,
> because if not you will get bad results over time.
The execution of the training program can safely be automated, though I'd
agree once per minute is a bit excessive. The classification of messages
into the folders that are trained from is what needs manual supervision.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Individual liberties are always "loopholes" to absolute authority.
-----------------------------------------------------------------------
381 days since the first successful real return to launch site (SpaceX)
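The distinction drawn above, automating the training run while keeping the classification manual, can be sketched as a job that only ever reads folders a person has filled. This is an illustration under assumptions, not a recommended setup: `SA_LEARN`, `train_reviewed` and the folder paths in the usage note are invented for the example, and it presumes the reviewed messages are stored one per file.

```shell
#!/bin/sh
# Sketch only: SA_LEARN, train_reviewed and the folder layout are assumptions.
SA_LEARN=${SA_LEARN:-/usr/bin/sa-learn}

# Train from two folders that a human has sorted, then sync once at the end.
# $1 = folder of reviewed ham, $2 = folder of reviewed spam.
train_reviewed() {
    ham_dir=$1
    spam_dir=$2
    # --no-sync defers the database sync so it happens once, not per folder.
    [ -d "$ham_dir" ]  && "$SA_LEARN" --ham  --no-sync "$ham_dir"
    [ -d "$spam_dir" ] && "$SA_LEARN" --spam --no-sync "$spam_dir"
    "$SA_LEARN" --sync
}
```

A script that ends with `train_reviewed /srv/reviewed/ham /srv/reviewed/spam` (hypothetical paths) could then be run from cron at whatever interval suits the site; nothing is learned unless someone has placed it in one of those folders.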
Re: learn ham
Posted by Marc Stürmer <ma...@marc-stuermer.de>.
Am 2017-01-04 10:58, schrieb Nicola Piazzi:
> I find it useful to run a small script like this from cron.
>
> Every minute, cron launches the script, which reads from the maillog
> database the messages received in the last minute.
What's the purpose of this script, what's the reasoning behind running
this thingie every minute?
What you do is training the Bayes filter every minute. Training a filter
is something which should never be done unattended, but always
supervised, because if not you will get bad results over time.
There's autolearn in Spamassassin: if it's confident enough that a
message is ham, it will act accordingly. Normally this is enough once you
have trained your initial ham corpus properly.
BTW: saving your root password for MySQL in a cronjob script is a very
bad idea, too.