Posted to users@spamassassin.apache.org by Ronan <r....@qub.ac.uk> on 2004/11/24 15:22:21 UTC

sa-learn ham

hi all.
For those of you running large-volume servers, you no doubt have an
abundance of spam to feed into sa-learn, and I suppose that goes for all
sizes of volumes.
But one question: how do you manage to match that number with hams /
real messages? How do you go about bumping up the numbers to even out the
DB? Am I right in saying that basically any mail that's not spam is ham, or
is ham only supposed to be mail that is a false positive, i.e. has been
tagged but isn't really spam?
Here at the university there are 3 admins who, if they wanted, could read
other people's email... data protection blah blah, but it's simply a side
effect of administering the systems.

Putting a random selection of users' HAM emails (which could be, and
unsurprisingly are, personal) into the filter to balance the DB could be
contentious - but it's the only way to get a good selection of emails.

As I said, there are only the 3 of us, but we have around 40,000 mailboxes,
and 3 isn't really a good representation, in terms of the quality of emails,
for feeding ham into sa-learn. Aside from opening up a mailbox to pleb
users and creating more havoc, what are the recommended ways of getting
around this?


thanks

ronan

-- 
Regards

Ronan McGlue
==============
Analyst/Programmer
Information Services
Queens University Belfast
BT7 1NN
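(One way to build a ham corpus without reading other users' private mail is to train only from mailboxes the admins already own, or from folders users have explicitly donated. A minimal sketch - the sa-learn switches are real, but every path below is a hypothetical placeholder to adjust for your spool layout:)

```shell
# Sketch: train Bayes only from admin-owned or opted-in mail.
# All paths are hypothetical placeholders.
HAM_DIRS="/home/admin1/Maildir/cur /var/spool/ham-donations"
SPAM_DIR="/var/spool/spamtrap"

for d in $HAM_DIRS; do
    echo "would learn ham from: $d"
done
echo "would learn spam from: $SPAM_DIR"

# The real commands, guarded so the sketch is harmless on a box
# where SpamAssassin is not installed:
if command -v sa-learn >/dev/null 2>&1; then
    for d in $HAM_DIRS; do sa-learn --ham "$d"; done
    sa-learn --spam "$SPAM_DIR"
    sa-learn --dump magic | grep -E 'nspam|nham'
fi
```

An opt-in "donate this folder as ham" scheme sidesteps the data-protection concern entirely, since users choose what the filter sees.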

Re: sa-learn ham

Posted by Ronan <r....@qub.ac.uk>.

Jim Maul wrote:

> Ronan wrote:
> 
>> So it doesn't make a difference if you have inordinately larger amounts
>> of one than the other? I would have thought it would've worked better
>> with more ham...
>> I read somewhere on the list that it's best to balance.
>>
> 
> You'll get conflicting answers to this question.  The only real answer
> I can give is "see what works best on your system."  If you get
> significantly more spam than ham, and autolearning is enabled, then you
> will have significantly more spam tokens than ham.  Over here at the
> hospital where I work, we get significantly more ham than spam, so my
> numbers are usually the opposite of everyone else's.  This goes to show
> that even with opposite ratios, the Bayes system still works properly.
> This should be argument enough that you don't need to have a balance in
> the number of spam/ham tokens.
> 
>> On a related note, why is my autolearn not functioning properly?
>>
> 
> Good question.  I'm not sure what the causes of autolearn=failed are.  But
> I did happen to notice that you appear to have ALL_TRUSTED firing on
> every email you receive.  There may be larger issues with the setup
> here.  Posting on the list (and not just to me) may provide you with
> more answers, as many more people will see the message as well.

Ahh yeah, I hit reply instead of reply-all.

Does anyone out there see anything majorly or minorly wrong with the output below?

thanks

ronan
> 
> -Jim
> 
>> bash-2.03$ sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0       1077          0  non-token data: nspam
>> 0.000          0        427          0  non-token data: nham
>> 0.000          0     120915          0  non-token data: ntokens
>> 0.000          0 1082126382          0  non-token data: oldest atime
>> 0.000          0 1101307652          0  non-token data: newest atime
>> 0.000          0 1101307670          0  non-token data: last journal 
>> sync atime
>> 0.000          0 1100189181          0  non-token data: last expiry atime
>> 0.000          0          0          0  non-token data: last expire 
>> atime delta
>> 0.000          0          0          0  non-token data: last expire 
>> reduction count
>> bash-2.03$ tail -f /var/log/syslog|grep autolearn
>> Nov 24 14:56:12 elisha spamd[5125]: result: . -1 - ALL_TRUSTED 
>> scantime=2.8,size=1755,mid=<00...@w2k.cs.qub.ac.uk>,autolearn=failed 
>>
>> Nov 24 14:56:15 elisha spamd[5125]: result: . -1 - 
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME 
>> scantime=0.6,size=1155,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:27 elisha spamd[6919]: result: .  0 - 
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,MISSING_SUBJECT,NO_REAL_NAME 
>> scantime=2.8,size=1329,mid=<E1...@amos>,autolearn=no
>> Nov 24 14:56:29 elisha spamd[7794]: result: . -1 - 
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS 
>> scantime=2.5,size=1705,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:31 elisha spamd[5467]: result: .  0 - 
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,J_CHICKENPOX_21,J_CHICKENPOX_24,NO_REAL_NAME 
>> scantime=5.2,size=4798,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:32 elisha spamd[6919]: result: . -1 - 
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME 
>> scantime=2.5,size=2668,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:34 elisha spamd[7794]: result: . -1 - ALL_TRUSTED 
>> scantime=0.6,size=32341,mid=<Pi...@toad.am.qub.ac.uk>,autolearn=failed 
>>
>> Nov 24 14:56:35 elisha spamd[5467]: result: .  2 - 
>> FORGED_HOTMAIL_RCVD2,FORGED_RCVD_HELO,MISSING_MIMEOLE,NO_REAL_NAME,PRIORITY_NO_NAME,RCVD_IN_SORBS_DUL 
>> scantime=0.6,size=78030,mid=<fe...@hotmail.com>,autolearn=no 
>>
>> Nov 24 14:56:36 elisha spamd[8365]: result: . -1 - 
>> ALL_TRUSTED,HTML_MESSAGE,HTML_TAG_EXIST_TBODY 
>> scantime=8.5,size=12218,mid=<NK...@qub.ac.uk>,autolearn=failed 
>>
>> Nov 24 14:56:38 elisha spamd[6919]: result: . -1 - ALL_TRUSTED 
>> scantime=1.1,size=14404,mid=<00...@ELMSmagill>,autolearn=failed 
>>
>> Nov 24 14:56:38 elisha spamd[5467]: result: . -1 - 
>> ALL_TRUSTED,HTML_60_70,HTML_MESSAGE 
>> scantime=1.6,size=2221,mid=<00...@avas>,autolearn=failed 
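(A quick way to tally the autolearn outcomes in a log excerpt like the one above is a standard grep/sort/uniq pipeline. Sketched here with abbreviated sample lines piped in; against the live log you would use `grep -o 'autolearn=[a-z]*' /var/log/syslog | sort | uniq -c`.)

```shell
# Count autolearn outcomes; the three sample lines stand in for syslog.
printf '%s\n' \
  'result: . -1 - ALL_TRUSTED scantime=2.8,autolearn=failed' \
  'result: .  0 - NO_REAL_NAME scantime=2.8,autolearn=no' \
  'result: . -1 - ALL_TRUSTED scantime=0.6,autolearn=failed' \
| grep -o 'autolearn=[a-z]*' | sort | uniq -c
# With this sample: 2 x autolearn=failed, 1 x autolearn=no
```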


Re: sa-learn ham

Posted by Jim Maul <jm...@elih.org>.
Ronan wrote:
> So it doesn't make a difference if you have inordinately larger amounts
> of one than the other? I would have thought it would've worked better
> with more ham...
> I read somewhere on the list that it's best to balance.
> 

You'll get conflicting answers to this question.  The only real answer
I can give is "see what works best on your system."  If you get
significantly more spam than ham, and autolearning is enabled, then you
will have significantly more spam tokens than ham.  Over here at the
hospital where I work, we get significantly more ham than spam, so my
numbers are usually the opposite of everyone else's.  This goes to show
that even with opposite ratios, the Bayes system still works properly.
This should be argument enough that you don't need to have a balance in
the number of spam/ham tokens.

> On a related note, why is my autolearn not functioning properly?
> 

Good question.  I'm not sure what the causes of autolearn=failed are.
But I did happen to notice that you appear to have ALL_TRUSTED firing on
every email you receive.  There may be larger issues with the setup
here.  Posting on the list (and not just to me) may provide you with
more answers, as many more people will see the message as well.

-Jim
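(A common reason for ALL_TRUSTED firing on every inbound message is that spamd is inferring an over-broad trusted path; pinning the trusted relays down explicitly in local.cf often fixes it. A hedged sketch - trusted_networks and internal_networks are real SpamAssassin directives, but the 192.0.2.0/24 range below is a placeholder for your own relay addresses:)

```
# local.cf - addresses are placeholders; substitute your own relays
trusted_networks  192.0.2.0/24
internal_networks 192.0.2.0/24
```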

> bash-2.03$ sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0       1077          0  non-token data: nspam
> 0.000          0        427          0  non-token data: nham
> 0.000          0     120915          0  non-token data: ntokens
> 0.000          0 1082126382          0  non-token data: oldest atime
> 0.000          0 1101307652          0  non-token data: newest atime
> 0.000          0 1101307670          0  non-token data: last journal 
> sync atime
> 0.000          0 1100189181          0  non-token data: last expiry atime
> 0.000          0          0          0  non-token data: last expire 
> atime delta
> 0.000          0          0          0  non-token data: last expire 
> reduction count
> bash-2.03$ tail -f /var/log/syslog|grep autolearn
> Nov 24 14:56:12 elisha spamd[5125]: result: . -1 - ALL_TRUSTED 
> scantime=2.8,size=1755,mid=<00...@w2k.cs.qub.ac.uk>,autolearn=failed 
> 
> Nov 24 14:56:15 elisha spamd[5125]: result: . -1 - 
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME 
> scantime=0.6,size=1155,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:27 elisha spamd[6919]: result: .  0 - 
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,MISSING_SUBJECT,NO_REAL_NAME 
> scantime=2.8,size=1329,mid=<E1...@amos>,autolearn=no
> Nov 24 14:56:29 elisha spamd[7794]: result: . -1 - 
> ALL_TRUSTED,FROM_ENDS_IN_NUMS 
> scantime=2.5,size=1705,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:31 elisha spamd[5467]: result: .  0 - 
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,J_CHICKENPOX_21,J_CHICKENPOX_24,NO_REAL_NAME 
> scantime=5.2,size=4798,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:32 elisha spamd[6919]: result: . -1 - 
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME 
> scantime=2.5,size=2668,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:34 elisha spamd[7794]: result: . -1 - ALL_TRUSTED 
> scantime=0.6,size=32341,mid=<Pi...@toad.am.qub.ac.uk>,autolearn=failed 
> 
> Nov 24 14:56:35 elisha spamd[5467]: result: .  2 - 
> FORGED_HOTMAIL_RCVD2,FORGED_RCVD_HELO,MISSING_MIMEOLE,NO_REAL_NAME,PRIORITY_NO_NAME,RCVD_IN_SORBS_DUL 
> scantime=0.6,size=78030,mid=<fe...@hotmail.com>,autolearn=no
> Nov 24 14:56:36 elisha spamd[8365]: result: . -1 - 
> ALL_TRUSTED,HTML_MESSAGE,HTML_TAG_EXIST_TBODY 
> scantime=8.5,size=12218,mid=<NK...@qub.ac.uk>,autolearn=failed 
> 
> Nov 24 14:56:38 elisha spamd[6919]: result: . -1 - ALL_TRUSTED 
> scantime=1.1,size=14404,mid=<00...@ELMSmagill>,autolearn=failed 
> 
> Nov 24 14:56:38 elisha spamd[5467]: result: . -1 - 
> ALL_TRUSTED,HTML_60_70,HTML_MESSAGE 
> scantime=1.6,size=2221,mid=<00...@avas>,autolearn=failed 


Re: sa-learn ham

Posted by Jim Maul <jm...@elih.org>.
Ronan wrote:
> 
> 
> Jim Maul wrote:
> 
>> Ronan wrote:
>>
>>> hi all.
>>> For those of you running large-volume servers, you no doubt have an
>>> abundance of spam to feed into sa-learn, and I suppose that goes for
>>> all sizes of volumes.
>>> But one question: how do you manage to match that number with
>>> hams / real messages? How do you go about bumping up the numbers to
>>> even out the DB? Am I right in saying that basically any mail that's
>>> not spam is ham, or is ham only supposed to be mail that is a false
>>> positive, i.e. has been tagged but isn't really spam?
>>
>>
>>
>> Attempting to get these numbers equal is an unnecessary and, as you've
>> discovered, almost futile task.
>>
>> While I would *not* recommend relying on autolearning exclusively, it
>> is working incredibly well here with the occasional manual sa-learn
>> here and there.  sa-learn --dump magic shows the following for my system:
>>
>> 0.000          0       1105          0  non-token data: nspam
>> 0.000          0      28077          0  non-token data: nham
>>
> 
> Jim, isn't your ratio of ham:spam 25:1 and not 1:25?
> 
>

Oops, yep, you're correct - I had the order switched.  Regardless, my point
still stands :)

-Jim

Re: sa-learn ham

Posted by Ronan <r....@qub.ac.uk>.

Jim Maul wrote:
> Ronan wrote:
> 
>> hi all.
>> For those of you running large-volume servers, you no doubt have an
>> abundance of spam to feed into sa-learn, and I suppose that goes for
>> all sizes of volumes.
>> But one question: how do you manage to match that number with hams
>> / real messages? How do you go about bumping up the numbers to even
>> out the DB? Am I right in saying that basically any mail that's not
>> spam is ham, or is ham only supposed to be mail that is a false
>> positive, i.e. has been tagged but isn't really spam?
> 
> 
> Attempting to get these numbers equal is an unnecessary and, as you've
> discovered, almost futile task.
> 
> While I would *not* recommend relying on autolearning exclusively, it is
> working incredibly well here with the occasional manual sa-learn here
> and there.  sa-learn --dump magic shows the following for my system:
> 
> 0.000          0       1105          0  non-token data: nspam
> 0.000          0      28077          0  non-token data: nham
> 

Jim, isn't your ratio of ham:spam 25:1 and not 1:25?
> 
> That's like a 1:25 ratio of ham:spam, and it is quite rare that I see any
> Bayes scores that aren't bayes_0 or bayes_99.  Of course, your mileage
> may, and probably will, vary.
> 
> -Jim


Re: sa-learn ham

Posted by Gavin Cato <ga...@corp.nexon.com.au>.
I agree, autolearn in conjunction with the odd manual insert works very well
here, although I'm still having trouble blocking the variations of those
ridiculous drugs/rx msgs.

0.000          0    1781758          0  non-token data: nspam
0.000          0     319835          0  non-token data: nham

Cheers

Gav

> 
> While i would *not* recommend running on autolearning exclusively, it is
> working incredibly well here with the occasional manual sa-learn here
> and there.  sa-learn --dump magic shows the following for my system:
> 
> 0.000          0       1105          0  non-token data: nspam
> 0.000          0      28077          0  non-token data: nham
> 
> 
> That's like a 1:25 ratio of ham:spam, and it is quite rare that I see any
> Bayes scores that aren't bayes_0 or bayes_99.  Of course, your mileage
> may, and probably will, vary.
> 
> -Jim



Re: sa-learn ham

Posted by Jim Maul <jm...@elih.org>.
Ronan wrote:
> hi all.
> For those of you running large-volume servers, you no doubt have an
> abundance of spam to feed into sa-learn, and I suppose that goes for all
> sizes of volumes.
> But one question: how do you manage to match that number with hams /
> real messages? How do you go about bumping up the numbers to even out the
> DB? Am I right in saying that basically any mail that's not spam is ham,
> or is ham only supposed to be mail that is a false positive, i.e. has
> been tagged but isn't really spam?

Attempting to get these numbers equal is an unnecessary and, as you've
discovered, almost futile task.

While I would *not* recommend relying on autolearning exclusively, it is
working incredibly well here with the occasional manual sa-learn here
and there.  sa-learn --dump magic shows the following for my system:

0.000          0       1105          0  non-token data: nspam
0.000          0      28077          0  non-token data: nham


That's like a 1:25 ratio of ham:spam, and it is quite rare that I see any
Bayes scores that aren't bayes_0 or bayes_99.  Of course, your mileage
may, and probably will, vary.

-Jim
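(The ratio being debated above can be read straight out of `sa-learn --dump magic`: the third column of the nspam/nham lines. A small awk sketch, fed the two sample lines quoted in the message; in practice you would pipe the real command's output in instead of the here-document:)

```shell
# Extract nspam/nham counts and print the ham:spam ratio.
# Sample input copied from the thread; replace the here-document with
#   sa-learn --dump magic | awk '...'
# on a live system.
awk '/non-token data: nspam/ { spam = $3 }
     /non-token data: nham/  { ham  = $3 }
     END { printf "nspam=%d nham=%d ham:spam ~ %.1f:1\n", spam, ham, ham / spam }' <<'EOF'
0.000          0       1105          0  non-token data: nspam
0.000          0      28077          0  non-token data: nham
EOF
```

With Jim's numbers this reports a ratio of roughly 25:1 ham:spam, matching Ronan's correction.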