Posted to users@spamassassin.apache.org by Ronan <r....@qub.ac.uk> on 2004/11/24 15:22:21 UTC
sa-learn ham
hi all.
For those of you running large-volume servers, you no doubt have an
abundance of spam to feed into sa-learn, and I suppose that goes for
volumes of all sizes.
But one question: how do you manage to match that number with hams /
real messages? How do you go about bumping up the numbers to even out the
DB? Am I right in saying that basically any mail that's not spam is ham, or
is ham only supposed to be mail that is a false positive, i.e. has been
tagged but isn't really spam?
Here at the university there are 3 admins who, if they wanted, could read
other people's email... data protection blah blah, but it's simply a side
effect of administering the systems.
Putting a random selection of users' HAM emails (which could be, and
unsurprisingly are, personal) into the filter to balance the DB could be
contentious - but it's the only way to get a good selection of emails.
As I said, there are only the 3 of us, but we have around 40,000 mailboxes,
and 3 isn't really a good representation in terms of the quality of emails
to be feeding as ham into sa-learn. Aside from opening up a mailbox to pleb
users and creating more havoc, what are the recommended ways of getting
around this?
thanks
ronan
--
Regards
Ronan McGlue
==============
Analyst/Programmer
Information Services
Queens University Belfast
BT7 1NN
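[Editor's sketch: the manual training loop being asked about uses sa-learn's real --ham, --spam, and --dump magic options. The corpus paths below are hypothetical placeholders, and the script guards for sa-learn not being installed.]

```shell
#!/bin/sh
# Hypothetical corpus locations; in practice these would be folders of
# hand-sorted messages (e.g. admin-reviewed submissions, not random mailboxes).
HAM_DIR=./corpus/ham
SPAM_DIR=./corpus/spam
mkdir -p "$HAM_DIR" "$SPAM_DIR"

if command -v sa-learn >/dev/null 2>&1; then
    # --ham / --spam train the Bayes DB; already-learned messages are skipped.
    sa-learn --ham  "$HAM_DIR"
    sa-learn --spam "$SPAM_DIR"
    # Check the resulting corpus counts (nspam / nham).
    sa-learn --dump magic | grep -E 'nspam|nham'
else
    echo "sa-learn not found; install SpamAssassin to run the training step"
fi
```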
Re: sa-learn ham
Posted by Ronan <r....@qub.ac.uk>.
Jim Maul wrote:
> Ronan wrote:
>
>> so it doesn't make a difference if you have inordinately larger amounts
>> of one than the other?? I would have thought it would've worked better
>> with more ham...
>> I read somewhere on the list that it's best to balance.
>>
>
> You'll get conflicting answers to this question. The only real answer
> I can give is "see what works best on your system." If you get
> significantly more spam than ham, and autolearning is enabled, then you
> will have significantly more spam tokens than ham. Over here at the
> hospital where I work, we get significantly more ham than spam, so my
> numbers are usually the opposite of everyone else's. This goes to show
> that even with opposite ratios, the Bayes system still works properly.
> This should be argument enough that you don't need to have a balance in
> the number of spam/ham tokens.
>
>> on a related note, why is my autolearn not functioning properly???
>>
>
> Good question. I'm not sure what the causes of autolearn=failed are. But
> I did happen to notice that you appear to have ALL_TRUSTED firing on
> every email you receive. There may be larger issues with the setup
> here. Posting on the list (and not just to me) may provide you with
> more answers, as many more people will see the message as well.
ahh yeah, hit reply instead of reply-all.
anyone out there see anything majorly or minorly wrong with the output below??
thanks
ronan
>
> -Jim
>
>> bash-2.03$ sa-learn --dump magic
>> 0.000 0 3 0 non-token data: bayes db version
>> 0.000 0 1077 0 non-token data: nspam
>> 0.000 0 427 0 non-token data: nham
>> 0.000 0 120915 0 non-token data: ntokens
>> 0.000 0 1082126382 0 non-token data: oldest atime
>> 0.000 0 1101307652 0 non-token data: newest atime
>> 0.000 0 1101307670 0 non-token data: last journal sync atime
>> 0.000 0 1100189181 0 non-token data: last expiry atime
>> 0.000 0 0 0 non-token data: last expire atime delta
>> 0.000 0 0 0 non-token data: last expire reduction count
>> bash-2.03$ tail -f /var/log/syslog|grep autolearn
>> Nov 24 14:56:12 elisha spamd[5125]: result: . -1 - ALL_TRUSTED
>> scantime=2.8,size=1755,mid=<00...@w2k.cs.qub.ac.uk>,autolearn=failed
>>
>> Nov 24 14:56:15 elisha spamd[5125]: result: . -1 -
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME
>> scantime=0.6,size=1155,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:27 elisha spamd[6919]: result: . 0 -
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,MISSING_SUBJECT,NO_REAL_NAME
>> scantime=2.8,size=1329,mid=<E1...@amos>,autolearn=no
>> Nov 24 14:56:29 elisha spamd[7794]: result: . -1 -
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS
>> scantime=2.5,size=1705,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:31 elisha spamd[5467]: result: . 0 -
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,J_CHICKENPOX_21,J_CHICKENPOX_24,NO_REAL_NAME
>> scantime=5.2,size=4798,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:32 elisha spamd[6919]: result: . -1 -
>> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME
>> scantime=2.5,size=2668,mid=<E1...@amos>,autolearn=failed
>> Nov 24 14:56:34 elisha spamd[7794]: result: . -1 - ALL_TRUSTED
>> scantime=0.6,size=32341,mid=<Pi...@toad.am.qub.ac.uk>,autolearn=failed
>>
>> Nov 24 14:56:35 elisha spamd[5467]: result: . 2 -
>> FORGED_HOTMAIL_RCVD2,FORGED_RCVD_HELO,MISSING_MIMEOLE,NO_REAL_NAME,PRIORITY_NO_NAME,RCVD_IN_SORBS_DUL
>> scantime=0.6,size=78030,mid=<fe...@hotmail.com>,autolearn=no
>>
>> Nov 24 14:56:36 elisha spamd[8365]: result: . -1 -
>> ALL_TRUSTED,HTML_MESSAGE,HTML_TAG_EXIST_TBODY
>> scantime=8.5,size=12218,mid=<NK...@qub.ac.uk>,autolearn=failed
>>
>> Nov 24 14:56:38 elisha spamd[6919]: result: . -1 - ALL_TRUSTED
>> scantime=1.1,size=14404,mid=<00...@ELMSmagill>,autolearn=failed
>>
>> Nov 24 14:56:38 elisha spamd[5467]: result: . -1 -
>> ALL_TRUSTED,HTML_60_70,HTML_MESSAGE
>> scantime=1.6,size=2221,mid=<00...@avas>,autolearn=failed
Re: sa-learn ham
Posted by Jim Maul <jm...@elih.org>.
Ronan wrote:
> so it doesn't make a difference if you have inordinately larger amounts
> of one than the other?? I would have thought it would've worked better
> with more ham...
> I read somewhere on the list that it's best to balance.
>
You'll get conflicting answers to this question. The only real answer
I can give is "see what works best on your system." If you get
significantly more spam than ham, and autolearning is enabled, then you
will have significantly more spam tokens than ham. Over here at the
hospital where I work, we get significantly more ham than spam, so my
numbers are usually the opposite of everyone else's. This goes to show
that even with opposite ratios, the Bayes system still works properly.
This should be argument enough that you don't need to have a balance in
the number of spam/ham tokens.
> on a related note, why is my autolearn not functioning properly???
>
Good question. I'm not sure what the causes of autolearn=failed are.
But I did happen to notice that you appear to have ALL_TRUSTED firing on
every email you receive. There may be larger issues with the setup
here. Posting on the list (and not just to me) may provide you with
more answers, as many more people will see the message as well.
-Jim
> bash-2.03$ sa-learn --dump magic
> 0.000 0 3 0 non-token data: bayes db version
> 0.000 0 1077 0 non-token data: nspam
> 0.000 0 427 0 non-token data: nham
> 0.000 0 120915 0 non-token data: ntokens
> 0.000 0 1082126382 0 non-token data: oldest atime
> 0.000 0 1101307652 0 non-token data: newest atime
> 0.000 0 1101307670 0 non-token data: last journal sync atime
> 0.000 0 1100189181 0 non-token data: last expiry atime
> 0.000 0 0 0 non-token data: last expire atime delta
> 0.000 0 0 0 non-token data: last expire reduction count
> bash-2.03$ tail -f /var/log/syslog|grep autolearn
> Nov 24 14:56:12 elisha spamd[5125]: result: . -1 - ALL_TRUSTED
> scantime=2.8,size=1755,mid=<00...@w2k.cs.qub.ac.uk>,autolearn=failed
>
> Nov 24 14:56:15 elisha spamd[5125]: result: . -1 -
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME
> scantime=0.6,size=1155,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:27 elisha spamd[6919]: result: . 0 -
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,MISSING_SUBJECT,NO_REAL_NAME
> scantime=2.8,size=1329,mid=<E1...@amos>,autolearn=no
> Nov 24 14:56:29 elisha spamd[7794]: result: . -1 -
> ALL_TRUSTED,FROM_ENDS_IN_NUMS
> scantime=2.5,size=1705,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:31 elisha spamd[5467]: result: . 0 -
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,J_CHICKENPOX_21,J_CHICKENPOX_24,NO_REAL_NAME
> scantime=5.2,size=4798,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:32 elisha spamd[6919]: result: . -1 -
> ALL_TRUSTED,FROM_ENDS_IN_NUMS,NO_REAL_NAME
> scantime=2.5,size=2668,mid=<E1...@amos>,autolearn=failed
> Nov 24 14:56:34 elisha spamd[7794]: result: . -1 - ALL_TRUSTED
> scantime=0.6,size=32341,mid=<Pi...@toad.am.qub.ac.uk>,autolearn=failed
>
> Nov 24 14:56:35 elisha spamd[5467]: result: . 2 -
> FORGED_HOTMAIL_RCVD2,FORGED_RCVD_HELO,MISSING_MIMEOLE,NO_REAL_NAME,PRIORITY_NO_NAME,RCVD_IN_SORBS_DUL
> scantime=0.6,size=78030,mid=<fe...@hotmail.com>,autolearn=no
> Nov 24 14:56:36 elisha spamd[8365]: result: . -1 -
> ALL_TRUSTED,HTML_MESSAGE,HTML_TAG_EXIST_TBODY
> scantime=8.5,size=12218,mid=<NK...@qub.ac.uk>,autolearn=failed
>
> Nov 24 14:56:38 elisha spamd[6919]: result: . -1 - ALL_TRUSTED
> scantime=1.1,size=14404,mid=<00...@ELMSmagill>,autolearn=failed
>
> Nov 24 14:56:38 elisha spamd[5467]: result: . -1 -
> ALL_TRUSTED,HTML_60_70,HTML_MESSAGE
> scantime=1.6,size=2221,mid=<00...@avas>,autolearn=failed
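[Editor's note on the ALL_TRUSTED point: that rule fires when every Received header appears to come from a trusted relay, which usually means SpamAssassin's trust path is misconfigured rather than that all mail is genuinely internal. One hedged fix uses the real trusted_networks / internal_networks directives in local.cf; the 192.168.0.0/16 range below is a placeholder, not this site's actual netblock.]

```
# local.cf sketch -- the 192.168.0.0/16 range is a placeholder; list the
# relays/netblocks that actually handle your site's inbound mail.
trusted_networks  192.168.0.0/16
internal_networks 192.168.0.0/16
```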
Re: sa-learn ham
Posted by Jim Maul <jm...@elih.org>.
Ronan wrote:
>
>
> Jim Maul wrote:
>
>> Ronan wrote:
>>
>>> hi all.
>>> For those of you running large-volume servers, you no doubt have an
>>> abundance of spam to feed into sa-learn, and I suppose that goes for
>>> volumes of all sizes.
>>> But one question: how do you manage to match that number with
>>> hams / real messages? How do you go about bumping up the numbers to
>>> even out the DB? Am I right in saying that basically any mail that's
>>> not spam is ham, or is ham only supposed to be mail that is a false
>>> positive, i.e. has been tagged but isn't really spam?
>>
>>
>>
>> Attempting to get these numbers equal is an unnecessary and, as you've
>> discovered, almost futile task.
>>
>> While I would *not* recommend running on autolearning exclusively, it
>> is working incredibly well here with the occasional manual sa-learn
>> here and there. sa-learn --dump magic shows the following for my system:
>>
>> 0.000 0 1105 0 non-token data: nspam
>> 0.000 0 28077 0 non-token data: nham
>>
>
> Jim, isn't your ratio of ham:spam 25:1 and not 1:25?
>
>
Oops, yep, you're correct, I had the order switched. Regardless, my point
still stands :)
-Jim
Re: sa-learn ham
Posted by Ronan <r....@qub.ac.uk>.
Jim Maul wrote:
> Ronan wrote:
>
>> hi all.
>> For those of you running large-volume servers, you no doubt have an
>> abundance of spam to feed into sa-learn, and I suppose that goes for
>> volumes of all sizes.
>> But one question: how do you manage to match that number with hams
>> / real messages? How do you go about bumping up the numbers to even
>> out the DB? Am I right in saying that basically any mail that's not
>> spam is ham, or is ham only supposed to be mail that is a false
>> positive, i.e. has been tagged but isn't really spam?
>
>
> Attempting to get these numbers equal is an unnecessary and, as you've
> discovered, almost futile task.
>
> While I would *not* recommend running on autolearning exclusively, it is
> working incredibly well here with the occasional manual sa-learn here
> and there. sa-learn --dump magic shows the following for my system:
>
> 0.000 0 1105 0 non-token data: nspam
> 0.000 0 28077 0 non-token data: nham
>
Jim, isn't your ratio of ham:spam 25:1 and not 1:25?
>
> That's like a 1:25 ratio of ham:spam, and it is quite rare that I see any
> bayes scores that aren't bayes_0 or bayes_99. Of course, your mileage
> may, and probably will, vary.
>
> -Jim
--
Regards
Ronan McGlue
==============
Analyst/Programmer
Information Services
Queens University Belfast
BT7 1NN
Re: sa-learn ham
Posted by Gavin Cato <ga...@corp.nexon.com.au>.
I agree, autolearn in conjunction with the odd manual insert works very well
here, although I'm still having trouble blocking the variations of those
ridiculous drugs/rx msgs.
0.000 0 1781758 0 non-token data: nspam
0.000 0 319835 0 non-token data: nham
Cheers
Gav
>
> While i would *not* recommend running on autolearning exclusively, it is
> working incredibly well here with the occasional manual sa-learn here
> and there. sa-learn --dump magic shows the following for my system:
>
> 0.000 0 1105 0 non-token data: nspam
> 0.000 0 28077 0 non-token data: nham
>
>
> That's like a 1:25 ratio of ham:spam, and it is quite rare that I see any
> bayes scores that aren't bayes_0 or bayes_99. Of course, your mileage
> may, and probably will, vary.
>
> -Jim
Re: sa-learn ham
Posted by Jim Maul <jm...@elih.org>.
Ronan wrote:
> hi all.
> For those of you running large-volume servers, you no doubt have an
> abundance of spam to feed into sa-learn, and I suppose that goes for
> volumes of all sizes.
> But one question: how do you manage to match that number with hams /
> real messages? How do you go about bumping up the numbers to even out the
> DB? Am I right in saying that basically any mail that's not spam is ham,
> or is ham only supposed to be mail that is a false positive, i.e. has
> been tagged but isn't really spam?
Attempting to get these numbers equal is an unnecessary and, as you've
discovered, almost futile task.
While I would *not* recommend running on autolearning exclusively, it is
working incredibly well here with the occasional manual sa-learn here
and there. sa-learn --dump magic shows the following for my system:
0.000 0 1105 0 non-token data: nspam
0.000 0 28077 0 non-token data: nham
That's like a 1:25 ratio of ham:spam, and it is quite rare that I see any
bayes scores that aren't bayes_0 or bayes_99. Of course, your mileage
may, and probably will, vary.
-Jim
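[Editor's sketch: Jim's point that the raw spam:ham corpus ratio need not be balanced can be illustrated with a toy Graham-style per-token probability. This is an illustrative assumption, not SpamAssassin's exact chi-squared combining, but it shows the key property: because token counts are normalized by nspam and nham separately, scaling one corpus up leaves the per-token estimate unchanged.]

```python
# Toy Graham-style per-token spam probability (illustrative only;
# SpamAssassin's actual Bayes combining is more involved).
def token_spam_prob(spam_hits: int, nspam: int, ham_hits: int, nham: int) -> float:
    """P(spam | token), with counts normalized per class."""
    p_spam = spam_hits / nspam  # fraction of spam messages containing the token
    p_ham = ham_hits / nham     # fraction of ham messages containing the token
    return p_spam / (p_spam + p_ham)

# A token seen in 8 of 1000 spam and 2 of 100 ham:
p_small = token_spam_prob(8, 1000, 2, 100)

# Grow the ham corpus 25x while the token's ham frequency stays the same:
p_large = token_spam_prob(8, 1000, 50, 2500)

# The two estimates are identical: corpus imbalance alone does not skew them.
print(p_small, p_large)
```

Under this sketch it is the per-class token frequencies, not the overall nspam:nham ratio, that drive the score, which is consistent with Jim's hospital numbers working as well as everyone else's.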