Posted to users@spamassassin.apache.org by Fajar Priyanto <fa...@arinet.org> on 2005/01/08 22:34:07 UTC
A very long spam
Hi all,
Greetings. I've just joined the list.
I've been using sa-learn with SA 2.64 and 3.0.2.
One thing is bugging me though. Is it safe to train SA on a very long spam,
such as the stock report spam? Will it cause many false positives?
Thanks
--
Fajar Priyanto | Reg'd Linux User #327841 | http://linux2.arinet.org
04:29:04 up 10:57, Mandrakelinux release 10.1 (Official) for i586
public key: https://www.arinet.org/fajar-pub.key
Re: A very long spam
Posted by Matt Kettler <mk...@evi-inc.com>.
At 04:55 PM 1/8/2005, Fajar Priyanto wrote:
>Thanks Matt,
>So talking statistically, does it mean I have to train SA about 'ham' as many
>as 'spam'? Right now, I train SA mostly on spams.
Ideally, yes.
( Personally, my understanding of statistics would say that real-world
ratios would be ideal, but Dan Q has pointed out that the SA dev testing
shows 50/50 works best. I trust Dan's real test of SA more than my own
theoretical observations. )
However, I'd also point out that my own training is wildly imbalanced and works
fine. SA's Bayes system is quite tolerant of wild variations in the
training ratio.
My training ratio is even more imbalanced than my real-world spam-to-ham ratio.
My current training is about 4.1% ham, 95.9% spam, and I have a daily feed
of both ham and spam training. My real-world rate is about 40% ham, 60% spam.
I would also say it's fairly important to regularly train at least some ham
when you train in spam. Even if the ratio isn't 40/60 or 50/50, it
shouldn't be 0/100.
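To check where a Bayes database stands against the ratios discussed above, the split can be computed from the number of ham and spam messages trained so far (for example, the nham and nspam counters that 'sa-learn --dump magic' reports). The following is a minimal sketch; the function name and the counts are hypothetical and were chosen only to reproduce the 4.1% / 95.9% split mentioned above.

```python
from typing import Tuple


def training_ratio(nham: int, nspam: int) -> Tuple[float, float]:
    """Return (ham %, spam %) of all messages trained so far."""
    total = nham + nspam
    if total == 0:
        raise ValueError("no messages have been trained yet")
    return 100.0 * nham / total, 100.0 * nspam / total


# Hypothetical counts, not taken from any real database.
ham_pct, spam_pct = training_ratio(nham=410, nspam=9590)
print(f"ham {ham_pct:.1f}%, spam {spam_pct:.1f}%")  # ham 4.1%, spam 95.9%

# The one case the advice above rules out is 0% ham.
if ham_pct == 0.0:
    print("warning: no ham trained at all")
```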
Re: A very long spam
Posted by Christopher John Shaker <cj...@shaker-net.com>.
You can run 'sa-learn --ham' on mail folders that the email user has already
read and culled of spam.
After I did that, my Bayesian filter got surprisingly accurate.
Chris Shaker
cjshaker@shaker-net.com
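Training from already-culled folders can be scripted. The sketch below only builds and runs the sa-learn command line. It assumes the folders are stored in mbox format (hence --mbox), that sa-learn is on the PATH, and that the folder paths shown are hypothetical placeholders. The helper names are illustrative, not part of SpamAssassin.

```python
import subprocess
from typing import List


def sa_learn_command(kind: str, mbox_path: str) -> List[str]:
    """Build the sa-learn argument list for one mbox folder."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", f"--{kind}", "--mbox", mbox_path]


def train(kind: str, mbox_path: str) -> None:
    """Run sa-learn on the folder; raises CalledProcessError on failure."""
    subprocess.run(sa_learn_command(kind, mbox_path), check=True)


if __name__ == "__main__":
    # Hypothetical paths: a folder of read, spam-free mail and a spam folder.
    train("ham", "/home/user/mail/saved")
    train("spam", "/home/user/mail/caught-spam")
```

Training both folders in the same run keeps at least some ham going in whenever spam is trained.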
----- Original Message -----
From: "Dave Hills" <da...@dailyhills.com>
To: <us...@spamassassin.apache.org>
Sent: Saturday, January 08, 2005 2:06 PM
Subject: Re: A very long spam
>I try to train as much HAM as I can but I don't think it's possible to
> train HAM/SPAM equally as 90% of incoming email is SPAM.
>
>
> On Jan 8, 2005, at 1:55 PM, Fajar Priyanto wrote:
>
>>> At 04:34 AM 1/9/2005 +0700, you wrote:
>>>> Hi all,
>>>> Greetings. I've just joined the list.
>>>>
>>>> I've been using sa-learn with SA 2.64 and 3.0.2
>>>> One thing is bugging me though. Is it safe to teach SA on a very
>>>> long spam
>>>> such as the stock report spam? Will it cause many False Positive?
>>>
>>> Why would you think it would?
>>>
>>> By trying to avoid training that message you're poisoning your bayes
>>> database for false negatives.
>>>
>>> Train spam as spam, train ham as ham. Let the statistics deal with the
>>> overlap. By trying to avoid training "spamish" ham or "hamish" spam
>>> you're
>>> just doing your training a big disservice by making it unrealistic.
>>
>> Thanks Matt,
>> So talking statistically, does it mean I have to train SA about 'ham'
>> as many
>> as 'spam'? Right now, I train SA mostly on spams.
>
>
Re: A very long spam
Posted by Dave Hills <da...@dailyhills.com>.
I try to train as much HAM as I can but I don't think it's possible to
train HAM/SPAM equally as 90% of incoming email is SPAM.
On Jan 8, 2005, at 1:55 PM, Fajar Priyanto wrote:
>> At 04:34 AM 1/9/2005 +0700, you wrote:
>>> Hi all,
>>> Greetings. I've just joined the list.
>>>
>>> I've been using sa-learn with SA 2.64 and 3.0.2
>>> One thing is bugging me though. Is it safe to teach SA on a very
>>> long spam
>>> such as the stock report spam? Will it cause many False Positive?
>>
>> Why would you think it would?
>>
>> By trying to avoid training that message you're poisoning your bayes
>> database for false negatives.
>>
>> Train spam as spam, train ham as ham. Let the statistics deal with the
>> overlap. By trying to avoid training "spamish" ham or "hamish" spam
>> you're
>> just doing your training a big disservice by making it unrealistic.
>
> Thanks Matt,
> So talking statistically, does it mean I have to train SA about 'ham'
> as many
> as 'spam'? Right now, I train SA mostly on spams.
Re: A very long spam
Posted by Thomas Arend <ml...@arend-whv.info>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Saturday, 8 January 2005 22:55, Fajar Priyanto wrote:
> On Sunday 09 January 2005 04:47 am, Matt Kettler wrote:
[..]
> > Train spam as spam, train ham as ham. Let the statistics deal with the
> > overlap. By trying to avoid training "spamish" ham or "hamish" spam
> > you're just doing your training a big disservice by making it
> > unrealistic.
>
> Thanks Matt,
> So talking statistically, does it mean I have to train SA about 'ham' as
> many as 'spam'? Right now, I train SA mostly on spams.
You must train both ham and spam. How should the Bayes filter know what ham is if
you didn't train it?
As far as I understand, the Bayes filter searches for tokens in the email. If a
token was found in 30 spam and 10 ham mails, then the probability of being
spam is 75%. But if you only train spam, the Bayes filter would say: I have
learned 30 spam mails but no ham, so the probability of being spam is 100%.
(The Bayes calculation is done with some of the ham/spam tokens. How many tokens
are taken into account, I don't know.)
If you only or mostly train spam, this will poison your database and the
false positives will grow. To keep false positives low, you should teach SA all
your ham.
You are unlikely to train as much ham as spam, because there is more spam. But
this does no harm. The Bayesian filter works on the tokens found. Let's assume you
have taught it 200 spam and 100 ham. 100 spam and 100 ham contained the token x.
If x is found in a new message, then the spam probability is 50%, even though
x appears in 100% of the ham messages.
If you teach only half the ham messages, the spam-ham ratio would be 100 to 50,
which gives a probability of about 67% of being spam.
Regards
Thomas
- --
icq:133073900
http://www.t-arend.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFB4RLeHe2ZLU3NgHsRAjgSAKCHYwQWLMJExHdtrgb0OLXHHy00XwCeKIyw
Y7oZeRBZ22sOlpZFmc5Ln7M=
=i9Cw
-----END PGP SIGNATURE-----
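The arithmetic in Thomas's message can be reproduced with a few lines of Python. The first function is the raw-count model he describes. The second is an assumption added for comparison: a frequency-normalised estimate of the kind commonly used in Bayesian spam filters, which divides each count by the number of messages trained in that class. The thread does not say which formula SA actually uses, so treat both as illustrations only.

```python
def raw_count_prob(spam_hits: int, ham_hits: int) -> float:
    """Spam probability of a token from raw message counts alone."""
    return spam_hits / (spam_hits + ham_hits)


def normalized_prob(spam_hits: int, ham_hits: int, nspam: int, nham: int) -> float:
    """Assumed variant: compare per-class frequencies, not raw counts."""
    if nspam == 0 or nham == 0:
        raise ValueError("both ham and spam must have been trained")
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    return spam_freq / (spam_freq + ham_freq)


print(f"{raw_count_prob(30, 10):.0%}")    # 75%: token in 30 spam, 10 ham
print(f"{raw_count_prob(30, 0):.0%}")     # 100%: no ham trained at all
print(f"{raw_count_prob(100, 100):.0%}")  # 50%: 200 spam / 100 ham trained
print(f"{raw_count_prob(100, 50):.0%}")   # 67%: only half the ham trained

# Under the normalised assumption the amount of ham trained cancels out:
print(f"{normalized_prob(100, 100, 200, 100):.0%}")  # 33%
print(f"{normalized_prob(100, 50, 200, 50):.0%}")    # 33%
```

Either way, with no ham trained there is nothing to weigh a token against, which is the point of the message above.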
Re: A very long spam
Posted by Fajar Priyanto <fa...@arinet.org>.
On Sunday 09 January 2005 04:47 am, Matt Kettler wrote:
> At 04:34 AM 1/9/2005 +0700, you wrote:
> >Hi all,
> >Greetings. I've just joined the list.
> >
> >I've been using sa-learn with SA 2.64 and 3.0.2
> >One thing is bugging me though. Is it safe to teach SA on a very long spam
> >such as the stock report spam? Will it cause many False Positive?
>
> Why would you think it would?
>
> By trying to avoid training that message you're poisoning your bayes
> database for false negatives.
>
> Train spam as spam, train ham as ham. Let the statistics deal with the
> overlap. By trying to avoid training "spamish" ham or "hamish" spam you're
> just doing your training a big disservice by making it unrealistic.
Thanks Matt,
So, statistically speaking, does that mean I have to train SA on as much ham
as spam? Right now, I train SA mostly on spam.
--
Fajar Priyanto | Reg'd Linux User #327841 | http://linux2.arinet.org
04:53:49 up 11:22, Mandrakelinux release 10.1 (Official) for i586
public key: https://www.arinet.org/fajar-pub.key
Re: A very long spam
Posted by Matt Kettler <mk...@comcast.net>.
At 04:34 AM 1/9/2005 +0700, you wrote:
>Hi all,
>Greetings. I've just joined the list.
>
>I've been using sa-learn with SA 2.64 and 3.0.2
>One thing is bugging me though. Is it safe to teach SA on a very long spam
>such as the stock report spam? Will it cause many False Positive?
Why would you think it would?
By trying to avoid training that message you're poisoning your bayes
database for false negatives.
Train spam as spam, train ham as ham. Let the statistics deal with the
overlap. By trying to avoid training "spamish" ham or "hamish" spam you're
just doing your training a big disservice by making it unrealistic.