Posted to users@spamassassin.apache.org by Fajar Priyanto <fa...@arinet.org> on 2005/01/08 22:34:07 UTC

A very long spam

Hi all, 
Greetings. I've just joined the list. 

I've been using sa-learn with SA 2.64 and 3.0.2.
One thing is bugging me, though. Is it safe to train SA on a very long spam, 
such as the stock report spam? Will it cause many false positives?

Thanks
-- 
Fajar Priyanto | Reg'd Linux User #327841 | http://linux2.arinet.org
04:29:04 up 10:57, Mandrakelinux release 10.1 (Official) for i586 
public key: https://www.arinet.org/fajar-pub.key

Re: A very long spam

Posted by Matt Kettler <mk...@evi-inc.com>.
At 04:55 PM 1/8/2005, Fajar Priyanto wrote:
>Thanks Matt,
>So talking statistically, does it mean I have to train SA about 'ham' as many
>as 'spam'? Right now, I train SA mostly on spams.

Ideally, yes.

( Personally, my understanding of statistics would say that real-world 
ratios would be ideal, but Dan Q has pointed out that the SA dev testing 
shows 50/50 works best. I trust Dan's real test of SA more than my own 
theoretical observations. )

However, I'd also point out my own training is wildly imbalanced and works 
fine. SA's bayes system is quite tolerant of wild variations in the 
training ratio.

My training ratio is even more imbalanced than real-world spam/ham ratios are. 
My current training is about 4.1% ham, 95.9% spam, and I have a daily feed 
of both ham and spam training. My real world rate is about 40% ham, 60% spam.


I would also say it's fairly important to regularly train at least some ham 
when you train in spam. Even if the ratio isn't 40/60 or 50/50, it 
shouldn't be 0/100.


Re: A very long spam

Posted by Christopher John Shaker <cj...@shaker-net.com>.
You can run 'sa-learn --ham' on mail folders that the email user has already
read and culled of spam.

After I did that, my Bayesian filter got surprisingly accurate.
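
As a rough sketch of the approach above, the sa-learn call can be wrapped in a
small Python helper. The helper names and the folder path are hypothetical,
and the sketch assumes the cleaned folder is stored in mbox format (hence the
--mbox option); adjust it for other mailbox layouts.

```python
import subprocess
from typing import List


def build_sa_learn_command(kind: str, mbox_path: str) -> List[str]:
    """Build the sa-learn argument list for one mbox folder.

    kind is "ham" or "spam"; mbox_path is a folder the user has already
    sorted by hand. Both names are illustrative, not part of SpamAssassin.
    """
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", f"--{kind}", "--mbox", mbox_path]


def learn(kind: str, mbox_path: str) -> int:
    """Run sa-learn and return its exit status (sa-learn must be on PATH)."""
    completed = subprocess.run(build_sa_learn_command(kind, mbox_path), check=False)
    return completed.returncode


# Show the command that would be run for a hypothetical folder of read mail.
print(" ".join(build_sa_learn_command("ham", "/home/user/mail/read-mail")))
```

Calling learn("ham", ...) on cleaned folders and learn("spam", ...) on a spam
folder keeps both sides of the database fed.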

Chris Shaker
cjshaker@shaker-net.com


----- Original Message ----- 
From: "Dave Hills" <da...@dailyhills.com>
To: <us...@spamassassin.apache.org>
Sent: Saturday, January 08, 2005 2:06 PM
Subject: Re: A very long spam


>I try to train as much HAM as I can but I don't think it's possible to 
> train HAM/SPAM equally as 90% of incoming email is SPAM.
> 
> 
> On Jan 8, 2005, at 1:55 PM, Fajar Priyanto wrote:
> 
>>> At 04:34 AM 1/9/2005 +0700, you wrote:
>>>> Hi all,
>>>> Greetings. I've just joined the list.
>>>>
>>>> I've been using sa-learn with SA 2.64 and 3.0.2
>>>> One thing is bugging me though. Is it safe to teach SA on a very 
>>>> long spam
>>>> such as the stock report spam? Will it cause many False Positive?
>>>
>>> Why would you think it would?
>>>
>>> By trying to avoid training that message you're poisoning your bayes
>>> database for false negatives.
>>>
>>> Train spam as spam, train ham as ham. Let the statistics deal with the
>>> overlap. By trying to avoid training "spamish" ham or "hamish" spam 
>>> you're
>>> just doing your training a big disservice by making it unrealistic.
>>
>> Thanks Matt,
>> So talking statistically, does it mean I have to train SA about 'ham' 
>> as many
>> as 'spam'? Right now, I train SA mostly on spams.
> 
>

Re: A very long spam

Posted by Dave Hills <da...@dailyhills.com>.
I try to train as much HAM as I can but I don't think it's possible to 
train HAM/SPAM equally as 90% of incoming email is SPAM.


On Jan 8, 2005, at 1:55 PM, Fajar Priyanto wrote:

>> At 04:34 AM 1/9/2005 +0700, you wrote:
>>> Hi all,
>>> Greetings. I've just joined the list.
>>>
>>> I've been using sa-learn with SA 2.64 and 3.0.2
>>> One thing is bugging me though. Is it safe to teach SA on a very 
>>> long spam
>>> such as the stock report spam? Will it cause many False Positive?
>>
>> Why would you think it would?
>>
>> By trying to avoid training that message you're poisoning your bayes
>> database for false negatives.
>>
>> Train spam as spam, train ham as ham. Let the statistics deal with the
>> overlap. By trying to avoid training "spamish" ham or "hamish" spam 
>> you're
>> just doing your training a big disservice by making it unrealistic.
>
> Thanks Matt,
> So talking statistically, does it mean I have to train SA about 'ham' 
> as many
> as 'spam'? Right now, I train SA mostly on spams.


Re: A very long spam

Posted by Thomas Arend <ml...@arend-whv.info>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Am Samstag, 8. Januar 2005 22:55 schrieb Fajar Priyanto:
> On Sunday 09 January 2005 04:47 am, Matt Kettler wrote:
[..]

> > Train spam as spam, train ham as ham. Let the statistics deal with the
> > overlap. By trying to avoid training "spamish" ham or "hamish" spam
> > you're just doing your training a big disservice by making it
> > unrealistic.
>
> Thanks Matt,
> So talking statistically, does it mean I have to train SA about 'ham' as
> many as 'spam'? Right now, I train SA mostly on spams.

You must train both ham and spam. How should the Bayes filter know what ham 
is if you didn't train it?

As far as I understand, the Bayes filter searches for tokens in the email. If 
a token was found in 30 spam and 10 ham mails, then the probability of being 
spam is 75%. But if you only train spam, the Bayes filter would say: I have 
learned 30 spam mails but no ham, so the probability of being spam is 100%.
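
The arithmetic above can be sketched in a few lines of Python. This only
illustrates the raw-count ratio described in this message, not SpamAssassin's
actual Bayes code; the function name token_spam_probability is hypothetical.

```python
from typing import Optional


def token_spam_probability(spam_hits: int, ham_hits: int) -> Optional[float]:
    """Share of the trained mails containing a token that were spam.

    Hypothetical helper; returns None for a token that was never seen.
    """
    total = spam_hits + ham_hits
    if total == 0:
        return None
    return spam_hits / total


print(token_spam_probability(30, 10))  # 0.75
print(token_spam_probability(30, 0))   # 1.0, no ham was ever trained
```

With no ham trained, every token seen so far looks like a pure spam token,
which is the problem described above.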

(The Bayes calculation is done with some of the ham/spam tokens. How many 
tokens are taken into account, I don't know.)

If you only or mostly train spam, this will poison your database and the 
false positives will grow. To keep false positives low, you should teach SA 
all your ham.

It's unlikely that you can train as much ham as spam, because there is more 
spam. But this does no harm. The Bayesian filter works on the tokens found. 
Let's assume you have taught it 200 spam and 100 ham, and that 100 spam and 
100 ham contained the token x. If x is found in a new message, then the spam 
probability is 50%, even though the probability of x being in a ham message 
is 100%.

If you teach only half the ham messages, the spam-to-ham ratio for x would be 
100 to 50, which gives a probability of about 66% of being spam. 
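
The two cases above can be checked with a short Python sketch. raw_share
reproduces the raw-count arithmetic from this message. normalised_share is an
assumption added for comparison: it divides each count by the number of mails
trained in that class, as many Bayesian filters do. This thread does not say
how SA itself weights the counts, and both function names are hypothetical.

```python
def raw_share(spam_hits: int, ham_hits: int) -> float:
    """Raw-count ratio used in the example above."""
    return spam_hits / (spam_hits + ham_hits)


def normalised_share(spam_hits: int, ham_hits: int,
                     spam_trained: int, ham_trained: int) -> float:
    """Assumed variant: compare per-class frequencies instead of raw counts."""
    spam_freq = spam_hits / spam_trained
    ham_freq = ham_hits / ham_trained
    return spam_freq / (spam_freq + ham_freq)


# Token x appears in 100 of the 200 trained spam and in every trained ham.
for ham_trained in (100, 50):
    raw = raw_share(100, ham_trained)
    norm = normalised_share(100, ham_trained, 200, ham_trained)
    print(f"ham trained={ham_trained}: raw={raw:.2f} normalised={norm:.2f}")
```

Under the raw-count model, halving the trained ham moves the estimate from
one half to roughly two thirds. Under the assumed per-class normalisation it
stays the same, which would fit the earlier remark in this thread that the
filter tolerates imbalanced training.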


Regards

Thomas

- -- 
icq:133073900
http://www.t-arend.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFB4RLeHe2ZLU3NgHsRAjgSAKCHYwQWLMJExHdtrgb0OLXHHy00XwCeKIyw
Y7oZeRBZ22sOlpZFmc5Ln7M=
=i9Cw
-----END PGP SIGNATURE-----

Re: A very long spam

Posted by Fajar Priyanto <fa...@arinet.org>.
On Sunday 09 January 2005 04:47 am, Matt Kettler wrote:
> At 04:34 AM 1/9/2005 +0700, you wrote:
> >Hi all,
> >Greetings. I've just joined the list.
> >
> >I've been using sa-learn with SA 2.64 and 3.0.2
> >One thing is bugging me though. Is it safe to teach SA on a very long spam
> >such as the stock report spam? Will it cause many False Positive?
>
> Why would you think it would?
>
> By trying to avoid training that message you're poisoning your bayes
> database for false negatives.
>
> Train spam as spam, train ham as ham. Let the statistics deal with the
> overlap. By trying to avoid training "spamish" ham or "hamish" spam you're
> just doing your training a big disservice by making it unrealistic.

Thanks Matt,
So, statistically speaking, does it mean I have to train SA on as much 'ham' 
as 'spam'? Right now, I train SA mostly on spam.

-- 
Fajar Priyanto | Reg'd Linux User #327841 | http://linux2.arinet.org
04:53:49 up 11:22, Mandrakelinux release 10.1 (Official) for i586 
public key: https://www.arinet.org/fajar-pub.key

Re: A very long spam

Posted by Matt Kettler <mk...@comcast.net>.
At 04:34 AM 1/9/2005 +0700, you wrote:
>Hi all,
>Greetings. I've just joined the list.
>
>I've been using sa-learn with SA 2.64 and 3.0.2
>One thing is bugging me though. Is it safe to teach SA on a very long spam
>such as the stock report spam? Will it cause many False Positive?

Why would you think it would?

By trying to avoid training that message you're poisoning your bayes 
database for false negatives.

Train spam as spam, train ham as ham. Let the statistics deal with the 
overlap. By trying to avoid training "spamish" ham or "hamish" spam you're 
just doing your training a big disservice by making it unrealistic.