Posted to users@spamassassin.apache.org by sr...@abit.de on 2006/01/16 11:33:42 UTC

Bayes - how bad is a small ham corpus with a big spam corpus?

Hi list,

I'm currently trying to build up a new bayes DB here, since the autobuilt
DB fubared (as expected, no need to throw things at me ;)). It's rather
easy to build up the spam part, as we are getting more than enough of it,
yet it poses a problem to build up the ham part.
Much of our mail coming from affiliated companies or customers comes
directly via Lotus Notes replication, so nothing to feed there. Much of
the inbound smtp mail contains either private or confidential information,
so I cannot use it, as I keep the source of the bayes messages in a Notes
DB serverside - I'd run into privacy issues.

So much for where I'm coming from, but now the question is:

Will a small ham corpus - let's say we take the minimum of 200 for the
beginning - compared to a fast growing spam corpus (currently at around
2000 spam) be a problem and possibly lead to false bayes scoring?
I'm fairly certain that this possibility exists, of course - it's natural -
but the question is how badly it could influence the scoring and how high
the probability is (approximately)?

Any insights on this would be most appreciated.

regards
        sash

--------------------------------------------------
Sascha Runschke
Netzwerk Administration
IT-Services

ABIT AG
Robert-Bosch-Str. 1
40668 Meerbusch

Tel.:+49 (0) 2150.9153.226
Mobil:+49 (0) 173.5419665
mailto:SRunschke@abit.de

http://www.abit.net
http://www.abit-epos.net
---------------------------------
Sicherheitshinweis zur E-Mail Kommunikation /
  Security note regarding email communication:
http://www.abit.net/sicherheitshinweis.html

Re: Antwort: Re: Bayes - how bad is a small ham corpus with a big spam corpus?

Posted by Nix <ni...@esperi.org.uk>.
On Wed, 18 Jan 2006, srunschke@abit.de yowled:
> Robert Menschel <Ro...@Menschel.net> wrote on 17.01.2006 03:41:39:
>> Bigger problem: bayes can only learn what it's taught.  If you have
>> ham that really should be trained, and because of privacy issues it
>> should not be kept after training, then you really should develop a
>> system which will enable you to train without retaining.  Bayes works
>> best when properly and fully trained, not just trained on "those
>> unimportant non-private emails are ham".
> 
> Yes, I might forfeit the storage of ham mails in a Notes DB for that,
> BUT... I really doubt that the management would even give permission to
> send those messages into SA.

Well, if you don't train SA with them, you run the risk that Bayes will
misclassify them as spam.

Which is considered more important?

> When I say "confidential" it is really one of those few times where
> it means "confidential" ;) Our customers are mostly big banks, big
> insurance companies and the German government. Even the slightest
> risk of leaking _any_ kind of information could
> get us into problems no one even wants to imagine here...

Have a look at `sa-learn --dump' one of these days. It's changed. The
tokens are *hashes*, which means the only way an attacker who stole your
Bayes DB could learn what you'd been corresponding about would be to
guess words and try to look up their hashes, and even then they couldn't
tell what emails they'd been in.
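
A minimal sketch of that guessing attack, under an assumption: as far as I
recall, SA 3.x stores only the trailing 40 bits (5 bytes) of each token's
SHA-1 digest, but check that against your version before relying on it. The
function names and word lists below are made up for illustration.

```python
import hashlib


def token_hash(token: str) -> bytes:
    """Return the last 5 bytes of the token's SHA-1 digest.

    Assumption: this mirrors how SA 3.x stores Bayes tokens.
    """
    return hashlib.sha1(token.encode("utf-8")).digest()[-5:]


def guess_tokens(stored_hashes, wordlist):
    """Return the words from wordlist whose hash appears in the stolen DB."""
    return {word for word in wordlist if token_hash(word) in stored_hashes}


# Pretend this set of hashes was copied out of a stolen Bayes DB.
stolen = {token_hash(word) for word in ["mortgage", "invoice"]}

# The attacker can only confirm words they already thought of guessing.
hits = guess_tokens(stolen, ["invoice", "viagra", "mortgage", "hello"])
print(sorted(hits))
```

The hashes cannot be reversed directly, and nothing in them says which
message a token came from, so the attacker learns at most "this word occurred
somewhere in the trained mail".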

(Possibly more risky is that the message IDs are recorded in the
bayes_seen DB, which means that if someone stole that and sorted it by
frequency, they could determine the source systems of your
major correspondents. But if they could do that, they could probably
also spy on your email in flight...)

One final problem is that as I understand it Notes damages the headers
of mail in its DB :( but this could be wrong, as I have no actual
experience of Notes. You'll probably want to add any headers Notes adds
to bayes_ignore_header, at the very least.

-- 
`Logic and human nature don't seem to mix very well,
 unfortunately.' --- Velvet Wood

Antwort: Re: Bayes - how bad is a small ham corpus with a big spam corpus?

Posted by sr...@abit.de.
Robert Menschel <Ro...@Menschel.net> wrote on 17.01.2006 03:41:39:

> sad> I'm currently trying to build up a new bayes DB here, ...
> sad> ... yet it poses a problem to build up the ham part.
> sad> ... Much of the inbound smtp mail either contains private or
> sad> confidential information, so I cannot use them as I keep the
> sad> source of the bayes messages in a Notes DB serverside - I'd run
> sad> into privacy issues.
> 
> If you keep the source of your bayes messages in a Notes DB, then you
> should have had enough ham to retrain your bayes with, no?

Uhm, no? If you reread my message, you'll see that I used
autolearning before instead of manual training. I just ditched
the old bayes DB and disabled autolearning, and am now building up
a new bayes DB.

I'm keeping the full corpus of both ham and spam to have more
control over the bayes DB. Keeping the sources of it enables
me to always reproduce the DB and especially to remove selected
messages containing tokens that prove to be problematic in the
future. Of course I could do that with relearning wrongly tagged
messages as ham - but 1 message as ham usually doesn't make much
of a difference for bayes.
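
As a rough sketch of that workflow, assuming the corpus has been exported
from Notes into two plain directories (the paths and function names here are
made up), the rebuild boils down to three sa-learn calls, plus --forget for
pulling a single message back out:

```python
from typing import List


def rebuild_commands(spam_dir: str, ham_dir: str) -> List[List[str]]:
    """Build the sa-learn calls that wipe the Bayes DB and retrain it."""
    return [
        ["sa-learn", "--clear"],          # drop the existing Bayes DB
        ["sa-learn", "--spam", spam_dir],  # learn every message in spam_dir
        ["sa-learn", "--ham", ham_dir],    # learn every message in ham_dir
    ]


def forget_command(message_path: str) -> List[str]:
    """Build the sa-learn call that unlearns one previously trained message."""
    return ["sa-learn", "--forget", message_path]


# Hypothetical export locations; adjust to wherever the corpus really lives.
for argv in rebuild_commands("corpus/spam", "corpus/ham"):
    # To actually run a step: subprocess.run(argv, check=True)
    print(" ".join(argv))

print(" ".join(forget_command("corpus/spam/problematic.eml")))
```

The commands are only printed here, not executed, so the sketch can be tried
without touching a live DB.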

> Bigger problem: bayes can only learn what it's taught.  If you have
> ham that really should be trained, and because of privacy issues it
> should not be kept after training, then you really should develop a
> system which will enable you to train without retaining.  Bayes works
> best when properly and fully trained, not just trained on "those
> unimportant non-private emails are ham".

Yes, I might forfeit the storage of ham mails in a Notes DB for that,
BUT... I really doubt that the management would even give permission to
send those messages into SA.
When I say "confidential" it is really one of those few times where
it means "confidential" ;) Our customers are mostly big banks, big
insurance companies and the German government. Even the slightest
risk of leaking _any_ kind of information could
get us into problems no one even wants to imagine here...

> I can't make recommendations on how to do so in your system, but
> you'll get better results from bayes if you figure out how to manage
> it.

That's natural. I just wanted to know how bad it will come at me ;)

regards
        sash


Re: Bayes - how bad is a small ham corpus with a big spam corpus?

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello srunschke,

Monday, January 16, 2006, 2:33:42 AM, you wrote:

sad> I'm currently trying to build up a new bayes DB here, ...
sad> ... yet it poses a problem to build up the ham part.
sad> ... Much of the inbound smtp mail either contains private or
sad> confidential information, so I cannot use them as I keep the
sad> source of the bayes messages in a Notes DB serverside - I'd run
sad> into privacy issues.

If you keep the source of your bayes messages in a Notes DB, then you
should have had enough ham to retrain your bayes with, no?

Bigger problem: bayes can only learn what it's taught.  If you have
ham that really should be trained, and because of privacy issues it
should not be kept after training, then you really should develop a
system which will enable you to train without retaining.  Bayes works
best when properly and fully trained, not just trained on "those
unimportant non-private emails are ham".

I can't make recommendations on how to do so in your system, but
you'll get better results from bayes if you figure out how to manage
it.

Bob Menschel




Re: Bayes - how bad is a small ham corpus with a big spam corpus?

Posted by Matt Kettler <mk...@evi-inc.com>.
srunschke@abit.de wrote:
> Hi list,
> 
> I'm currently trying to build up a new bayes DB here, since the autobuilt
> DB fubared (as expected, no need to throw things at me ;)). It's rather
> easy to build up the spam part, as we are getting right enough of it,
> yet it poses a problem to build up the ham part.

Generally, no problem. SA deals pretty well with wild imbalances in training.
I'm currently running with a 9:1 spam:ham training ratio. In the past I've had
as bad as 20:1 with no ill effects on scoring.

I'd try to get as close to 1:1 as you can, but don't kill yourself to get there.
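
To see why the imbalance matters less than you might expect, here is a
simplified Graham/Robinson-style per-token estimate. It is not SA's actual
code (SA does more, such as smoothing and combining many tokens), and the
function names and counts are invented for illustration. The point is that
token counts are divided by the size of each corpus before being compared:

```python
def token_spam_prob(spam_hits: int, ham_hits: int,
                    n_spam: int, n_ham: int) -> float:
    """Per-token spam probability using rates normalised by corpus size."""
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0.0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


def naive_prob(spam_hits: int, ham_hits: int) -> float:
    """What you would get from raw counts, ignoring corpus sizes."""
    return spam_hits / (spam_hits + ham_hits)


# A token seen in 5% of 2000 spam and 5% of 200 ham (a 10:1 corpus).
normalised = token_spam_prob(100, 10, 2000, 200)
raw = naive_prob(100, 10)
print(normalised, raw)
```

With normalisation the token comes out neutral at 0.5, while the raw counts
would wrongly push it to about 0.91. What a small ham corpus does cost you is
resolution: with 200 ham, a single message moves a token's ham rate by 0.5%,
so the ham side is noisier and covers less of your real mail.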

If your training is small I would at least try to make sure you cover as broad a
range of your ham mail as possible. If all your ham training only reflects the
typical content of a small portion of your ham mail, you could have some problems.