Posted to users@spamassassin.apache.org by Arthur Dent <sa...@troodos.demon.co.uk> on 2008/02/12 14:55:37 UTC

Bayes - Balance of spam

Hello All,

Please forgive my ignorance, but I don't fully understand just how Bayes
works.

I dutifully feed all the spam (and ham) I get into sa-learn and generally
Bayes works pretty well.

I am a little concerned however that at this moment I seem to be getting
bombarded with Russian spam. It currently outweighs all other spam by
about 100:1.

My worry is that Bayes will eventually come to believe that only Russian
spam is *really* spam as it will, if the current trend continues,
overwhelm the other spam in the Bayes DB.

Am I worrying unnecessarily, or should I make efforts to "balance" the
spam I am feeding to bayes?

Thanks in advance

Mark



Re: Bayes - Balance of spam

Posted by Arthur Dent <sa...@troodos.demon.co.uk>.
On Tue, Feb 12, 2008 at 09:44:58AM -0500, Rob McEwen wrote:
> On 12.02.08 13:55, Arthur Dent wrote:
>> I am a little concerned however that at this moment I seem to be getting
>> bombarded with Russian spam. It currently outweighs all other spam by
>> about 100:1
> Arthur,
>
> First, make sure that you are blocking all mail sent to unknown users 
> (i.e., turn catch-all off). Second, there are other techniques to catch the 
> balance besides bayes. For example, there might be some RBLs (and URI 
> blacklists) that you aren't using which may be helpful. Not all of the good 
> ones are included in the default setup for SA.
>
Hi Rob,

I should have made it clearer in my original post that there is no issue with
*catching* these spams (I get them all, either with SA or with procmail), it's
just that I was concerned that the huge disparity between the volume of these particular
spams and that of the regular stuff would "poison" my carefully nurtured Bayes db.

I think I'm going to /dev/null them anyway, which means the problem will go
away. My only concern is whether Bayes will still catch them if they start
arriving in a *slightly* different format that my procmail recipe misses.

Have to try it and see I guess...


Thanks for your input...

Mark


Re: Bayes - Balance of spam

Posted by Rob McEwen <ro...@invaluement.com>.
On 12.02.08 13:55, Arthur Dent wrote:
> I am a little concerned however that at this moment I seem to be getting
> bombarded with Russian spam. It currently outweighs all other spam by
> about 100:1
Arthur,

First, make sure that you are blocking all mail sent to unknown users 
(i.e., turn catch-all off). Second, there are other techniques to catch 
the balance besides bayes. For example, there might be some RBLs (and 
URI blacklists) that you aren't using which may be helpful. Not all of 
the good ones are included in the default setup for SA.

Rob McEwen


Re: Bayes - Balance of spam

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
On 12.02.08 13:55, Arthur Dent wrote:
> Please forgive my ignorance, but I don't fully understand just how Bayes
> works.
> 
> I dutifully feed all the spam (and ham) I get into sa-learn and generally
> Bayes works pretty well.
> 
> I am a little concerned however that at this moment I seem to be getting
> bombarded with Russian spam. It currently outweighs all other spam by
> about 100:1
> 
> My worry is that Bayes will eventually come to believe that only Russian
> spam is *really* spam as it will, if the current trend continues,
> overwhelm the other spam in the Bayes DB.
> 
> Am I worrying unnecessarily, or should I make efforts to "balance" the
> spam I am feeding to bayes?

Not needed. You may want to increase your bayes DB size, however.
Remember to feed ham too, and especially mail whose BAYES score is
away from 0 or 1 (messages that hit BAYES_05 to BAYES_95).
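
As a rough illustration of picking out those mid-range messages, here is a minimal Python sketch. It assumes the mail has already passed through SpamAssassin and carries an X-Spam-Status header with a "tests=" list, and that the stock rule names (BAYES_05, BAYES_20, ... BAYES_95) are in use. The function names are made up for this example and are not part of SpamAssassin.

```python
import re
from email import message_from_string

# Bayes rule names between the two extremes (BAYES_00 and BAYES_99).
# Assumption: the stock SpamAssassin rule set is in use.
MID_RANGE = {
    "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
    "BAYES_60", "BAYES_80", "BAYES_95",
}


def bayes_rule_hit(raw_message):
    """Return the BAYES_nn rule named in X-Spam-Status, or None if absent."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = re.search(r"\bBAYES_\d+\b", status)
    return match.group(0) if match else None


def worth_training(raw_message):
    """True when Bayes was unsure, i.e. the message hit a mid-range rule."""
    return bayes_rule_hit(raw_message) in MID_RANGE


sample = (
    "Subject: hello\n"
    "X-Spam-Status: No, score=1.2 required=5.0 "
    "tests=BAYES_50,HTML_MESSAGE autolearn=no version=3.2.4\n"
    "\n"
    "body text\n"
)
print(bayes_rule_hit(sample), worth_training(sample))
```

Messages selected this way still need to be sorted by hand into spam and ham before they go to sa-learn; the sketch only finds the ones Bayes was unsure about.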

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.

Re: Bayes - Balance of spam

Posted by Arthur Dent <sa...@troodos.demon.co.uk>.
On Tue, Feb 12, 2008 at 09:33:47AM -0500, Matt Kettler wrote:
> Arthur Dent wrote:
>> Hello All,
>>
>> Please forgive my ignorance, but I don't fully understand just how Bayes
>> works.
>>
>> I dutifully feed all the spam (and ham) I get into sa-learn and generally
>> Bayes works pretty well.
>>
>> I am a little concerned however that at this moment I seem to be getting
>> bombarded with Russian spam. It currently outweighs all other spam by
>> about 100:1
>>
>> My worry is that Bayes will eventually come to believe that only Russian
>> spam is *really* spam as it will, if the current trend continues,
>> overwhelm the other spam in the Bayes DB.
>>   
> That won't happen unless there's such a massive flood of unique tokens 
> (words) that they flush all other tokens out of your bayes DB.

That was kind of what I was worrying about. I didn't know that they had to be
unique tokens. I just thought that the sheer volume of this Russian spam would
"push out" all the other stuff.

>
> Bombardment or not, it's highly unlikely that 100,000 unique Russian words 
> are going to enter your bayes database, which is what it would take with 
> the default bayes_expiry_max_db_size .
>
> Odds are, this bombardment is mostly the same 1000 or so words over and 
> over again. All that's going to do is raise the spam count on those tokens, 
> which won't have any impact at all on other spam email.

Thanks for clarifying that. That's very reassuring.


> Really, a more realistic risk is that SA may learn that all Russian 
> language email is spam, unless you actually get some Russian language 
> nonspam (i.e., bayes will contain very few Russian words, but the ones it 
> does contain will have strong spam scores). If you don't speak Russian, 
> that's probably not a significant problem...

Da...


>> Am I worrying unnecessarily, or should I make efforts to "balance" the
>> spam I am feeding to bayes?
>>   
> I would generally advise against trying to "balance" bayes. My own 
> philosophy is this is more likely to lead to self-poisoning than any 
> realistic benefit. I'd only try to "balance" in ways that actually make 
> things closer to your real spam feed.

A very helpful response. I really appreciate your input.

What I should have made clearer in my original mail is that almost all (in
fact I think *all*) this type of spam is being caught. Much of it even before
it gets into SA (with a procmail recipe).

My concern stems from the fact that I am thinking of /dev/null[ing] these
spams now that I am happy that there are no FPs from the procmail recipe. Would
the sudden absence of Russian spam leave the normal spam less heavily
weighted? I guess from what you said that it shouldn't be a problem. In any
case, Bayes will put itself right over time (won't it?).

Thanks again for your help, and to all the others who have commented on this
thread both on and off list...

Mark


Re: Bayes - Balance of spam

Posted by Matt Kettler <mk...@verizon.net>.
Arthur Dent wrote:
> Hello All,
>
> Please forgive my ignorance, but I don't fully understand just how Bayes
> works.
>
> I dutifully feed all the spam (and ham) I get into sa-learn and generally
> Bayes works pretty well.
>
> I am a little concerned however that at this moment I seem to be getting
> bombarded with Russian spam. It currently outweighs all other spam by
> about 100:1
>
> My worry is that Bayes will eventually come to believe that only Russian
> spam is *really* spam as it will, if the current trend continues,
> overwhelm the other spam in the Bayes DB.
>   
That won't happen unless there's such a massive flood of unique tokens 
(words) that they flush all other tokens out of your bayes DB.

Bombardment or not, it's highly unlikely that 100,000 unique Russian 
words are going to enter your bayes database, which is what it would 
take with the default bayes_expiry_max_db_size.

Odds are, this bombardment is mostly the same 1000 or so words over and 
over again. All that's going to do is raise the spam count on those 
tokens, which won't have any impact at all on other spam email.
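
To make the point about unique tokens concrete, here is a toy Python sketch. It is not SpamAssassin's tokenizer or storage format; the class and its whitespace tokenizer are invented for illustration. It models only the per-token spam counts and the number of distinct tokens (the thing expiry cares about), and it ignores how counts are later turned into probabilities.

```python
from collections import Counter


def tokenize(text):
    """Crude stand-in for a real tokenizer: lowercase, split on whitespace."""
    return set(text.lower().split())


class ToyTokenStore:
    """Hypothetical token store: maps each token to its spam message count."""

    def __init__(self):
        self.spam_counts = Counter()

    def learn_spam(self, text):
        # Each message adds at most 1 to every token it contains.
        for tok in tokenize(text):
            self.spam_counts[tok] += 1

    def unique_tokens(self):
        return len(self.spam_counts)


store = ToyTokenStore()
store.learn_spam("cheap pills online")
before = store.unique_tokens()

# A flood of 1000 near-identical messages that reuse the same few words.
for _ in range(1000):
    store.learn_spam("kupit deshevo seychas")

after = store.unique_tokens()
print(before, after, store.spam_counts["kupit"], store.spam_counts["cheap"])
```

In this toy model the flood adds only three new tokens, however many messages arrive. Their counts climb to 1000 while the count for "cheap" stays at 1, so the flood puts almost no pressure on the size of the store.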

Really, a more realistic risk is that SA may learn that all Russian 
language email is spam, unless you actually get some Russian language 
nonspam (i.e., bayes will contain very few Russian words, but the ones it 
does contain will have strong spam scores). If you don't speak Russian, 
that's probably not a significant problem...


> Am I worrying unnecessarily, or should I make efforts to "balance" the
> spam I am feeding to bayes?
>   
I would generally advise against trying to "balance" bayes. My own 
philosophy is this is more likely to lead to self-poisoning than any 
realistic benefit. I'd only try to "balance" in ways that actually make 
things closer to your real spam feed.