Posted to users@spamassassin.apache.org by Alex Woick <al...@wombaz.de> on 2018/11/26 14:58:08 UTC

Bayes filter with phrases

In the last few weeks I have tried to create custom rules for several kinds
of spam that were not caught (mostly German), and it's always the same:
- identify catchy phrases that (hopefully) only appear in that kind of spam
- make indirect rules for the catchy phrases
- make meta rules that combine a certain number of catchy phrases
- guess some score for the meta rule and hope it is appropriate and will 
only push the total score just over 5, nothing more

In the end, I'm doing nothing different from a Bayes filter, only with 
phrases of 2-4 words instead of single words, and rating everything manually 
instead of using the Bayes math based on mail analysis.
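
To make the comparison concrete, here is a minimal sketch of what such a phrase-based filter could look like. It is a plain naive Bayes classifier with add-one smoothing over word n-grams of 2-4 words, not SpamAssassin's Bayes implementation. The names phrase_tokens and PhraseBayes are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def phrase_tokens(text, min_n=2, max_n=4):
    """Return the set of word n-grams (min_n..max_n words) found in text."""
    words = re.findall(r"\w+", text.lower())
    tokens = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.add(" ".join(words[i:i + n]))
    return tokens


class PhraseBayes:
    """Illustrative naive Bayes over phrases (hypothetical, not SpamAssassin's)."""

    def __init__(self):
        # Number of messages per class in which each phrase occurred.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.messages[label] += 1
        self.counts[label].update(phrase_tokens(text))

    def spam_log_odds(self, text):
        """Positive values lean towards spam, negative values towards ham."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))  # smoothed prior
        for tok in phrase_tokens(text):
            # Add-one smoothing keeps unseen phrases from producing log(0).
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return log_odds
```

The per-phrase weights that I currently guess by hand would here come from the training counts. Note that this sketch counts every overlapping phrase separately, so a single strong phrase contributes several times.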

I cannot believe nobody did some Bayes variant in the past that 
identifies and feeds phrases of 2-4 words into the database instead of 
only single words. But why isn't this implemented yet?
What was the outcome of such an experiment? Does mathematical or empirical 
proof exist that this isn't more effective than single words?
My intuition (or guess) is that processing phrases would be vastly 
superior to processing single words, because as a spammer, you 
have to convince your victim to click on that link, and there are only 
so many phrases to do this. Far fewer than single words. But since it 
isn't implemented, I assume there are arguments that invalidate the 
approach. Which ones?

I understand that building a Bayes engine capable of handling phrases 
has to be somewhat more complex. It has to handle overlapping phrases 
and prevent overlapping phrases from scoring more than once for any given 
message. Would the database size of such an engine grow without bounds? 
Would processing time get too high?
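
Both concerns can be sketched in a few lines. The first function counts how many phrase tokens a message of a given length produces, which hints at the database growth. The second is one possible way (an assumption, a simple greedy choice) to let each word position contribute to at most one phrase. The function names and the weights dictionary are hypothetical.

```python
def tokens_per_message(num_words, min_n=2, max_n=4):
    """Number of n-gram positions (min_n..max_n words) in a message."""
    return sum(max(num_words - n + 1, 0) for n in range(min_n, max_n + 1))


def ngram_spans(words, min_n=2, max_n=4):
    """Yield (start, end, phrase) for every n-gram of min_n..max_n words."""
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            yield i, i + n, " ".join(words[i:i + n])


def select_non_overlapping(words, weights, min_n=2, max_n=4):
    """Greedily keep the strongest known phrases whose word positions
    do not overlap, so no word is scored more than once.

    weights maps a phrase to its (hypothetical) evidence value,
    for example a log-odds score learned from training mail.
    """
    candidates = [
        (start, end, phrase)
        for start, end, phrase in ngram_spans(words, min_n, max_n)
        if phrase in weights
    ]
    # Strongest evidence first, regardless of sign.
    candidates.sort(key=lambda c: abs(weights[c[2]]), reverse=True)
    used = set()
    chosen = []
    for start, end, phrase in candidates:
        positions = range(start, end)
        if used.isdisjoint(positions):
            used.update(positions)
            chosen.append(phrase)
    return chosen
```

A message of N words (N >= 4) yields 3N - 6 phrase positions, against N single-word positions. The number of distinct phrases across a mail corpus is likely to grow much faster than the number of distinct words, since phrases repeat less often, so some form of token expiry would presumably be needed.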

Alex

Re: Bayes filter with phrases

Posted by RW <rw...@googlemail.com>.
On Mon, 26 Nov 2018 15:58:08 +0100
Alex Woick wrote:


> I cannot believe nobody did some Bayes variant in the past that 
> identifies and feeds phrases of 2-4 words into the database instead
> of only single words. 

If you want that, you can use a suitably configured third-party filter
such as Bogofilter or DSPAM, and score its result into SpamAssassin.