Posted to users@spamassassin.apache.org by Alex Woick <al...@wombaz.de> on 2018/11/26 14:58:08 UTC
Bayes filter with phrases
Over the last few weeks I have tried to create custom rules for several
kinds of spam that were not caught (mostly German), and it's always the
same:
- identify catchy phrases that (hopefully) only appear in that kind of spam
- make indirect rules for the catchy phrases
- make meta rules that combine a certain number of catchy phrases
- guess a score for the meta rule and hope it is appropriate, pushing
the total score just over 5 and no further
In the end, I'm doing nothing different from a Bayes filter, only with
phrases of 2-4 words instead of single words, and rating everything
manually instead of using the Bayes math based on mail analysis.
I cannot believe nobody did some Bayes variant in the past that
identifies and feeds phrases of 2-4 words into the database instead of
only single words. But why isn't this implemented yet?
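As a rough illustration of what "feeding phrases of 2-4 words into the
database" could mean, here is a minimal Python sketch. It is not how
SpamAssassin's Bayes tokenizer works; the function name phrase_tokens
and the simple word-splitting regex are assumptions made for this
example only.

```python
import re
from collections import Counter


def phrase_tokens(text, min_n=2, max_n=4):
    """Return every run of min_n..max_n consecutive words as one token.

    Hypothetical helper: words are lowercased and split on non-word
    characters, which is far cruder than a real mail tokenizer.
    """
    words = re.findall(r"\w+", text.lower())
    tokens = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


sample = "click here to claim your prize"
tokens = phrase_tokens(sample)
# Per-token counts are what a training step would add to the database.
counts = Counter(tokens)
```

For this six-word sample the sketch yields 12 phrase tokens where a
single-word tokenizer would yield 6, which hints at why the database
size question below matters.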
What was the outcome of such an experiment? Does mathematical or
empirical proof exist that this isn't more effective than single words?
My intuition (or guess) is that processing phrases would be vastly
superior to processing single words, because as a spammer, you
have to convince your victim to click on that link, and there are only
so many phrases that do this. Far fewer than single words. But since it
isn't implemented, I assume there are arguments that invalidate the
approach. Which ones?
I understand that building a Bayes engine capable of handling phrases
has to be somewhat more complex. It has to handle overlapping phrases
and prevent them from scoring more than once for any given
message. Would the database size of such an engine grow without bounds?
Would the processing time get too high?
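One conceivable way to keep overlapping phrases from scoring twice is
sketched below in Python. It greedily keeps the most decisive known
phrases whose word positions do not overlap, then combines them with
the plain naive-Bayes product formula. This is an assumption-laden
sketch, not what SpamAssassin or any particular filter does; the
function names and the probabilities in phrase_prob are made up for
illustration.

```python
def select_non_overlapping(words, phrase_prob, min_n=2, max_n=4):
    """Pick known phrases so that each word position is used at most once.

    phrase_prob maps a phrase to a (hypothetical) spam probability.
    Phrases furthest from 0.5 are considered first.
    """
    candidates = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in phrase_prob:
                strength = abs(phrase_prob[phrase] - 0.5)
                candidates.append((strength, i, i + n, phrase))
    # Most decisive first; ties broken by earliest start position.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    used = set()
    chosen = []
    for _strength, start, end, phrase in candidates:
        span = set(range(start, end))
        if used.isdisjoint(span):
            used |= span
            chosen.append(phrase)
    return chosen


def combine(probs):
    """Naive-Bayes combination of independent per-token probabilities."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Made-up probabilities, purely for illustration.
phrase_prob = {
    "click here": 0.90,
    "here to claim": 0.80,
    "claim your prize": 0.95,
}
words = "click here to claim your prize".split()
chosen = select_non_overlapping(words, phrase_prob)
score = combine([phrase_prob[p] for p in chosen])
```

In this example "here to claim" shares words with both other phrases,
so it is dropped and only the two stronger phrases contribute to the
combined score.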
Alex
Re: Bayes filter with phrases
Posted by RW <rw...@googlemail.com>.
On Mon, 26 Nov 2018 15:58:08 +0100
Alex Woick wrote:
> I cannot believe nobody did some Bayes variant in the past that
> identifies and feeds phrases of 2-4 words into the database instead
> of only single words.
If you want that, you can use a suitably configured third-party filter
such as Bogofilter or DSPAM, and score its result into SpamAssassin.