Posted to users@spamassassin.apache.org by Marc Perkel <su...@junkemailfilter.com> on 2011/10/10 15:55:16 UTC

Re: New Bayes like paradigm


On 9/28/2011 8:02 AM, darxus@chaosreigns.com wrote:
> On 09/28, Marc Perkel wrote:
>> You would only have to test the rule combinations that the message
>> actually triggered. So if it hit 10 rules then it would be 1024
>> combinations. Seems not to be unreasonable to me.
> You definitely have a good point that it would only be necessary to track
> the combinations that actually show up in emails, however 1024 is only
> the possible combinations from one set of 10 rules.  The number of
> combinations in the actual corpora would be much higher.  I'll try to
> get you a number.

You wouldn't have to store all combinations. You could limit it to
combinations of up to 3 rules, keep only the ones that actually occur,
and use a hash to look them up.
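
A minimal sketch of that idea, reading "up to 3 levels" as combinations of
at most three rules. A message that hits 10 rules then contributes
10 + 45 + 120 = 175 combinations rather than all 1024 subsets. The input
format (one message per line, a comma-separated list of the rules it hit)
and the function name count_combos are assumptions made for illustration,
not mass-check's own log format. The awk array plays the role of the hash:

```shell
# Sketch only. Reads one message per line on stdin, each line being a
# comma-separated list of rule names (an assumed format), and prints
# "count<TAB>combination" for every combination of 1 to 3 rules seen.
count_combos() {
    awk -F, '
    {
        # Insertion sort, so the same set of rules always gives the same key.
        for (i = 1; i <= NF; i++) r[i] = $i
        for (i = 2; i <= NF; i++) {
            v = r[i]
            for (j = i - 1; j >= 1 && (r[j] "") > (v ""); j--) r[j + 1] = r[j]
            r[j + 1] = v
        }
        # Enumerate singles, pairs and triples; count[] is the hash table.
        for (a = 1; a <= NF; a++) {
            count[r[a]]++
            for (b = a + 1; b <= NF; b++) {
                count[r[a] " && " r[b]]++
                for (c = b + 1; c <= NF; c++)
                    count[r[a] " && " r[b] " && " r[c]]++
            }
        }
    }
    END { for (k in count) print count[k] "\t" k }
    '
}
```

With a hypothetical file hits.txt in the assumed format, something like

    count_combos < hits.txt | sort -rn | head

would list the most frequent combinations. Only combinations that occur in
the input ever get an entry, so the table grows with the corpus and not
with the number of possible rule sets.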

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400


Re: New Bayes like paradigm

Posted by Marc Perkel <su...@junkemailfilter.com>.

On 10/10/2011 9:16 AM, darxus@chaosreigns.com wrote:
> On 10/10, Marc Perkel wrote:
>> On 9/28/2011 8:02 AM, darxus@chaosreigns.com wrote:
>>> On 09/28, Marc Perkel wrote:
>>>> You would only have to test the rule combinations that the message
>>>> actually triggered. So if it hit 10 rules then it would be 1024
>>>> combinations. Seems not to be unreasonable to me.
>>> You definitely have a good point that it would only be necessary to track
>>> the combinations that actually show up in emails, however 1024 is only
>>> the possible combinations from one set of 10 rules.  The number of
>>> combinations in the actual corpora would be much higher.  I'll try to
>>> get you a number.
>> You wouldn't have to store all combinations. You could just do up to
>> 3 levels and only the combinations that actually occur and use a
>> hash to look up the combinations.
> I never said storage would be a problem.  I agree you could just store a
> relatively small number that were most useful.
>
> The problems are:
> 1) The many years it would take to find useful rule combinations by trying
>     one possibility per masscheck run.
> 2) The hundreds of times as much (masscheck) data we'd need to get an
>     accurate re-score using all rule combinations existing in the corpora.
>
> There is still the possibility of doing an analysis of what combinations of
> rules hit false-negatives significantly more often than they hit non-spam.
> (Or false-positives vs. spam.)

It seems to me that there should be some automated way to find
useful rule combinations.


-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400


Re: New Bayes like paradigm

Posted by da...@chaosreigns.com.
On 10/10, Marc Perkel wrote:
> On 9/28/2011 8:02 AM, darxus@chaosreigns.com wrote:
> >On 09/28, Marc Perkel wrote:
> >>You would only have to test the rule combinations that the message
> >>actually triggered. So if it hit 10 rules then it would be 1024
> >>combinations. Seems not to be unreasonable to me.
> >You definitely have a good point that it would only be necessary to track
> >the combinations that actually show up in emails, however 1024 is only
> >the possible combinations from one set of 10 rules.  The number of
> >combinations in the actual corpora would be much higher.  I'll try to
> >get you a number.
> 
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a
> hash to look up the combinations.

I never said storage would be a problem.  I agree you could just store a
relatively small number that were most useful.

The problems are:
1) The many years it would take to find useful rule combinations by trying
   one possibility per masscheck run.
2) The hundreds of times as much (masscheck) data we'd need to get an
   accurate re-score using all rule combinations existing in the corpora.

There is still the possibility of doing an analysis of what combinations of
rules hit false-negatives significantly more often than they hit non-spam.
(Or false-positives vs. spam.)

-- 
Immorality: "The morality of those who are having a better time"
- Henry Louis Mencken
http://www.ChaosReigns.com

Re: New Bayes like paradigm

Posted by da...@chaosreigns.com.
On 10/13, Adam Katz wrote:
> PS:  As an SA Committer, do I have access to those logs?

Don't think so, but you can just ask for a regular masscheck account if you
don't already have one, and with that account do:

rsync --exclude '*~' -vaz "rsync.spamassassin.org::corpus" ./

-- 
"I'd rather be happy than right any day."
- Slartibartfast, The Hitchhiker's Guide to the Galaxy
http://www.ChaosReigns.com

Re: New Bayes like paradigm

Posted by Adam Katz <an...@khopis.com>.
> On 9/28/2011 8:02 AM, darxus@chaosreigns.com wrote:
>> You definitely have a good point that it would only be necessary to
>> track the combinations that actually show up in emails, however
>> 1024 is only the possible combinations from one set of 10 rules.
>> The number of combinations in the actual corpora would be much
>> higher.  I'll try to get you a number.

On 10/10/2011 06:55 AM, Marc Perkel wrote:
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a hash
> to look up the combinations.

The data is all there if you have access to the spam.log and ham.log
files created by mass-check (warning, this code was composed in email,
not vim, and it has not been run):

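#############################
#!/bin/sh
# Give three rules as arguments.  Assumes ham.log and spam.log in PWD

if [ $# -ne 3 ]; then
    echo "usage: $0 RULE1 RULE2 RULE3" >&2
    exit 1
fi

# Messages hit by all three rules (-F: literal match, -w: whole words only)
tp=$(grep -Fw -- "$1" spam.log | grep -Fw -- "$2" | grep -Fwc -- "$3")
fp=$(grep -Fw -- "$1"  ham.log | grep -Fw -- "$2" | grep -Fwc -- "$3")

# Total messages per log; lines starting with '#' are comments
spams=$(grep -c '^[^#]' spam.log)
hams=$(grep -c '^[^#]' ham.log)

if [ "$spams" -eq 0 ] || [ "$hams" -eq 0 ]; then
    echo "spam.log or ham.log contains no messages" >&2
    exit 1
fi

# Hit rates as percentages of each corpus
tpr=$(echo "scale=5; $tp * 100 / $spams" | bc)
fpr=$(echo "scale=5; $fp * 100 / $hams" | bc)

# S/O is undefined when the combination never hits, so avoid dividing by zero
if [ "$tp" -eq 0 ] && [ "$fp" -eq 0 ]; then
    so="n/a"
else
    so=$(echo "scale=4; $tpr / ($tpr + $fpr)" | bc)
fi

echo "meta rule  $1 && $2 && $3"
echo "  SPAM% $tpr   HAM% $fpr   S/O $so"
#############################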

Now you can pick your thresholds for moving forward (and your thresholds
for saving a combination as a no-go in the future).  These numbers are
just as valid as anything you'd get through the actual mass-check run.
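
One way to apply such thresholds, as a rough sketch: the function name
classify_combo and the default cut-offs (0.5 for SPAM%, 0.95 and 0.60 for
S/O) are arbitrary placeholders chosen for illustration, not values used by
the SpamAssassin project. It takes the SPAM% and S/O figures printed by the
script above:

```shell
# Sketch only. Arguments: SPAM% and S/O for one rule combination.
# Prints "keep", "no-go" or "undecided". The cut-offs are placeholders
# and can be overridden through MIN_SPAM, MIN_SO and NOGO_SO.
classify_combo() {
    awk -v tpr="$1" -v so="$2" \
        -v min_spam="${MIN_SPAM:-0.5}" \
        -v min_so="${MIN_SO:-0.95}" \
        -v nogo_so="${NOGO_SO:-0.60}" '
    BEGIN {
        # Adding 0 forces numeric comparison, including bc output like ".5".
        if (tpr + 0 >= min_spam + 0 && so + 0 >= min_so + 0)
            print "keep"        # hits enough spam and is reliable enough
        else if (so + 0 < nogo_so + 0)
            print "no-go"       # remember it so it is not tested again
        else
            print "undecided"
    }'
}
```

For example, "classify_combo 2.5 0.99" prints "keep" with the default
cut-offs, while "classify_combo 2.5 0.40" prints "no-go".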

Still, I worry about what this does to the GA.


PS:  As an SA Committer, do I have access to those logs?