Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/01/19 02:13:03 UTC

[Bug 4089] New: Micro Bayesian Filters - Phone Numbers

http://bugzilla.spamassassin.org/show_bug.cgi?id=4089

           Summary: Micro Bayesian Filters - Phone Numbers
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Learner
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: marc@perkel.com


OK - still trying to get you all to realize and comprehend the idea of having
more than one bayesian filter.

Picture this .....

You extract phone numbers from messages and run a bayesian filter on JUST PHONE
NUMBERS. It's the ultimate white rule!

A lot of people who email put their phone number in their sig line. Once the
filter learns all your friends' and associates' phone numbers, then every time you
get a message with one of those phone numbers in it - it's a "this is not spam" flag.

And - it might catch a little bit of spam from spammers who use phone numbers.
But this is mostly a white rule because spammers don't know the phone numbers of
your friends.
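
A minimal sketch of the idea above, not the code actually being tested: it
pulls phone numbers out of a message and keeps per-number ham/spam counts, so a
number seen mostly in ham pulls the score toward "not spam". The names
PHONE_RE, phone_tokens and MicroBayes are hypothetical, the pattern only covers
North American style numbers, and the add-one smoothing is an assumption.

```python
import math
import re
from collections import Counter

# Assumed pattern: optional "1" prefix, then 3-3-4 digits with common separators.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def phone_tokens(text):
    """Return the set of phone numbers in text, normalized to 10 digits."""
    return {"".join(m.groups()) for m in PHONE_RE.finditer(text)}


class MicroBayes:
    """Tiny naive Bayes filter whose only tokens are phone numbers."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()
        self.n_ham = 0
        self.n_spam = 0

    def learn(self, text, is_spam):
        tokens = phone_tokens(text)
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text):
        tokens = phone_tokens(text)
        if not tokens:
            return 0.5  # no phone number in the message: no opinion
        log_odds = 0.0
        for token in tokens:
            # Add-one smoothing so unseen numbers do not divide by zero.
            p_spam = (self.spam[token] + 1) / (self.n_spam + 2)
            p_ham = (self.ham[token] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

After a few learn() calls on mail from a friend whose sig carries a number,
spam_probability() for a new message with that number drops below 0.5, which is
the "white rule" behaviour described above.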

So - if you grasp the concept of how this might be really accurate then think
about adding email addresses within a message - or links? But - nothing else. 

I'm testing this concept now and it's working EXTREMELY well. So - this is the
next breakthrough in spam (or ham) detection. And there's nothing like some
really good white rules to make up for the errors in the other rules.

What I'm testing now is using only headers (enhanced information headers) and
from the body only phone numbers, all links (including links to graphics) and
email addresses. 

Marc Perkel
http://www.junkemailfilter.com



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4089] Micro Bayesian Filters - Phone Numbers

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4089





------- Additional Comments From marc@perkel.com  2005-01-22 21:53 -------
Subject: Re:  Micro Bayesian Filters - Phone Numbers



bugzilla-daemon@bugzilla.spamassassin.org wrote:

>http://bugzilla.spamassassin.org/show_bug.cgi?id=4089
>
>From the first statement I would tend to draw a different conclusion than
>you did.
>Is it that multiple filters are better than one, or is it that a filter that
>drops extraneous stuff from its db is better than one that doesn't?  Your
>first statement leads to the latter conclusion.
>
That might end up being true. I don't know whether two filters are better 
than one when one looks at everything and the other looks at 
only the hottest parts. I also wonder if two filters looking at 
different parts might work better.

>What happens if you disable Bayes in SA completely and just use your second
>filter?  It sounds like this should improve overall results, since you won't
>have the occasional SA Bayes error biasing the score in the wrong direction.
>  
>
It might - I have yet to try that.

>What I'm getting at here is that maybe the solution isn't multiple filters
>(although I see nothing inherently wrong with that idea), but maybe the
>solution is to simply prune the extraneous junk from the input to the main
>SA Bayes filter so that it works like your add-on filter.
>  
>
I think that multiple filters are the solution - one filter to look at 
the message, and another filter to look at only the rules triggered, 
for automatic scoring.

>You clearly have a script or some such that is able to trim a message down
>for input to the second filter.  How hard would it be to add that filtering
>as an option in SA to feed the main Bayes routines?
>
>
>  
>
Interestingly enough, I'm not much of a programmer. I kind of gather code 
and make it work. But - the sharp people here could do it.






[Bug 4089] Micro Bayesian Filters - Phone Numbers

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4089


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE




------- Additional Comments From jm@jmason.org  2005-01-24 10:25 -------
Virtually the same discussion is happening in 2 bugs at once. Marking as a
DUP.

*** This bug has been marked as a duplicate of 4095 ***




[Bug 4089] Micro Bayesian Filters - Phone Numbers

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4089





------- Additional Comments From marc@perkel.com  2005-01-22 09:41 -------
This idea for a second bayesian filter is working really well. It is returning
better accuracy than the bayesian filter that SpamAssassin comes with.

As I posted originally - this second filter (using spamprobe) is not fed the
entire message. It is only fed the headers (enhanced) and all http references,
email addresses, and phone numbers within the body. And the results are
extremely impressive.
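
A rough sketch of that kind of trimming, written for illustration only and not
the script in use here: it keeps every header plus the URLs, email addresses
and phone numbers found in the text parts of the body, and throws the rest
away. The function name reduce_message and the three patterns are assumptions;
the "enhanced" headers and the hand-off to spamprobe are not shown.

```python
import re
from email import message_from_string

# Assumed, deliberately simple patterns; real mail will need broader ones.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)")


def reduce_message(raw):
    """Return the headers plus only the URLs, addresses and phone numbers of the body."""
    msg = message_from_string(raw)
    header_lines = [f"{name}: {value}" for name, value in msg.items()]

    # Collect the decoded text of every text/* part (plain or HTML).
    body_parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            body_parts.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(body_parts)

    # Keep only the three kinds of tokens; all other body text is dropped.
    kept = []
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        kept.extend(pattern.findall(body))

    return "\n".join(header_lines) + "\n\n" + "\n".join(kept) + "\n"
```

The returned text is what would be handed to the second filter for both
training and scoring, in place of the full message.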

It generally returns the same results as SA's bayesian filter. When they
don't agree, it's the second filter that is right.

In particular it excels at:

Catching Nigerian spam.
Catching spam where the spammer includes a lot of text to confuse bayesian filters.
Correctly identifying newsletters containing advertising as nonspam.

It also learns much faster. With a little hand training - it's almost 100% accurate.

I can post more of the details if I get some interest - but - this works and
it's time to look at bayesian filters differently. And time to think about
multiple filters working on different parts of the message.




[Bug 4089] Micro Bayesian Filters - Phone Numbers

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4089





------- Additional Comments From lwilton@earthlink.net  2005-01-22 21:05 -------
Subject: Re:  Micro Bayesian Filters - Phone Numbers

> It generally returns the same results as SA's bayesian filter except when
> they don't agree - it's because the second filter is right.

> I can post more of the details if I get some interest - but - this works
> and it's time to look at bayesian filters differently. And time to think
> about multiple filters working on different parts of the message.

From the first statement I would tend to draw a different conclusion than
you did.
Is it that multiple filters are better than one, or is it that a filter that
drops extraneous stuff from its db is better than one that doesn't?  Your
first statement leads to the latter conclusion.

What happens if you disable Bayes in SA completely and just use your second
filter?  It sounds like this should improve overall results, since you won't
have the occasional SA Bayes error biasing the score in the wrong direction.

What I'm getting at here is that maybe the solution isn't multiple filters
(although I see nothing inherently wrong with that idea), but maybe the
solution is to simply prune the extraneous junk from the input to the main
SA Bayes filter so that it works like your add-on filter.

You clearly have a script or some such that is able to trim a message down
for input to the second filter.  How hard would it be to add that filtering
as an option in SA to feed the main Bayes routines?




