Posted to dev@spamassassin.apache.org by Theo Van Dinter <fe...@kluge.net> on 2004/01/12 02:13:40 UTC

muhaha! multipart/alternative evaltest ...

I haven't done a lot of work on this yet, but ...  I wrote up a very
simple algorithm that figures out the % difference between the text and
HTML parts of a message.  The theory goes that the closer to 0 the
difference is, the less likely the message is to be spam.

In a quick test from a few messages in each set:

ham		left: 32, orig: 448, difference: 7.14%
ham		left: 11, orig: 148, difference: 7.43%
ham		left: 36, orig: 72, difference: 50.00%		# I have to look at this one
spam		left: 216, orig: 218, difference: 99.08%
spam		left: 91, orig: 91, difference: 100.00%
spam		left: 46, orig: 46, difference: 100.00%

:)  I'm going to try refining it a bit more, then put it in as an
eval test.  (BTW: as a side effect, this test will also catch the
multipart/alternative messages that only have an HTML part ... ;))
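
For readers following along, here is a minimal sketch of how such a
percentage could be computed.  The post does not say how "left" and
"orig" are defined, so this assumes "orig" is the number of distinct
words in the rendered HTML part and "left" is how many of those words
do not also appear in the text part; the figures above are at least
consistent with difference = left / orig * 100 (e.g. 32 / 448 = 7.14%).
The function names and the crude tag-stripping renderer are
illustrative only, not the actual SpamAssassin code.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def render_html(html):
    """Rough stand-in for a real HTML renderer (assumption, not SA code)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def words(text):
    """Distinct lowercase word tokens in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def alt_difference(text_part, html_part):
    """Return (left, orig, percent) for one text/HTML pair.

    orig: distinct words in the rendered HTML part (assumed definition)
    left: those words that do not also occur in the text part
    """
    html_words = words(render_html(html_part))
    orig = len(html_words)
    if orig == 0:
        # Nothing to compare against; report no difference.
        return 0, 0, 0.0
    left = len(html_words - words(text_part))
    return left, orig, 100.0 * left / orig
```

Under these assumptions, a message whose text part is missing or empty
comes out at 100%, which matches the side effect mentioned above.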

-- 
Randomly Generated Tagline:
"Special?  Our longest phone conversation is 'Get over here.'" - Ross on ER

Re: muhaha! multipart/alternative evaltest ...

Posted by Theo Van Dinter <fe...@kluge.net>.

On Sun, Jan 11, 2004 at 08:13:40PM -0500, Theo Van Dinter wrote:
> ham		left: 36, orig: 72, difference: 50.00%		# I have to look at this one

Turns out this was a bug in the MIME parser (HTML wasn't being rendered).
Now that the bug is fixed:

left: 3, orig: 53, difference: 5.66%

-- 
Randomly Generated Tagline:
"Linux: the operating system with a CLUE...
 Command Line User Environment".
 (seen in a posting in comp.software.testing)

RE: muhaha! multipart/alternative evaltest ...

Posted by Gary Funck <ga...@intrepid.com>.


I agree. It would be a confirming indication of spam.

> From: Theo Van Dinter
> Sent: Sunday, January 11, 2004 7:44 PM
>
> On Sun, Jan 11, 2004 at 05:37:07PM -0800, Gary Funck wrote:
> > Rather than compare one alternative part to the other, would the following
> > make sense:
> >    score = max (score (text part), score (html part));
>
> Hrm.  I still think comparing is good (rules should catch spammer tricks),

Re: muhaha! multipart/alternative evaltest ...

Posted by Theo Van Dinter <fe...@kluge.net>.

On Sun, Jan 11, 2004 at 05:37:07PM -0800, Gary Funck wrote:
> Rather than compare one alternative part to the other, would the following
> make sense:
>    score = max (score (text part), score (html part));

Hrm.  I still think comparing is good (rules should catch spammer tricks),
but that might not be bad either.  It would require changing the flow
of rules a bit (at least putting in a loop around the body/uri rules,
then deciding which has the higher score).
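
A rough sketch of what that loop could look like, assuming each rule
is just a pattern with a score attached.  The rule list, scores, and
function names here are made up for illustration and are not real
SpamAssassin rules or internals.

```python
import re

# Hypothetical rules: (compiled pattern, score). Not real SpamAssassin rules.
RULES = [
    (re.compile(r"\bviagra\b", re.I), 2.5),
    (re.compile(r"\bclick here\b", re.I), 1.0),
]


def score_part(rendered_text, rules=RULES):
    """Sum the scores of every rule whose pattern matches this part."""
    return sum(score for pattern, score in rules if pattern.search(rendered_text))


def score_alternatives(parts, rules=RULES):
    """Run the rules once per alternative part and keep the highest total."""
    return max((score_part(part, rules) for part in parts), default=0.0)
```

For example, score_alternatives([text_part, rendered_html_part]) would
score each alternative separately and report only the worse of the two,
so a clean text part could not hide a spammy HTML part.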

-- 
Randomly Generated Tagline:
Hmm, doubtful.  The source code generally wasn't there when I needed it.
              -- Larry Wall when asked if he learned Perl from the perl source

RE: muhaha! multipart/alternative evaltest ...

Posted by Gary Funck <ga...@intrepid.com>.


I look forward to seeing how your tests work out.

Rather than compare one alternative part to the other, would the following
make sense:
   score = max (score (text part), score (html part));


> From: Theo Van Dinter
> Sent: Sunday, January 11, 2004 5:14 PM
>
> I haven't done a lot of work on this yet, but ...  I wrote up a very
> simple algorithm that figures out the % difference of the text and html
> parts of a message.  The theory goes that the closer to 0 the difference
> is, the less-likely the message is to be spam.