Posted to dev@spamassassin.apache.org by Theo Van Dinter <fe...@kluge.net> on 2004/01/12 02:13:40 UTC
muhaha! multipart/alternative evaltest ...
I haven't done a lot of work on this yet, but ... I wrote up a very
simple algorithm that figures out the % difference between the text and
HTML parts of a message. The theory goes that the closer to 0 the
difference is, the less likely the message is to be spam.
In a quick test from a few messages in each set:
ham left: 32, orig: 448, difference: 7.14%
ham left: 11, orig: 148, difference: 7.43%
ham left: 36, orig: 72, difference: 50.00% # I have to look at this one
spam left: 216, orig: 218, difference: 99.08%
spam left: 91, orig: 91, difference: 100.00%
spam left: 46, orig: 46, difference: 100.00%
:) I'm going to try refining it a bit more, then put it in as an
eval test. (BTW: as a side effect, this test will also catch the
multipart/alternative messages that only have an HTML part ... ;))
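
A minimal sketch of how such a percentage could be computed, as an
illustration rather than the code from the post. The figures above are
consistent with difference = left / orig * 100 (32 / 448 is 7.14%), but
the thread does not say what "left" and "orig" count. Assumed here: orig
is the number of distinct words in the rendered HTML part, and left is
how many of those never appear in the text part. The names tokens,
percent_left and alt_difference are hypothetical.

```python
def tokens(text):
    """Return the set of lowercased, whitespace-delimited words."""
    return set(text.lower().split())


def percent_left(left, orig):
    """Percentage of the original word count that is left over."""
    return 100.0 * left / orig


def alt_difference(text_part, rendered_html):
    """Percent of distinct rendered-HTML words missing from the text part.

    Assumption: 'orig' is the distinct-word count of the rendered HTML and
    'left' is what remains after removing words seen in the text part.
    """
    orig = tokens(rendered_html)
    if not orig:
        # Assumption: with no HTML words there is nothing to compare.
        return 0.0
    left = orig - tokens(text_part)
    return percent_left(len(left), len(orig))


if __name__ == "__main__":
    print(f"{alt_difference('hello world', 'Hello world'):.2f}%")
    print(f"{alt_difference('', 'buy now'):.2f}%")
```

Under these assumptions an empty or missing text part leaves every HTML
word unmatched, so the result is 100%, which would line up with the
side effect mentioned above.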
--
Randomly Generated Tagline:
"Special? Our longest phone conversation is 'Get over here.'" - Ross on ER
Re: muhaha! multipart/alternative evaltest ...
Posted by Theo Van Dinter <fe...@kluge.net>.
On Sun, Jan 11, 2004 at 08:13:40PM -0500, Theo Van Dinter wrote:
> ham left: 36, orig: 72, difference: 50.00% # I have to look at this one
Turns out this was a bug in the MIME parser (HTML wasn't being rendered).
Now that the bug is fixed:
left: 3, orig: 53, difference: 5.66%
--
Randomly Generated Tagline:
"Linux: the operating system with a CLUE...
Command Line User Environment".
(seen in a posting in comp.software.testing)
RE: muhaha! multipart/alternative evaltest ...
Posted by Gary Funck <ga...@intrepid.com>.
I agree. It would be a confirming indication of spam.
> From: Theo Van Dinter
> Sent: Sunday, January 11, 2004 7:44 PM
>
> On Sun, Jan 11, 2004 at 05:37:07PM -0800, Gary Funck wrote:
> > Rather than compare one alternative part to the other, would the
> > following make sense:
> > score = max (score (text part), score (html part));
>
> Hrm. I still think comparing is good (rules should catch spammer tricks),
Re: muhaha! multipart/alternative evaltest ...
Posted by Theo Van Dinter <fe...@kluge.net>.
On Sun, Jan 11, 2004 at 05:37:07PM -0800, Gary Funck wrote:
> Rather than compare one alternative part to the other, would the following
> make sense:
> score = max (score (text part), score (html part));
Hrm. I still think comparing is good (rules should catch spammer tricks),
but that might not be bad either. It would require changing the flow
of rules a bit (at least putting in a loop around the body/uri rules,
then deciding which has the higher score).
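
A rough sketch of the max-over-parts idea, for illustration only. Rules
are modeled as hypothetical (predicate, score) pairs, and score_part and
score_alternatives are made-up names, not SpamAssassin functions. The
loop runs the same rule set over each rendering and keeps the higher
total.

```python
def score_part(text, rules):
    """Sum the scores of every rule whose predicate matches this text."""
    return sum(score for matches, score in rules if matches(text))


def score_alternatives(parts, rules):
    """Score each alternative part separately and return the highest total.

    Returns 0.0 when there are no parts at all.
    """
    return max((score_part(part, rules) for part in parts), default=0.0)


if __name__ == "__main__":
    # Hypothetical body rules: (predicate, score).
    rules = [
        (lambda t: "viagra" in t.lower(), 2.5),
        (lambda t: "click here" in t.lower(), 1.0),
    ]
    text_part = "Hello, see you at lunch."
    html_part = "VIAGRA! Click here now."
    print(score_alternatives([text_part, html_part], rules))
```

Taking the maximum means a spammer cannot lower the score by pairing a
spammy HTML part with an innocuous text part, but on its own it does not
flag the mismatch between the two parts.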
--
Randomly Generated Tagline:
Hmm, doubtful. The source code generally wasn't there when I needed it.
-- Larry Wall when asked if he learned Perl from the perl source
RE: muhaha! multipart/alternative evaltest ...
Posted by Gary Funck <ga...@intrepid.com>.
I look forward to seeing how your tests work out.
Rather than compare one alternative part to the other, would the following
make sense:
score = max (score (text part), score (html part));
> From: Theo Van Dinter
> Sent: Sunday, January 11, 2004 5:14 PM
>
> I haven't done a lot of work on this yet, but ... I wrote up a very
> simple algorithm that figures out the % difference of the text and html
> parts of a message. The theory goes that the closer to 0 the difference
> is, the less-likely the message is to be spam.