Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2005/05/26 02:12:35 UTC

Re[2]: "Grouping" input

Hello Matt, John,

Tuesday, May 24, 2005, 7:15:16 PM, you wrote:

MK> John August wrote:
>> I've noticed spam which has a section of "extracted" text after the spam
>> content. It seems to me that by taking things line by line, you'll reach
>> a point at which the spam index peaks, and then trails off after. This
>> is a pattern which would remain even if the "overall" spam index is low.
>> 
>> Does the current SpamAssassin implement such an approach? Or is the
>> algorithm sufficiently subtle to null out these attempts?

MK> AFAIK, no part of SA takes such an approach.
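
To make John's idea concrete, here is a minimal sketch of a line-by-line "spam index" that checks for a peak followed by a trail-off. Everything in it is an assumption for illustration: the word list, the per-line score, and the names line_scores and peak_then_trail are hypothetical and are not part of SpamAssassin.

```python
# Toy word list; a real filter would use learned token statistics instead.
SPAM_WORDS = {"viagra", "cheap", "pills", "online", "buy"}

def line_scores(body):
    """Score each non-empty line as the fraction of its words in SPAM_WORDS."""
    scores = []
    for line in body.splitlines():
        words = [w.strip(".,!?:;").lower() for w in line.split()]
        words = [w for w in words if w]
        if not words:
            continue  # skip blank lines
        hits = sum(1 for w in words if w in SPAM_WORDS)
        scores.append(hits / len(words))
    return scores

def peak_then_trail(scores, drop=0.5):
    """True if there is a nonzero peak and every later line scores <= drop * peak."""
    if not scores:
        return False
    peak_index = max(range(len(scores)), key=scores.__getitem__)
    peak = scores[peak_index]
    tail = scores[peak_index + 1:]
    return peak > 0 and bool(tail) and all(s <= peak * drop for s in tail)

# One spam line and a URL, followed by "extracted" filler text.
message = """Buy cheap pills online
http://example.invalid/
The garden was quiet in the early morning light.
She walked along the river and thought of home.
"""

scores = line_scores(message)
print(scores)
print(peak_then_trail(scores))
```

The point of the sketch is only that the shape of the per-line scores can be tested independently of the overall score; the drop threshold of 0.5 is arbitrary.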

MK> However, these attempts are only going to be effective against the Bayes portion
MK> of SA.

As I've said before, my opinion is that these attempts are NOT
effective against SpamAssassin's Bayes system.

As a rule, we do NOT receive hams which contain such extracted text.
No matter where the spammers extract their text from, they're going to
extract words that are not found in ham, and Bayes is going to learn
that the presence of such words means S P A M.

That works here.  What confuses Bayes is short messages without
extracted text whose words are not spam-specific.  Lengthen those
messages by adding extracted text, and my Bayes system recognizes the
extractions for what they are.

Bob Menschel




Re: "Grouping" input

Posted by Matt Kettler <mk...@evi-inc.com>.
Robert Menschel wrote:

> MK> However, these attempts are only going to be effective against the Bayes portion
> MK> of SA.
> 
> As I've said before, my opinion is that these attempts are NOT
> effective against SpamAssassin's Bayes system.
> 
> As a rule, we do NOT receive hams which contain such extracted text.
> No matter where the spammers extract their text from, they're going to
> extract words that are not found in ham, and Bayes is going to learn
> that the presence of such words means S P A M.

I mostly agree. However, I have found that SOME emails with extracted text
collide with our ham profile. Not all, not even many, but some do.

Really, this is entirely a function of how well the spammer can match your ham
profile with his extraction. If he can match it accurately, this technique will
be very effective against your Bayes. If he can't match your ham profile, it
won't work at all.
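
The two outcomes can be illustrated with a toy calculation. This is only a sketch: it uses a plain naive-Bayes log-odds combination, which is an assumption made for simplicity and not a description of how SpamAssassin's Bayes code combines tokens, and all the per-token probabilities are made-up values.

```python
import math

def combine(token_probs):
    """Naive-Bayes combination of per-token spam probabilities via summed log-odds."""
    log_odds = sum(math.log(p / (1.0 - p)) for p in token_probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical per-token spam probabilities (not taken from any real database).
spam_line = [0.95, 0.90]      # tokens from the one-line spam payload
unseen_in_ham = [0.80] * 4    # extracted words this site has only seen in spam
matches_ham = [0.10] * 4      # extracted words that are common in this site's ham

# Extraction that misses the ham profile pushes the result further toward spam.
print(combine(spam_line + unseen_in_ham))

# Extraction that matches the ham profile can outweigh the short spam payload.
print(combine(spam_line + matches_ham))
```

With these assumed numbers, the same two spam tokens end up on opposite sides of 0.5 depending only on how the extracted words look to the local ham profile.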


Just today I got one email with this hit list:

score=17.817, required 5, autolearn=spam, AB_URI_RBL 1.00, BAYES_10 -0.91,
BLACK_URI_RBL 2.00, DRUGS_ERECTILE 1.00, INFO_GREYLIST_NOTDELAYED -0.00,
RAZOR2_CF_RANGE_51_100 0.20, RAZOR2_CHECK 1.05, RCVD_IN_BL_SPAMCOP_NET 1.50,
RCVD_IN_XBL 4.92, SPAMCOP_URI_RBL 3.00, VIAGRA_ONLINE 4.06


It got the BAYES_10 because the extracted text closely matches the general
language style of my end users. The spam content was one line and a URL. The
extracted text was four lines.
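
For anyone who wants to check the arithmetic on that hit list, here is a small sketch that parses the per-rule scores and sums them. The regular expression and the helper name rule_scores are my own assumptions about the layout of this particular report line, not a SpamAssassin API.

```python
import re

report = (
    "score=17.817, required 5, autolearn=spam, AB_URI_RBL 1.00, BAYES_10 -0.91, "
    "BLACK_URI_RBL 2.00, DRUGS_ERECTILE 1.00, INFO_GREYLIST_NOTDELAYED -0.00, "
    "RAZOR2_CF_RANGE_51_100 0.20, RAZOR2_CHECK 1.05, RCVD_IN_BL_SPAMCOP_NET 1.50, "
    "RCVD_IN_XBL 4.92, SPAMCOP_URI_RBL 3.00, VIAGRA_ONLINE 4.06"
)

def rule_scores(text):
    """Map each upper-case rule name to the signed decimal score that follows it."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9_]+) (-?\d+\.\d+)", text)
    return {name: float(value) for name, value in pairs}

scores = rule_scores(report)
total = sum(scores.values())
without_bayes = total - scores["BAYES_10"]

print(len(scores), "rules")
print("sum of listed scores:", round(total, 2))
print("sum without BAYES_10:", round(without_bayes, 2))
```

The listed per-rule values are rounded to two decimals, so their sum agrees with the reported score=17.817 only to within rounding. Either way, the -0.91 from BAYES_10 is small next to the network and URI rules, and the message stays far above the required 5.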