Posted to users@spamassassin.apache.org by "Peter H. Lemieux" <ph...@cyways.com> on 2006/10/26 15:46:28 UTC

Scoring base64 blob messages

I received a spam today where the text was only a base64-encoded blob.

Content-Type: text/html;
	charset="us-ascii"
Content-Transfer-Encoding: base64
Subject: feel young and strong again

PGh0bWw+DQpTdG9wIG92ZXJwYXlpbmcgZm9yIHlvdXIgcHJlc2NyaXB0aW9uIG1lZGljYXRpb25z
IHRvZGF5Lg0KPGJyPg0KPGJyPg0KU2F2ZSBtb3JlIHRoYW4gc2l4dHkgcGVyY2VudCBvbiBicmFu
ZCBuYW1lIGdlbmVyaWMgbWVkcyB0aGF0IGFyZSBjaGVtaWNhbGx5IGlkZW50aWNhbC4NCjxicj4N

Does SA convert the blob into text before scanning?  It contains a number 
of drug-related words and a URI that points to "pharmconnect.org".

Also is there an SA rule that scores messages that contain only a single 
base64 part (as opposed to a base64-encoded attachment)?  I doubt many 
legitimate messages arrive with only a single base64 part.

Peter

Re: Scoring base64 blob messages

Posted by Matt Kettler <mk...@verizon.net>.
Peter H. Lemieux wrote:
> I received a spam today where the text was only a base64-encoded blob.
>
> Content-Type: text/html;
>     charset="us-ascii"
> Content-Transfer-Encoding: base64
> Subject: feel young and strong again
>
> PGh0bWw+DQpTdG9wIG92ZXJwYXlpbmcgZm9yIHlvdXIgcHJlc2NyaXB0aW9uIG1lZGljYXRpb25z
> IHRvZGF5Lg0KPGJyPg0KPGJyPg0KU2F2ZSBtb3JlIHRoYW4gc2l4dHkgcGVyY2VudCBvbiBicmFu
> ZCBuYW1lIGdlbmVyaWMgbWVkcyB0aGF0IGFyZSBjaGVtaWNhbGx5IGlkZW50aWNhbC4NCjxicj4N
>
> Does SA convert the blob into text before scanning?  
Yes. It's done that for a LONG time. Even SA 2.3x did that. Even
"rawbody" rules are run after decoding base64.

Otherwise this would be a huge hole in SA and every spammer would very
quickly use base64 for all their spam. (Yes, spammers DO very
aggressively study SpamAssassin and tune their mail to fit its
weaknesses. VERY aggressively. Anything this obvious and easy would be
discovered and become widespread within two months of an SA release.)
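
To see what a scanner works with once the transfer encoding is removed, the quoted blob can be decoded by hand. This is a minimal Python sketch using only the standard library; it is not SpamAssassin's own decoder (SA is written in Perl), and the variable names are made up for illustration.

```python
import base64

# The three base64 lines quoted in the original message, joined together.
blob = (
    "PGh0bWw+DQpTdG9wIG92ZXJwYXlpbmcgZm9yIHlvdXIgcHJlc2NyaXB0aW9uIG1lZGljYXRpb25z"
    "IHRvZGF5Lg0KPGJyPg0KPGJyPg0KU2F2ZSBtb3JlIHRoYW4gc2l4dHkgcGVyY2VudCBvbiBicmFu"
    "ZCBuYW1lIGdlbmVyaWMgbWVkcyB0aGF0IGFyZSBjaGVtaWNhbGx5IGlkZW50aWNhbC4NCjxicj4N"
)

# Decode to bytes, then to text using the charset declared in the headers.
decoded = base64.b64decode(blob).decode("us-ascii")
print(decoded)
```

The decoded text is ordinary HTML containing the drug-related wording, which is what body rules would then be matched against.
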
> It contains a number of drug-related words and a URI that points to
> "pharmconnect.org".
>
> Also is there an SA rule that scores messages that contain only a
> single base64 part (as opposed to a base64-encoded attachment)?  I
> doubt many legitimate messages arrive with only a single base64 part.
No, but there is one that detects base64 encoding of text sections:
MIME_BASE64_TEXT.
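
The kind of check Peter asks about (a message whose only body is a base64-encoded text part) can be sketched outside SA as follows. This is an illustrative Python approximation built on the standard `email` module, not the implementation of MIME_BASE64_TEXT; the function name `is_single_base64_text` and the sample message are assumptions made for the example.

```python
from email import message_from_string


def is_single_base64_text(raw: str) -> bool:
    """True if the message is not multipart and its sole body is base64 text."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        return False
    encoding = str(msg.get("Content-Transfer-Encoding", "")).strip().lower()
    return msg.get_content_maintype() == "text" and encoding == "base64"


# Headers copied from the spam quoted above, with a shortened body.
sample = (
    "Content-Type: text/html;\n"
    '\tcharset="us-ascii"\n'
    "Content-Transfer-Encoding: base64\n"
    "Subject: feel young and strong again\n"
    "\n"
    "PGh0bWw+DQpTdG9w\n"
)

print(is_single_base64_text(sample))
```

A check like this says nothing about whether the message is spam; it only identifies the structure.
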



Re: Scoring base64 blob messages

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Oct 27, 2006 at 05:24:58PM -0400, Peter H. Lemieux wrote:
> >Well, there isn't "a" SA corpus, so there's no answer to that question.
> 
> Ah, I hadn't read this page before:
> 	http://wiki.apache.org/spamassassin/HandClassifiedCorpora
> My recollection was that 2.x used a centrally-defined corpus rather than 
> a variety of developers' corpora (see, I read the wiki).  Either things 
> changed with the switch in scoring algorithms in 3.x, or my recollection 
> is shoddy.  Probably the latter.

Yeah, sorry.  We've had separate corpora since I started with SA several years
ago.  There was a "public corpus" of mail made available which could be
confusing your memory. :)

-- 
Randomly Selected Tagline:
"I pity the shul that won't let Krusty in now. Spin me clown!"
         - Mr. T, The Simpsons, "Today, I Am a Klown"

Re: Scoring base64 blob messages

Posted by "Peter H. Lemieux" <ph...@cyways.com>.
Theo Van Dinter wrote:
> On Thu, Oct 26, 2006 at 12:19:23PM -0400, Peter H. Lemieux wrote:
>>> No, because there are going to be a lot of mails that would hit that.
>> Really?  Maybe it's because I live in the US, but I can't think of a 
>> legitimate message I've ever received consisting only of a base64 blob. 
> 
> You look at a lot of raw messages?  ;)

Doesn't everybody?

Seriously, I do look at a lot of raw messages; for instance, I review the 
full text of nearly every spam message that doesn't get caught by my 
filters and shows up in my inbox.  Obviously I don't get much mail from 
Blackberry users or Ticketmaster!

>> Rather than making anyone else do the work for me, is there something I 
>> can read about how to determine the frequency of different message 
>> features appearing in the corpus?

> Well, there isn't "a" SA corpus, so there's no answer to that question.

Ah, I hadn't read this page before:
	http://wiki.apache.org/spamassassin/HandClassifiedCorpora
My recollection was that 2.x used a centrally-defined corpus rather than 
a variety of developers' corpora (see, I read the wiki).  Either things 
changed with the switch in scoring algorithms in 3.x, or my recollection 
is shoddy.  Probably the latter.

> You can generate some rules and use mass-check to run against your own corpus
> to gather some statistics.  I'm willing to run some rules for you against my
> corpus if you want.  I just don't have time to come up with the rules right
> now.

Thanks for the offer, Theo, but don't spend your valuable time on this. 
I'll give it a shot some day when I've got some spare moments.  If I do get 
some candidate rules, I'll pass them along to you for testing.


Thanks again!
Peter

Re: Scoring base64 blob messages

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Oct 26, 2006 at 12:19:23PM -0400, Peter H. Lemieux wrote:
> >No, because there are going to be a lot of mails that would hit that.
> 
> Really?  Maybe it's because I live in the US, but I can't think of a 
> legitimate message I've ever received consisting only of a base64 blob. 

You look at a lot of raw messages?  ;)

> Out of curiosity, how frequently does this appear in the SA ham corpus? 

Well, there isn't "a" SA corpus, so there's no answer to that question.  As
for how often it happens in my corpus, I don't know; I'd have to write a rule
and run it against the messages.

> Rather than making anyone else do the work for me, is there something I 
> can read about how to determine the frequency of different message 
> features appearing in the corpus?

You can generate some rules and use mass-check to run against your own corpus
to gather some statistics.  I'm willing to run some rules for you against my
corpus if you want.  I just don't have time to come up with the rules right
now.
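
For a rough idea of how often a feature shows up in one's own hand-classified mail, without setting up mass-check, something like the following Python sketch could be used. It is a stand-in and not mass-check itself; the helper names `single_base64_text` and `feature_frequency`, and the two sample messages, are hypothetical. In practice the raw messages would be read from the corpus files.

```python
from email import message_from_bytes
from email.message import Message
from typing import Callable, Iterable, Tuple


def single_base64_text(msg: Message) -> bool:
    """Hypothetical feature: a non-multipart text message sent as base64."""
    if msg.is_multipart():
        return False
    encoding = str(msg.get("Content-Transfer-Encoding", "")).strip().lower()
    return msg.get_content_maintype() == "text" and encoding == "base64"


def feature_frequency(raw_messages: Iterable[bytes],
                      predicate: Callable[[Message], bool]) -> Tuple[int, int]:
    """Return (hits, total) for a predicate over raw RFC 822 messages."""
    hits = total = 0
    for raw in raw_messages:
        total += 1
        if predicate(message_from_bytes(raw)):
            hits += 1
    return hits, total


# Two made-up messages standing in for a hand-classified ham corpus.
ham = [
    b"Content-Type: text/plain\n\nSee you at lunch.\n",
    b"Content-Type: text/plain\nContent-Transfer-Encoding: base64\n\nSGVsbG8=\n",
]
hits, total = feature_frequency(ham, single_base64_text)
print(f"{hits} of {total} ham messages hit")
```

Running the same predicate over separate ham and spam collections gives the two hit rates needed to judge whether a rule would be worth scoring.
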

-- 
Randomly Selected Tagline:
strrev(strcpy("xus yti     "+7,"varg")-7)[0]='G'

Re: Scoring base64 blob messages

Posted by Stuart Johnston <st...@ebby.com>.
Peter H. Lemieux wrote:
> Theo Van Dinter wrote:
>> On Thu, Oct 26, 2006 at 09:46:28AM -0400, Peter H. Lemieux wrote:
>>> Also is there an SA rule that scores messages that contain only a 
>>> single base64 part (as opposed to a base64-encoded attachment)?  I 
>>> doubt many legitimate messages arrive with only a single base64 part.
>>
>> No, because there are going to be a lot of mails that would hit that.
> 
> Really?  Maybe it's because I live in the US, but I can't think of a 
> legitimate message I've ever received consisting only of a base64 blob. 
> Out of curiosity, how frequently does this appear in the SA ham corpus? 
> Rather than making anyone else do the work for me, is there something I 
> can read about how to determine the frequency of different message 
> features appearing in the corpus?

Most messages sent from a Blackberry would hit this rule, for example.

Re: Scoring base64 blob messages

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Oct 27, 2006 at 11:44:48AM -0400, Daryl C. W. O'Shea wrote:
> Ticketmaster sends out *a lot* of their mail this way.  I'm sure it's 
> partly in an attempt to avoid having their mail FP against crappy filters.

I'd also imagine that sometimes it's just easier to do this than try to pay
attention to what is being sent and determine if encoding is necessary.
Programmers tend to be lazy after all. :)

-- 
Randomly Selected Tagline:
"There are two major products to come out of Berkeley: LSD and UNIX.  We
 don't believe this to be a coincidence."      - Unknown

Re: Scoring base64 blob messages

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Peter H. Lemieux wrote:
> Theo Van Dinter wrote:
>> On Thu, Oct 26, 2006 at 09:46:28AM -0400, Peter H. Lemieux wrote:

>>> Also is there an SA rule that scores messages that contain only a 
>>> single base64 part (as opposed to a base64-encoded attachment)?  I 
>>> doubt many legitimate messages arrive with only a single base64 part.
>>
>> No, because there are going to be a lot of mails that would hit that.
> 
> Really?  Maybe it's because I live in the US, but I can't think of a 
> legitimate message I've ever received consisting only of a base64 blob.

I'm not sure what to say to that. ;)


> Out of curiosity, how frequently does this appear in the SA ham corpus? 

Ticketmaster sends out *a lot* of their mail this way.  I'm sure it's 
partly in an attempt to avoid having their mail FP against crappy filters.


Daryl

Re: Scoring base64 blob messages

Posted by "Peter H. Lemieux" <ph...@cyways.com>.
Theo Van Dinter wrote:
> On Thu, Oct 26, 2006 at 09:46:28AM -0400, Peter H. Lemieux wrote:
>> Does SA convert the blob into text before scanning?  It contains a number 
>> of drug-related words and a URI that points to "pharmconnect.org".
> 
> Yes.

I was pretty sure this was the case but wanted to confirm it.

>> Also is there an SA rule that scores messages that contain only a single 
>> base64 part (as opposed to a base64-encoded attachment)?  I doubt many 
>> legitimate messages arrive with only a single base64 part.
> 
> No, because there are going to be a lot of mails that would hit that.

Really?  Maybe it's because I live in the US, but I can't think of a 
legitimate message I've ever received consisting only of a base64 blob. 
Out of curiosity, how frequently does this appear in the SA ham corpus? 
Rather than making anyone else do the work for me, is there something I 
can read about how to determine the frequency of different message 
features appearing in the corpus?

Thanks, Theo.

Peter




Re: Scoring base64 blob messages

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Oct 26, 2006 at 09:46:28AM -0400, Peter H. Lemieux wrote:
> Does SA convert the blob into text before scanning?  It contains a number 
> of drug-related words and a URI that points to "pharmconnect.org".

Yes.

> Also is there an SA rule that scores messages that contain only a single 
> base64 part (as opposed to a base64-encoded attachment)?  I doubt many 
> legitimate messages arrive with only a single base64 part.

No, because there are going to be a lot of mails that would hit that.

-- 
Randomly Selected Tagline:
Interstellar Matter is a Gas