Posted to users@spamassassin.apache.org by Kārlis Repsons <ka...@gmail.com> on 2010/01/30 14:35:26 UTC

How should this tricky spam be filtered?

People,
perhaps it's simple to do, but I personally would like to know how to
get rid of something like this:

----------  Forwarded Message  ----------

Subject: marty rizin g  suppe r  socio logy  mason ing
Date: Friday 29 January 2010
From: "Cheap Tamiflu on www.ra97.com" <ha...@icasal.com>
To: repsons@gmail.com

conju nctiv a  remod eled  tsing hai  andro stero ne  ropie r  suici des
 ulste r  ratfi nk  cleri cal  shado wgrap h  plain sman  human ity
griso ns  goosa nder  snipp ed  unhon ourab le  mappa ble  malap rop
idoli zed  tosca na  commi t  garda  speci alism  compe er  duple ix
conso rting  rehab ilita te  berat es  megaw ords  confu siona l  seams
ter  therm omete r  overd ose  withh olds  growl y  manwa rds  berat es
neolo gizes  oblig er  confu siona l  strok ing  signo ri  bogie  mason
ing  naira  disfo rest  appar entne ss  foras much  organ ized  larce
nist  tips  offic iatin g  beeve s  liqui dized  homoe opath  heads
quare  bagga gemen  trico rn  ropie r  exqui sitel y  trico rn  churc
hgoer  retal iated  decea ses  desmo ulins  outbu rsts  purve yed  mappa
ble  wrack ing  docke d  hydro logy

-------------------------------------------------------

Obviously, the only useful part of all that was the From: name field.

SA gives just "X-Spam-Status: No, score=-0.7 required=4.0 tests=BAYES_20 
autolearn=ham version=3.2.5-gr2".

Hopefully a valid question here...

Re: How should this tricky spam be filtered?

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From: Kārlis Repsons <ka...@gmail.com>
   Date: Sat, 30 Jan 2010 17:20:23 +0000
   
   On Saturday 30 January 2010 15:48:36 Jeff Mincy wrote:
   >  BAYES_99,DCC_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_FIVETEN_SPAM,RCVD_IN_NIXSPAM,RCVD_IN_UCEPROTECT1,RCVD_IN_UCEPROTECT2,RCVD_IN_UCEPROTECT3,BOTNET,BOTNET_BADDNS
   > 
   > Botnet/FIVETEN/NIXSPAM/UCEPROTECT are additional rules added.
   > -jeff
   
   Thanks, just about DCC: why its said to be "not opensource" and commented out 
   in a spamassassin default config? Are there any closed-source binaries on a 
   client machine from it? Any such binaries related to SA exist?

DCC is a separately managed project with its own license.  DCC has to be
installed and configured (dccproc and dccifd) outside of SpamAssassin.
Once DCC is installed, SpamAssassin has to be configured to use it
by loading the plugin.  You can install DCC from source or from various
repositories.  The same is true for Razor and Pyzor.
-jeff

Re: How should this tricky spam be filtered?

Posted by Kārlis Repsons <ka...@gmail.com>.
On Saturday 30 January 2010 15:48:36 Jeff Mincy wrote:
>  BAYES_99,DCC_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_FIVETEN_SPAM,RCVD_IN_NIXSPAM,RCVD_IN_UCEPROTECT1,RCVD_IN_UCEPROTECT2,RCVD_IN_UCEPROTECT3,BOTNET,BOTNET_BADDNS
> 
> Botnet/FIVETEN/NIXSPAM/UCEPROTECT are additional rules added.
> 
> -jeff

Thanks. Just about DCC: why is it said to be "not open source" and commented out
in the default SpamAssassin config? Does it put any closed-source binaries on a
client machine? Do any such binaries related to SA exist?

Re: How should this tricky spam be filtered?

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From: Ralph Bornefeld-Ettmann <il...@bornefeld-ettmann.de>
   Date: Sat, 30 Jan 2010 18:14:10 +0100
   
   Am 30.01.2010 16:48, schrieb Jeff Mincy:
   >    From: Kārlis Repsons <ka...@gmail.com>
   >    Date: Sat, 30 Jan 2010 14:07:16 +0000
   >    
   >    On Saturday 30 January 2010 13:54:14 Jeff Mincy wrote:
   >    > Retrain the message correctly in Bayes.  Bayes will catch on to this
   >    > after a few times.  The subject alone should be a strong enough clue
   >    > for bayes (I get BAYES_80 on this partial sample), so it looks like
   >    > you are doing only autolearn and not correcting messages that were
   >    > learned incorrectly.
   >    > -jeff
   >    
   > I couldn't figure out how to get an unadulterated version of the
   > message from the spamalyser.com link you posted in a previous message.
   > I tried this
   >  wget -O - -q http://spamalyser.com/v/5cbffujq/original.txt
   > pastebin has a simple way to download the original.
   > Anyway, I eventually got something.

   in the "Raw Message" tab you can get the plain message
   (http://spamalyser.com/v/5cbffujq/raw)
   
Sorry.   Looks more like html here.

  % wget -O - -q  http://spamalyser.com/v/5cbffujq/raw | head
  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
  <html lang="en-GB">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

To get the raw email message, I'd have to write something like 
  wget -O - -q http://spamalyser.com/v/5cbffujq/raw | w3m -dump -T text/html
followed by sed scripts that keep only the lines with line numbers and
then discard the line numbers.
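
A minimal Python sketch of that last clean-up step, as an alternative to
sed. It assumes (hypothetically) that each message line in the text dump
starts with a decimal line number followed by at most one separating
blank; the real layout of the spamalyser.com page is not shown in this
thread, so the pattern may need adjusting. The function name is
illustrative only.

```python
import re

# Assumed layout: optional leading blanks, a decimal line number,
# then at most one separator blank before the original message line.
NUMBERED_LINE = re.compile(r"^\s*\d+ ?(.*)$")


def strip_line_numbers(dump: str) -> str:
    """Keep only the numbered lines of a text dump and drop the numbers."""
    kept = []
    for line in dump.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            # Unnumbered lines (page chrome) are discarded entirely.
            kept.append(match.group(1))
    return "\n".join(kept)
```

Page text that happens to start with a digit would also be kept, so the
result still needs a quick look before it is fed to SpamAssassin.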

I guess http://spamalyser.com is looking at the User-Agent: Wget/1.10.2
header.

Maybe there could be a really-raw-without-line-numbers-and-no-html target.

-jeff

Re: How should this tricky spam be filtered?

Posted by Mike Cardwell <sp...@lists.grepular.com>.
On 30/01/2010 17:14, Ralph Bornefeld-Ettmann wrote:

>> I couldn't figure out how to get an unadulterated version of the
>> message from the spamalyser.com link you posted in a previous message.
>> I tried this
>>  wget -O - -q http://spamalyser.com/v/5cbffujq/original.txt
>> pastebin has a simple way to download the original.
>> Anyway, I eventually got something.
>>
> in the "Raw Message" tab you can get the plain message
> (http://spamalyser.com/v/5cbffujq/raw)

On the raw page is the (source) link that gives you a complete
"text/plain" copy of the original content uploaded. There are a bunch of
referer restrictions in order to prevent content being uploaded and then
linked to from spam, which is why the wget failed. I have removed
referer checks for user agents matching /wget|lwp|lynx|links|python/i
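
For illustration, the user-agent test described above could look like
this in Python. Only the pattern itself comes from the message; the
function name and the surrounding logic are hypothetical, and the site's
real code is not shown in this thread.

```python
import re

# The pattern quoted in the message, matched case-insensitively (the /i flag).
REFERER_EXEMPT_AGENTS = re.compile(r"wget|lwp|lynx|links|python", re.IGNORECASE)


def skips_referer_check(user_agent: str) -> bool:
    """Return True when the User-Agent matches one of the exempted clients."""
    return REFERER_EXEMPT_AGENTS.search(user_agent) is not None
```

With this sketch, "Wget/1.10.2" is exempt from the referer check while a
typical browser User-Agent string is not.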

-- 
Mike Cardwell    : UK based IT Consultant, Perl developer, Linux admin
Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
Technical Blog   : Tech Blog  - https://secure.grepular.com/
Spamalyser       : Spam Tool  - http://spamalyser.com/

Re: How should this tricky spam be filtered?

Posted by RW <rw...@googlemail.com>.
On Sat, 30 Jan 2010 19:25:15 +0200
Jari Fredriksson <ja...@iki.fi> wrote:

> On 30.1.2010 19:14, Ralph Bornefeld-Ettmann wrote:
> 
> > 
> > in the "Raw Message" tab you can get the plain message
> > (http://spamalyser.com/v/5cbffujq/raw)
> > 
> 
> It's not raw message, it has a line number on each row.
> 
Click on raw-message and then source.

Re: How should this tricky spam be filtered?

Posted by Jari Fredriksson <ja...@iki.fi>.
On 30.1.2010 19:14, Ralph Bornefeld-Ettmann wrote:

> 
> in the "Raw Message" tab you can get the plain message
> (http://spamalyser.com/v/5cbffujq/raw)
> 

It's not the raw message; it has a line number on each row.

-- 
http://www.iki.fi/jarif/

You may be recognized soon.  Hide.


Re: How should this tricky spam be filtered?

Posted by Ralph Bornefeld-Ettmann <il...@bornefeld-ettmann.de>.
Am 30.01.2010 16:48, schrieb Jeff Mincy:
>    From: Kārlis Repsons <ka...@gmail.com>
>    Date: Sat, 30 Jan 2010 14:07:16 +0000
>    
>    On Saturday 30 January 2010 13:54:14 Jeff Mincy wrote:
>    > Retrain the message correctly in Bayes.  Bayes will catch on to this
>    > after a few times.  The subject alone should be a strong enough clue
>    > for bayes (I get BAYES_80 on this partial sample), so it looks like
>    > you are doing only autolearn and not correcting messages that were
>    > learned incorrectly.
>    > -jeff
>    
> I couldn't figure out how to get an unadulterated version of the
> message from the spamalyser.com link you posted in a previous message.
> I tried this
>  wget -O - -q http://spamalyser.com/v/5cbffujq/original.txt
> pastebin has a simple way to download the original.
> Anyway, I eventually got something.
> 
>    Hmm, well, I just started with SA, so my filters aren't much trained yet. 
>    The thing is, I didn't believe its the Bayes filter to be used for that case! 
> 
> Bayes is an incredible tool, but only if you let it.  The worst thing
> you can do to bayes is mistrain it by learning spam messages as ham.
> The other bad thing is to limit the number of messages that it learns from.
> 
>    Because I still think, that its not correct to train SA filter on that letter 
>    as spam! It can contain words, which simply should not contribute to be more 
>    "spam", no? Thats not a problem?
> 
> No, that is not a problem.
> Yes, spam contains words, some of those words will also occur in ham.
> Bayes will figure out which words are spammy and which are hammy and
> which occur in both.
> 
> First start with training Bayes and then check if DCC and network
> tests are enabled.
> 
> Anyway, I get the following.   
>    BAYES_99,DCC_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_FIVETEN_SPAM,RCVD_IN_NIXSPAM,RCVD_IN_UCEPROTECT1,RCVD_IN_UCEPROTECT2,RCVD_IN_UCEPROTECT3,BOTNET,BOTNET_BADDNS
> 
> Botnet/FIVETEN/NIXSPAM/UCEPROTECT are additional rules added.
> 
> -jeff
> 

in the "Raw Message" tab you can get the plain message
(http://spamalyser.com/v/5cbffujq/raw)


Re: How should this tricky spam be filtered?

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From: Kārlis Repsons <ka...@gmail.com>
   Date: Sat, 30 Jan 2010 14:07:16 +0000
   
   On Saturday 30 January 2010 13:54:14 Jeff Mincy wrote:
   > Retrain the message correctly in Bayes.  Bayes will catch on to this
   > after a few times.  The subject alone should be a strong enough clue
   > for bayes (I get BAYES_80 on this partial sample), so it looks like
   > you are doing only autolearn and not correcting messages that were
   > learned incorrectly.
   > -jeff
   
I couldn't figure out how to get an unadulterated version of the
message from the spamalyser.com link you posted in a previous message.
I tried this
 wget -O - -q http://spamalyser.com/v/5cbffujq/original.txt
pastebin has a simple way to download the original.
Anyway, I eventually got something.

   Hmm, well, I just started with SA, so my filters aren't much trained yet. 
   The thing is, I didn't believe its the Bayes filter to be used for that case! 

Bayes is an incredible tool, but only if you let it.  The worst thing
you can do to bayes is mistrain it by learning spam messages as ham.
The other bad thing is to limit the number of messages that it learns from.

   Because I still think, that its not correct to train SA filter on that letter 
   as spam! It can contain words, which simply should not contribute to be more 
   "spam", no? Thats not a problem?

No, that is not a problem.
Yes, spam contains words, and some of those words will also occur in ham.
Bayes will figure out which words are spammy and which are hammy and
which occur in both.
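
A toy Python illustration of that point. This is not SpamAssassin's
Bayes implementation, which has its own tokenizer and its own way of
combining probabilities; the sketch only counts, per token, how many
learned spam and ham messages contained it. All names are hypothetical.

```python
from collections import Counter


class ToyBayes:
    """Per-token message counts for spam and ham; a deliberately simplified model."""

    def __init__(self) -> None:
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def learn(self, text: str, is_spam: bool) -> None:
        """Record each distinct whitespace-separated token once per message."""
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_messages += 1

    def spamminess(self, token: str) -> float:
        """Estimate in [0, 1]; 0.5 means the token carries no signal."""
        spam_rate = self.spam_tokens[token] / max(self.spam_messages, 1)
        ham_rate = self.ham_tokens[token] / max(self.ham_messages, 1)
        if spam_rate + ham_rate == 0:
            return 0.5  # token never seen in either class
        return spam_rate / (spam_rate + ham_rate)
```

After learning "cheap tamiflu overdose the" as spam and "the thermometer
shows an overdose" as ham, "tamiflu" scores 1.0, "thermometer" scores 0.0,
and "overdose" and "the" score 0.5 because they occur in both.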

First start with training Bayes and then check if DCC and network
tests are enabled.

Anyway, I get the following.   
   BAYES_99,DCC_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_FIVETEN_SPAM,RCVD_IN_NIXSPAM,RCVD_IN_UCEPROTECT1,RCVD_IN_UCEPROTECT2,RCVD_IN_UCEPROTECT3,BOTNET,BOTNET_BADDNS

Botnet/FIVETEN/NIXSPAM/UCEPROTECT are additional rules added.

-jeff

Re: How should this tricky spam be filtered?

Posted by Kārlis Repsons <ka...@gmail.com>.
On Saturday 30 January 2010 13:54:14 Jeff Mincy wrote:
> Retrain the message correctly in Bayes.  Bayes will catch on to this
> after a few times.  The subject alone should be a strong enough clue
> for bayes (I get BAYES_80 on this partial sample), so it looks like
> you are doing only autolearn and not correcting messages that were
> learned incorrectly.
> 
> -jeff

Hmm, well, I just started with SA, so my filters aren't trained much yet. The
thing is, I didn't think the Bayes filter was the right tool for this case,
because I still think it isn't correct to train the SA filter on that letter
as spam! It contains words that simply should not count as more
"spam", no? Isn't that a problem?

Re: How should this tricky spam be filtered?

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From: Kārlis Repsons <ka...@gmail.com>
   Date: Sat, 30 Jan 2010 13:35:26 +0000
   
   People,
   perhaps its simple to be done, but I personally would like to know the ways to 
   get rid of something like this:

Use pastebin and save the entire message including the headers instead
of forwarding messages like this.

   ----------  Forwarded Message  ----------
   ...
   -------------------------------------------------------
   
   Obviously, the only useful part of all that was the From: name field.

   SA gives just "X-Spam-Status: No, score=-0.7 required=4.0 tests=BAYES_20 
   autolearn=ham version=3.2.5-gr2".
   
   Hopefully a valid question here...

Retrain the message correctly in Bayes.  Bayes will catch on to this
after a few times.  The subject alone should be a strong enough clue
for bayes (I get BAYES_80 on this partial sample), so it looks like
you are doing only autolearn and not correcting messages that were
learned incorrectly.

-jeff

Re: How should this tricky spam be filtered?

Posted by Kārlis Repsons <ka...@gmail.com>.
On Saturday 30 January 2010 13:51:18 Mike Cardwell wrote:
> By forwarding the email the way you have, your email client has stripped
> out most of the useful header information. Try pasting the message
> including the full set of headers into http://spamalyser.com/ or
> http://pastebin.com/ or similar and then come back here with a link to it.

If it's useful, here...

http://spamalyser.com/v/5cbffujq/mime

Re: How should this tricky spam be filtered?

Posted by Mike Cardwell <sp...@lists.grepular.com>.
On 30/01/2010 13:35, Kārlis Repsons wrote:

> People,
> perhaps its simple to be done, but I personally would like to know the ways to 
> get rid of something like this:
> 
> ----------  Forwarded Message  ----------
> 
> Subject: marty rizin g  suppe r  socio logy  mason ing
> Date: Friday 29 January 2010
> From: "Cheap Tamiflu on www.ra97.com" <ha...@icasal.com>
> To: repsons@gmail.com
> 
> conju nctiv a  remod eled  tsing hai  andro stero ne  ropie r  suici des
>  ulste r  ratfi nk  cleri cal  shado wgrap h  plain sman  human ity
> griso ns  goosa nder  snipp ed  unhon ourab le  mappa ble  malap rop
> idoli zed  tosca na  commi t  garda  speci alism  compe er  duple ix
> conso rting  rehab ilita te  berat es  megaw ords  confu siona l  seams
> ter  therm omete r  overd ose  withh olds  growl y  manwa rds  berat es
> neolo gizes  oblig er  confu siona l  strok ing  signo ri  bogie  mason
> ing  naira  disfo rest  appar entne ss  foras much  organ ized  larce
> nist  tips  offic iatin g  beeve s  liqui dized  homoe opath  heads
> quare  bagga gemen  trico rn  ropie r  exqui sitel y  trico rn  churc
> hgoer  retal iated  decea ses  desmo ulins  outbu rsts  purve yed  mappa
> ble  wrack ing  docke d  hydro logy
> 
> -------------------------------------------------------
> 
> Obviously, the only useful part of all that was the From: name field.
> 
> SA gives just "X-Spam-Status: No, score=-0.7 required=4.0 tests=BAYES_20 
> autolearn=ham version=3.2.5-gr2".
> 
> Hopefully a valid question here...

By forwarding the email the way you have, your email client has stripped
out most of the useful header information. Try pasting the message
including the full set of headers into http://spamalyser.com/ or
http://pastebin.com/ or similar and then come back here with a link to it.

-- 
Mike Cardwell    : UK based IT Consultant, Perl developer, Linux admin
Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
Technical Blog   : Tech Blog  - https://secure.grepular.com/
Spamalyser       : Spam Tool  - http://spamalyser.com/

Re: How should this tricky spam be filtered?

Posted by John Hardin <jh...@impsec.org>.
On Tue, 9 Feb 2010, John Hardin wrote:

> On Mon, 8 Feb 2010, Adam Katz wrote:
>
>>  Maybe it's just because I'm testing on the command line, but FROM_URI
>>  appears to only fire if there's a character in front of the "www."
>>  portion.
>
> It does. I'm explicitly targeting a quoted comment part. My rule is somewhat 
> tighter than yours in an attempt to minimize FPs, admittedly at the cost of 
> missing some spams.

Tweaked that a bit; it will now hit on www.{etc} at the beginning of an 
unquoted comment in the From: header.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Politicians never accuse you of "greed" for wanting other people's
   money, only for wanting to keep your own money.    -- Joseph Sobran
-----------------------------------------------------------------------
  3 days until Abraham Lincoln's and Charles Darwin's 201st Birthdays

Re: How should this tricky spam be filtered?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 8 Feb 2010, Adam Katz wrote:

> I wrote:
>>> My tests have been mildly successful on this note, with FROM_WWW
>>> already getting promoted out of testing:
>>> http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop
>>>
>>> This indicates that we don't actually need to parse any further
>>> because there is no sizable mass of legitimate mail that does
>>> this (and hopefully by getting this rule out the door, people
>>> considering it might decide against it).
>
> John Hardin wrote:
>> Concur.
>>
>> http://ruleqa.spamassassin.org/20100201-r905213-n/T_FROM_URI/detail?srcpath=jhardin
>
> To get them both on the same view:
> http://ruleqa.spamassassin.org/?rule=%2F^FROM_...%24
>
> Let's clear up the differences between FROM_URI and FROM_WWW ...

Good idea.

> Maybe it's just because I'm testing on the command line, but FROM_URI 
> appears to only fire if there's a character in front of the "www." 
> portion.

It does. I'm explicitly targeting a quoted comment part. My rule is 
somewhat tighter than yours in an attempt to minimize FPs, admittedly at 
the cost of missing some spams.

> It also appears to fire on
> "other.www.user@example.com <ot...@example.com>"

Hrm. I modified it to avoid @www, but not www.*@.* - mod for that is 
checked in. Perhaps we should let another day's masscheck run against 
it...

> Presumably, my rule's lack of a TLD check is the main reason it hits 
> more messages (ham and spam).

Likely true. I'd argue that the list of TLDs spammers use is fairly 
limited and having an explicit match is a good idea.
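
To illustrate the trade-off discussed here, a hypothetical Python sketch.
It is not the actual FROM_URI or FROM_WWW rule (neither rule's source
appears in this thread), and the TLD list is an assumption made up for
the example. It requires an explicit TLD and refuses to match when
"www." is preceded by "@", "." or a word character, or when the host
runs on into an "@".

```python
import re

# Assumed TLD list, for illustration only; a real rule would be tuned
# against masscheck results.
ASSUMED_TLDS = ("com", "net", "org", "info", "biz")
TLD_GROUP = "|".join(ASSUMED_TLDS)

FROM_WWW_TLD = re.compile(
    # Not inside an address or a longer host name, then "www." plus one or
    # more labels, an explicit TLD, and no continuation into "user@domain".
    r'(?<![\w.@])www\.(?:[a-z0-9-]+\.)+(?:' + TLD_GROUP + r')\b(?![^\s"]*@)',
    re.IGNORECASE,
)


def name_has_www_host(display_name: str) -> bool:
    """Return True when the display name contains a www.<host>.<tld> string."""
    return FROM_WWW_TLD.search(display_name) is not None
```

Under these assumptions "Cheap Tamiflu on www.ra97.com" hits, while
"other.www.user@example.com" and "user@www.example.com" do not.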

> We should decide upon one (with or without revisions) and push it out
> the door.  We've seen a few threads here on the list and I've seen
> several inquiries on the IRC channel about this, so I suspect the
> masscheck corpora just aren't getting blasted by it as much as others.

Agreed.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Efficiency can magnify good, but it magnifies evil just as well.
   So, we should not be surprised to find that modern electronic
   communication magnifies stupidity as *efficiently* as it magnifies
   intelligence.                                   -- Robert A. Matern
-----------------------------------------------------------------------
  3 days until Abraham Lincoln's and Charles Darwin's 201st Birthdays

Re: How should this tricky spam be filtered?

Posted by Adam Katz <an...@khopis.com>.
I wrote:
>> My tests have been mildly successful on this note, with FROM_WWW 
>> already getting promoted out of testing: 
>> http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop
>> 
>> This indicates that we don't actually need to parse any further 
>> because there is no sizable mass of legitimate mail that does
>> this (and hopefully by getting this rule out the door, people
>> considering it might decide against it).

John Hardin wrote:
> Concur.
> 
> http://ruleqa.spamassassin.org/20100201-r905213-n/T_FROM_URI/detail?srcpath=jhardin

To get them both on the same view:
http://ruleqa.spamassassin.org/?rule=%2F^FROM_...%24

Let's clear up the differences between FROM_URI and FROM_WWW ...

Maybe it's just because I'm testing on the command line, but FROM_URI
appears to only fire if there's a character in front of the "www."
portion.  It also appears to fire on
"other.www.user@example.com <ot...@example.com>"  Presumably,
my rule's lack of a TLD check is the main reason it hits more messages
(ham and spam).

We should decide upon one (with or without revisions) and push it out
the door.  We've seen a few threads here on the list and I've seen
several inquiries on the IRC channel about this, so I suspect the
masscheck corpora just aren't getting blasted by it as much as others.
 (Also, I wrote the rule independently after seeing the thing in my
own spam bucket, which is how I was able to respond so quickly to the
first thread here.)

Re: How should this tricky spam be filtered?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 1 Feb 2010, Adam Katz wrote:

> Martin Gregorie wrote:
>> Apparently putting the spam's payload in the "personal name" part
>> of the From: header is as old a trick as putting it in the Subject:
>> header though I hadn't seen it used until recently.
>>
>> There was a recent suggestion that 'personal name' text from the
>> From: header should be included in the text examined by 'body'
>> rules, which already includes the Subject: text. This sounds like a
>> good thing to do.
>
> My tests have been mildly successful on this note, with FROM_WWW
> already getting promoted out of testing:
> http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop
>
> This indicates that we don't actually need to parse any further
> because there is no sizable mass of legitimate mail that does this
> (and hopefully by getting this rule out the door, people considering
> it might decide against it).

Concur.

http://ruleqa.spamassassin.org/20100201-r905213-n/T_FROM_URI/detail?srcpath=jhardin

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   I'm seriously considering getting one of those bright-orange prison
   overalls and stencilling PASSENGER on the back. Along with the paper
   slippers, I ought to be able to walk right through security.
                                              -- Brian Kantor in a.s.r
-----------------------------------------------------------------------
  Today: the 7th anniversary of the loss of STS-107 Columbia

Re: How should this tricky spam be filtered?

Posted by RW <rw...@googlemail.com>.
On Mon, 01 Feb 2010 12:09:24 -0500
Adam Katz <an...@khopis.com> wrote:

> Martin Gregorie wrote:

> > There was a recent suggestion that 'personal name' text from the
> > From: header should be included in the text examined by 'body'
> > rules, which already includes the Subject: text. This sounds like a
> > good thing to do.
> 
> My tests have been mildly successful on this note, with FROM_WWW
> already getting promoted out of testing:
> http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop
> 
> This indicates that we don't actually need to parse any further
> because there is no sizable mass of legitimate mail that does this
> (and hopefully by getting this rule out the door, people considering
> it might decide against it).

When I suggested making changes, it wasn't specifically to do with
URLs. Clearly a URL has no place in a From header, so it's a near-sure
spam indicator.

The real issue is the textual content - particularly obfuscated words -
tests like SUBJECT_FUZZY_VPILL should also run against the from name
IMO. A good spam has to let the reader know roughly what it's selling
before it's deleted from the message list. A single word is enough to
achieve that.

The situation with Bayes is worse. AFAIK the subject is tokenized via
the body, so I was expecting that "From" tokens might be incompatible with
body/subject tokens; but when I tested this, I found that the from
name is not tokenized at all [bug 6319].

Re: How should this tricky spam be filtered?

Posted by Mike Cardwell <sp...@lists.grepular.com>.
On 08/02/2010 16:56, Joseph Brennan wrote:

> Here's some more data for whatever it's worth.
>
> Our spam reports box since Jan 25 shows this style in definite spam:
>
> From: "Get Cialis on www.wa93.com" <he...@imagina.es>
> From: "Get Tamiflu on www.qa35.com" <in...@quantumtouch.nl>
> From: "Cheap Tamiflu on www.nu36.com" <ac...@detweedekeer.nl>
>
>
>
> This style was in a message in Portuguese (?) that I can't read, but
> it was reported as spam:
>
> From: "www.vicentecorretor.com" <vi...@bol.com.br>
>
>
>
> This style was in a newsletter that appears to be legitimate although
> it was reported as spam:
>
> From: "News and alerts from www.MindFreedom.org"
> <mi...@intenex.net>
>
>
>
> Outlook might send mail where it creates a dummy personal name out of
> the address, e.g.
>
> From: 'user@www.example.com' <us...@www.example.com>
>
> While this is routine in To and Cc fields, I do not have a real
> example of it in a From field, so I can't be sure it happens.

Space followed by "www." ?

header WWW_IN_FROM From =~ / www\./
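
A quick Python check of that pattern against the From values Joseph
listed, copied as they appear in the archive with their elided
addresses. Python's `re.search` stands in for the SpamAssassin header
match here, which is an approximation, and the helper name is
hypothetical.

```python
import re

# Same pattern as the WWW_IN_FROM rule: a space followed by "www."
WWW_IN_FROM = re.compile(r" www\.")


def www_in_from(from_value: str) -> bool:
    """Return True when the From value contains a space followed by 'www.'."""
    return WWW_IN_FROM.search(from_value) is not None


SAMPLES = [
    '"Get Cialis on www.wa93.com" <he...@imagina.es>',
    '"www.vicentecorretor.com" <vi...@bol.com.br>',
    '"News and alerts from www.MindFreedom.org" <mi...@intenex.net>',
    "'user@www.example.com' <us...@www.example.com>",
]

for sample in SAMPLES:
    print(www_in_from(sample), sample)
```

The pattern hits the "Get Cialis on ..." style and the MindFreedom
newsletter, and misses the bare "www.vicentecorretor.com" name and the
Outlook-style address, because in those two "www." is not preceded by a
space.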

-- 
Mike Cardwell    : UK based IT Consultant, Perl developer, Linux admin
Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
Technical Blog   : Tech Blog  - https://secure.grepular.com/
Spamalyser       : Spam Tool  - http://spamalyser.com/

Re: How should this tricky spam be filtered?

Posted by Joseph Brennan <br...@columbia.edu>.
Here's some more data for whatever it's worth.



Our spam reports box since Jan 25 shows this style in definite spam:

From: "Get Cialis on www.wa93.com" <he...@imagina.es>
From: "Get Tamiflu on www.qa35.com" <in...@quantumtouch.nl>
From: "Cheap Tamiflu on www.nu36.com" <ac...@detweedekeer.nl>



This style was in a message in Portuguese (?) that I can't read, but
it was reported as spam:

From: "www.vicentecorretor.com" <vi...@bol.com.br>



This style was in a newsletter that appears to be legitimate although
it was reported as spam:

From: "News and alerts from www.MindFreedom.org" 
<mi...@intenex.net>



Outlook might send mail where it creates a dummy personal name out of
the address, e.g.

From: 'user@www.example.com' <us...@www.example.com>

While this is routine in To and Cc fields, I do not have a real
example of it in a From field, so I can't be sure it happens.



Joseph Brennan
Columbia University Information Technology


Re: How should this tricky spam be filtered?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Mon, 2010-02-01 at 12:09 -0500, Adam Katz wrote:

> It might be nice to have the URI rule check From, Reply-to, and
> Subject.  We'd have to be careful so as to not include /all/ headers
> as many different mailing lists use various headers for subscription
> management and PGP systems often use headers for pubkey locations, and
> I'm sure there's other stuff out there too.
>
I've raised an enhancement request bug (6317) suggesting that it's only
necessary to deal with the 'personal name' part of the From: header.
That's 'personal name' as in 

	From: personal name <us...@example.com>

since Subject can already be searched with body rules. It seems to me
that subverting headers other than From: and Subject: doesn't really
gain a spammer much since you can't guarantee that any other headers
with free text in their value string can be seen by the recipient,
particularly if their MUA has its default configuration.

I'd like to be able to scan From: headers with body rules as well as uri
rules, because then one medical product rule can deal with the product
reference regardless of whether it's in the message body, subject or
sender name.
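
A rough Python sketch of the idea, outside SpamAssassin: pull the
personal name out of the From: header and search it together with the
subject and body, so one pattern covers all three. The product pattern,
the function names and the example.com address used to try it are
hypothetical; `email.utils.parseaddr` is the standard-library parser.

```python
import re
from email.utils import parseaddr

# Hypothetical product pattern, standing in for a single "medical product" rule.
PRODUCT = re.compile(r"\b(?:tamiflu|cialis)\b", re.IGNORECASE)


def searchable_text(from_header: str, subject: str, body: str) -> str:
    """Join the From: personal name, the Subject: text and the body."""
    personal_name, _address = parseaddr(from_header)
    return "\n".join([personal_name, subject, body])


def product_rule_hits(from_header: str, subject: str, body: str) -> bool:
    """Apply the one product pattern to name, subject and body together."""
    return PRODUCT.search(searchable_text(from_header, subject, body)) is not None
```

For a message like the one that started this thread, the product name
appears only in the personal name, so the rule hits only because the
name is included in the searched text.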

I've only raised this bug as a reminder, so feel free to cancel it if
it doesn't add any value or the implementation and/or run-time costs
are too high.
  

Martin



Re: How should this tricky spam be filtered?

Posted by Adam Katz <an...@khopis.com>.
Martin Gregorie wrote:
> Apparently putting the spam's payload in the "personal name" part
> of the From: header is as old a trick as putting it in the Subject:
> header though I hadn't seen it used until recently.
> 
> There was a recent suggestion that 'personal name' text from the
> From: header should be included in the text examined by 'body'
> rules, which already includes the Subject: text. This sounds like a
> good thing to do.

My tests have been mildly successful on this note, with FROM_WWW
already getting promoted out of testing:
http://ruleqa.spamassassin.org/?rule=/FROM_W&srcpath=khop

This indicates that we don't actually need to parse any further
because there is no sizable mass of legitimate mail that does this
(and hopefully by getting this rule out the door, people considering
it might decide against it).

Developers note:  I'm probably going to merge those two rules since
while FROM_WEBSITE sometimes flips and has a sub-.500 S/O, its ham% in
even those instances is always negligible.

This rule is particularly exciting because most of its hits are
low-scoring; 21.37% of spam is 5 and under, 68.39% is 8 and under.
This reflects a feature that (afaik) the genetic algorithm doesn't
specifically breed for and that is somewhat rare.

> Is it already in the developer's to-do list or should somebody
> (me?) raise a bug requesting it?

It might be nice to have the URI rule check From, Reply-to, and
Subject.  We'd have to be careful so as to not include /all/ headers
as many different mailing lists use various headers for subscription
management and PGP systems often use headers for pubkey locations, and
I'm sure there's other stuff out there too.

Re: How should this tricky spam be filtered?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Sat, 2010-01-30 at 13:35 +0000, Kārlis Repsons wrote:
> People,
> perhaps its simple to be done, but I personally would like to know the ways to 
> get rid of something like this:
> 
Apparently putting the spam's payload in the "personal name" part of the
From: header is as old a trick as putting it in the Subject: header
though I hadn't seen it used until recently.

There was a recent suggestion that 'personal name' text from the From:
header should be included in the text examined by 'body' rules, which
already includes the Subject: text. This sounds like a good thing to
do. 

Is it already on the developers' to-do list, or should somebody (me?)
raise a bug requesting it?

Martin